Background & Summary

Grassland caterpillars (Lepidoptera: Lymantriinae: Gynaephora) are a small group of 15 recorded species worldwide, distributed across the arctic tundra and high-altitude areas of the northern hemisphere1. In China, all eight named Gynaephora species inhabit the Qinghai-Tibetan Plateau (QTP) at altitudes of 2900 to 5000 m above sea level (masl), a range that covers most of the plateau2. Grassland caterpillars are the most damaging insect pests of the alpine meadows of the QTP. They not only devour forage vegetation, leading to serious feed shortages and grassland degradation, but also cause canker of the oral mucous membrane in domestic and wild animals1.

The grassland caterpillars are well adapted to the harsh high-altitude environments of the QTP. Morphologically, the larvae are covered with dense black body hair, which can help them resist high UV radiation and regulate body temperature3. Physiologically, female adults, unlike males, do not develop wings or antennae, showing pronounced sexual dimorphism. Allocating energy to reproduction rather than to metamorphosis is thought to increase fitness4. Genetically, several genes associated with the response to hypoxia, energy metabolism and DNA repair are positively selected in G. menyuanensis and G. alpherakii compared to non-QTP insects5. Sequence variations and expression pattern changes of mitochondrial genes are also associated with adaptation to different high-elevation environments6. The QTP Gynaephora species are proposed to derive from a common ancestor, with genetic differentiation and speciation occurring in association with QTP uplift and climate change2,7. Because grassland caterpillars are distributed at divergent and specific elevations and have high levels of genetic diversity, they are promising models for studying adaptive evolution in high-altitude insects8. However, only limited genetic information is currently available for the grassland caterpillars5,7,9.

Here, we present a genome assembly of the grassland caterpillar G. qinghaiensis, which is widely distributed in Tibet, Qinghai Province, Sichuan Province, and Gansu Province of China at 3000 to 4000 masl2. The whole genome was sequenced with Oxford Nanopore long-read and BGI short-read sequencing technologies. The genome assembly is 861.04 Mb in size, with both high completeness (BUSCO score: 99.56%) and high continuity (contig N50: 18.65 Mb). This reference genome provides a genetic resource for uncovering the mechanisms by which Gynaephora species adapt to high-altitude environments, and could contribute to the development of integrated pest management strategies.

Methods

Genome sequencing and assembly

Specimens of the grassland caterpillar G. qinghaiensis were collected in Yushu County (33°45′N/95°48′E), Qinghai Province, China. Total genomic DNA was extracted from a male adult for genome sequencing using a Qiagen DNA purification kit (Qiagen, USA). The genome was sequenced using a combination of short-read and long-read strategies. The short-read sequencing library was constructed using a TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA) and sequenced on the MGISEQ-2000 platform (BGI, China). After quality filtering with fastp v0.20.010, 28.17 Gb of clean short reads were obtained. For long-read sequencing, the library was generated using the DNA Ligation Sequencing Kit (SQK-LSK109) (Oxford Nanopore Technologies, England) and sequenced on the PromethION platform (Oxford Nanopore Technologies, England). A total of 129.06 Gb of Nanopore reads were obtained after quality filtering with Guppy v3.2.2 + 9fe0a7811, corresponding to approximately 149.9-fold coverage of the G. qinghaiensis genome.
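The coverage figure can be checked with simple arithmetic. The sketch below is only an illustration: it assumes that coverage means total read bases divided by genome size, and that the final assembly size of 861.04 Mb was the denominator, which the text does not state explicitly. The function name `fold_coverage` is hypothetical.

```python
def fold_coverage(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return sequencing depth as total read bases divided by genome size."""
    total_bases = total_bases_gb * 1e9   # Gb -> bases
    genome_size = genome_size_mb * 1e6   # Mb -> bases
    return total_bases / genome_size


# Read yields are taken from the text; the 861.04 Mb denominator is an assumption.
nanopore_cov = fold_coverage(129.06, 861.04)
short_read_cov = fold_coverage(28.17, 861.04)
print(f"Nanopore: {nanopore_cov:.1f}x, short reads: {short_read_cov:.1f}x")
```

Under this assumption the Nanopore depth rounds to the reported 149.9-fold. Dividing by the k-mer-based size estimate of 844.04 Mb instead would give roughly 152.9-fold.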

The genome size and heterozygosity rate of G. qinghaiensis were inferred with kmerFreq12 from the BGI short reads, based on the 17-mer frequency distribution. The estimated genome size was 844.04 Mb and the heterozygosity was 1.3%. For de novo assembly, the genome was initially assembled with NextDenovo v2.3.113 using the Nanopore long reads. The preliminary assembly was then polished with NextPolish v1.3.014 using both long reads (three iterations) and short reads (four iterations). Redundant contigs were eliminated using Purge Haplotigs15. The final assembly is 861.04 Mb in size and comprises 107 contigs, with a contig N50 of 18.65 Mb and a GC content of 35.23% (Table 1).
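For readers unfamiliar with the contig N50 statistic reported above, the following minimal sketch shows the standard definition: the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with made-up contig lengths (not the real assembly).
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```

In the toy example the total is 30, and the two longest contigs (10 and 8) together pass the halfway mark of 15, so the N50 is 8.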

Table 1 Statistics of Gynaephora qinghaiensis genome assembly.

Transcriptome sampling and sequencing

For transcriptome sequencing, eggs, first instar larvae, late instar larvae, female pupae, and female adults of G. qinghaiensis were collected separately, with three replicates each. Total RNA was extracted using TRIzol® Reagent (Thermo Fisher, Shanghai, China) in accordance with the manufacturer’s protocol. RNA-Seq libraries were prepared using the TruSeq RNA Sample Prep Kit (Illumina, USA) and sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA). Reads containing adaptors and low-quality reads were filtered out using fastp v0.20.010.

Genome prediction and functional annotation

Repeat elements in the genome were identified with a combination of de novo and homology-based approaches. De novo predictions of long terminal repeat (LTR) elements and non-LTR repeat sequences were performed using LTR_Finder16 and RepeatModeler v2.0.117, respectively. The results from the two programs were combined with the RepBase library v2017012718 and the Dfam v3.119 database to form the library used by RepeatMasker v4.0.720 to identify and classify repeat elements in the genome. Repetitive elements account for 67.4% of the genome sequence (Table 2). Among them, long interspersed nuclear elements (LINEs) (22.37%), rolling-circle elements (9.95%) and LTRs (8.21%) are the three most abundant repeat types.

Table 2 Annotation of repeat elements in the Gynaephora qinghaiensis genome.

Protein-coding gene prediction was performed using a combination of ab initio, homology-based, and transcriptome-based approaches. For ab initio prediction, Braker2 v2.1.221 was used. For homology-based prediction, we downloaded the protein sequences of nine closely related species: Arctia plantaginis (assembly accession: GCA_902825455.1), Bombyx mori (GCF_014905235.1), Helicoverpa armigera (GCF_023701775.1), Manduca sexta (GCF_014839805.1), Spodoptera litura (GCF_002706865.2) and Trichoplusia ni (GCF_003590095.1) from the NCBI database, and Euproctis similis (GCA_905147225.2), Lymantria monacha (GCA_905163515.2) and Orgyia antiqua (GCA_916999025.1) from the Darwin Tree of Life Data Portal (https://portal.darwintreeoflife.org/). The protein sequences were matched to the G. qinghaiensis genome using tblastn v2.13.022 with an E-value cutoff of 1e-5, and the matched proteins were then mapped against the homologous genomic sequences using Exonerate v2.2.023 and GenomeThreader v1.7.124. For transcriptome-based prediction, RNA-Seq data were aligned to the genome assembly using HISAT2 v2.2.125 and gene structures were predicted using StringTie v2.1.726. The non-redundant reference gene set was generated by integrating the genes predicted by the three methods using EVidenceModeler v1.1.127. In total, 16,618 protein-coding genes were predicted (Table 3). To infer the functions of the protein-coding genes in G. qinghaiensis, we annotated the official gene set against the NCBI non-redundant protein database (Nr, https://ftp.ncbi.nlm.nih.gov/blast/db/) and the Swiss-Prot database (https://www.uniprot.org/) using blastp v2.13.022 with an E-value cutoff of 1e-5. In addition, protein domains were predicted using HMMER v3.3.228. Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) annotations were assigned using BlastKOALA29 (https://www.kegg.jp/blastkoala/) and PANNZER230 (http://ekhidna2.biocenter.helsinki.fi/sanspanz/), respectively. Overall, 15,649 genes (94.17%) were successfully annotated by at least one public biological function database.
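The E-value filtering step can be sketched as follows. This is an illustration only, not the authors' script: it assumes the BLAST searches were written in the standard 12-column tabular format (`-outfmt 6`), in which the E-value and bit score are the 11th and 12th columns, and it assumes that the best hit per query is chosen by bit score. The function name `best_hits` and all gene and protein identifiers are hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep, for each query, the highest-bitscore hit that passes the E-value cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit fails the 1e-5 cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene_A has two passing hits, gene_B has only a weak one.
rows = [
    ["gene_A", "prot_1", "90.0", "300", "30", "0", "1", "300", "1", "300", "1e-50", "400"],
    ["gene_A", "prot_2", "60.0", "280", "112", "2", "5", "284", "3", "282", "1e-20", "150"],
    ["gene_B", "prot_3", "30.0", "50", "35", "1", "1", "50", "1", "50", "0.5", "25"],
]
hits = best_hits("\n".join("\t".join(r) for r in rows))
print(sorted(hits))

# The annotated fraction reported in the text follows from the gene counts.
print(f"{100 * 15649 / 16618:.2f}%")
```

The last line reproduces the 94.17% annotation rate from the reported counts of 15,649 annotated genes out of 16,618 predicted genes.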

Table 3 Functional annotation of Gynaephora qinghaiensis proteins.

Phylogenetic analysis

For comparative analysis between genomes, the longest protein of each gene locus was retained, and orthologous genes across G. qinghaiensis and 13 other lepidopteran insects were identified with OrthoFinder v2.5.431. The 13 lepidopteran insects comprised Amyelois transitella (GCF_001186105.1), Danaus plexippus (GCF_009731565.1), Operophtera brumata (http://v2.insect-genome.com/Organism/590), Plutella xylostella (GCF_905116875.1) and the nine species mentioned above, whose protein sets were used for homology-based prediction. In total, 15,090 orthogroups (OGs) were identified, of which 8,807 were present in all 14 lepidopteran insects and 2,741 were single-copy OGs. The 2,741 single-copy OGs were used for phylogenetic tree construction. All protein sequences were aligned with MAFFT v7.123b32 and poorly aligned regions were removed with trimAl v1.4.rev2233. The alignments were concatenated into a supergene sequence and the phylogenetic tree was constructed using the maximum likelihood (ML) method in IQ-TREE v2.1.4-beta34 with 1000 ultrafast bootstrap replicates. The best-fit substitution model, Q.insect + F + R6, was determined by ModelFinder35 as implemented in IQ-TREE, and the diamondback moth P. xylostella was used as the outgroup36. The phylogeny of the 14 lepidopteran insects shows that G. qinghaiensis is a sister taxon to O. antiqua (rusty tussock moth), forming a lineage together with three other Erebidae insects: the yellow-tail moth E. similis, the black-arched tussock moth L. monacha, and the wood tiger moth A. plantaginis (Fig. 1). Divergence times between species were estimated using r8s v1.8137. Four calibration points were applied following a previous study: 67–88.6 million years ago (Mya) for Noctuoidea, 70.1–89.9 Mya for Bombycoidea, 105.6–132.1 Mya for Apoditrysia and 154.7 Mya for Ditrysia36. G. qinghaiensis and O. antiqua were estimated to have diverged approximately 18.3 Mya.
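The concatenation of trimmed single-copy alignments into a supergene sequence can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: each alignment is represented as a dictionary mapping a species name to its aligned sequence, every species appears exactly once in every alignment (as is the case for single-copy orthogroups present in all species), and the function name `concatenate_alignments` and the toy sequences are hypothetical.

```python
def concatenate_alignments(alignments):
    """Join per-orthogroup alignments into one supermatrix.

    Returns the concatenated sequence per species and the 1-based, inclusive
    column range that each input alignment occupies in the supermatrix.
    """
    if not alignments:
        raise ValueError("no alignments given")
    species = sorted(alignments[0])
    pieces = {name: [] for name in species}
    partitions = []
    start = 1
    for aln in alignments:
        if set(aln) != set(species):
            raise ValueError("every alignment must contain the same species")
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        width = widths.pop()
        for name in species:
            pieces[name].append(aln[name])
        partitions.append((start, start + width - 1))
        start += width
    supermatrix = {name: "".join(parts) for name, parts in pieces.items()}
    return supermatrix, partitions


# Toy alignments for two orthogroups and three species (made-up sequences).
og1 = {"sp_a": "MKT-", "sp_b": "MKTA", "sp_c": "MRTA"}
og2 = {"sp_a": "GL", "sp_b": "GL", "sp_c": "-L"}
matrix, parts = concatenate_alignments([og1, og2])
print(matrix["sp_a"], parts)
```

Keeping the partition boundaries is a design choice that allows a partitioned model to be fitted later; the text reports a single best-fit model for the concatenated alignment.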

Fig. 1

Comparative genomic analysis of the Gynaephora qinghaiensis genome. On the left is the maximum likelihood phylogenetic tree, showing gene family expansion and contraction in G. qinghaiensis compared with the other 13 lepidopteran species. The green and red numbers indicate the numbers of expanded and contracted gene families, respectively. Branch lengths represent divergence times. All nodes received 100% bootstrap support, except for the nodes in blue, whose bootstrap support value is 90%. On the right are the gene counts for different types of orthologous groups in the genomes. “Single copy” indicates universal one-to-one orthologs present in all species; “Universal” indicates other universal genes; “Species specific” indicates species-specific genes with more than one copy; “Unassigned genes” indicates species-specific genes with only one copy in the genome; “Others” indicates the remaining genes.

Gene family expansion and contraction analysis

Gene family expansion and contraction in G. qinghaiensis were analyzed using CAFE v5.038. The gene families inferred by OrthoFinder and the divergence times estimated by r8s were used as inputs. The CAFE results suggested that 608 gene families were expanded and 535 were contracted in G. qinghaiensis (Fig. 1). In addition, 173 gene families (130 expanded and 43 contracted) were rapidly evolving (P-value < 0.01). The significantly expanded gene families are associated with various molecular functions and biological processes, and include zinc finger BED domain-containing protein 1 (ZBED1), core histones (histone H2A, H2B, H3 and H4), fatty acyl-CoA reductase (FAR) and copper chaperone for superoxide dismutase (CCS) (Fig. 2, Table S139). The contracted gene families include UDP-glucuronosyltransferase 2B13 (UGT2B13), membrane-bound alkaline phosphatase-like (mALP), zinc finger CCHC domain-containing protein 3 (ZCCHC3) and juvenile hormone esterase-like (JHE-like), among others. Variations in characterized gene families were illustrated with the superheat40 R package.
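As a rough illustration of what "expanded" and "contracted" mean here, the sketch below labels a gene family by comparing its gene count in a species with the count inferred for the parent node. This is a deliberate simplification and an assumption of this example: CAFE itself fits a birth-death likelihood model and assigns P-values, which this code does not attempt to reproduce. The function names, orthogroup identifiers and counts are all hypothetical.

```python
from collections import Counter


def classify_family(count_in_species, count_at_parent):
    """Label a family by the sign of the change from the parent node."""
    delta = count_in_species - count_at_parent
    if delta > 0:
        return "expanded"
    if delta < 0:
        return "contracted"
    return "unchanged"


def tally_changes(families):
    """Count labels over a mapping of family id -> (species count, parent count)."""
    return Counter(classify_family(s, p) for s, p in families.values())


# Made-up counts for three hypothetical orthogroups.
toy_families = {"OG_x": (5, 3), "OG_y": (1, 4), "OG_z": (2, 2)}
print(tally_changes(toy_families))

# Consistency check on the reported rapidly evolving families: 130 + 43 = 173.
print(130 + 43)
```

The final line simply confirms that the reported numbers of rapidly evolving expanded and contracted families sum to the stated total of 173.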

Fig. 2

Gene family evolution in the grassland caterpillar Gynaephora qinghaiensis. The 15 most significantly (a) expanded and (b) contracted characterized gene families of G. qinghaiensis are shown as heatmaps. The bars on the right depict the variations of gene families in G. qinghaiensis. ZBED1, zinc finger BED domain-containing protein 1; RECQ1, ATP-dependent DNA helicase Q1; CCS, copper chaperone for superoxide dismutase; TIGD6, tigger transposable element-derived protein 6; PRP19, pre-mRNA-processing factor 19; HARBI1, putative nuclease HARBI1; FAR wat, fatty acyl-CoA reductase wat; PGBD4, piggyBac transposable element-derived protein 4; SMC2, structural maintenance of chromosomes protein 2; UGT2B13, UDP-glucuronosyltransferase 2B13; ALP1-like, protein ALP1-like; GVQW3, protein GVQW3; RdDP, RNA-directed DNA polymerase; mALP, membrane-bound alkaline phosphatase-like; ZCCHC3, zinc finger CCHC domain-containing protein 3; JHE-like, juvenile hormone esterase-like; FAR CG5065, fatty acyl-CoA reductase CG5065; Rad54b, fibrinogen silencer-binding protein-like; SCCPDH, saccharopine dehydrogenase-like oxidoreductase; OR13a, odorant receptor 13a; GST1, glutathione S-transferase 1; PGRP-LB, peptidoglycan-recognition protein LB.

Data Records

Oxford Nanopore long-read (SRA accession: SRR2403211941) and BGI short-read (SRR2403212042) sequencing data for the G. qinghaiensis genome are available under NCBI BioProject PRJNA950575. Illumina transcriptomic data for eggs (SRR3103473443, SRR3103474044, SRR3103474145), first instar larvae (SRR3103473146, SRR3103473247, SRR3103473348), late instar larvae (SRR3103472849, SRR3103472950, SRR3103473051), female pupae (SRR3103472752, SRR3103473853, SRR3103473954), and female adults (SRR3103473555, SRR3103473656, SRR3103473757), each with three replicates, are available under NCBI BioProject PRJNA1174259. The genome assembly has been submitted to NCBI under accession number GCA_042920415.158. Gene CDS59, protein60, and genome annotation61 files are deposited in the Figshare database.

Technical Validation

DNA and RNA quality were assessed by 0.7% gel electrophoresis and with an Agilent 2100 Bioanalyzer (Agilent, USA), respectively. Only high-quality samples were used for library preparation and sequencing.

BUSCO62 v5.5.0 was used to evaluate the completeness of the genome assembly with the insecta_odb10 database. In total, 99.56% of BUSCO genes (99.27% complete, 0.29% fragmented) were detected in the assembly, with a low duplication rate (1.24%), indicating high completeness of the reference genome (Table 1).
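The relationship between the reported BUSCO percentages can be made explicit with a short calculation. The sketch assumes the usual BUSCO convention that the complete category is the sum of single-copy and duplicated genes, and that complete, fragmented and missing genes together account for 100%. The function name `busco_breakdown` is hypothetical, and the single-copy and missing values it derives are not stated in the text.

```python
def busco_breakdown(complete_pct, duplicated_pct, fragmented_pct):
    """Derive detected, single-copy and missing percentages from reported values."""
    detected = complete_pct + fragmented_pct      # complete + fragmented
    single_copy = complete_pct - duplicated_pct   # complete = single-copy + duplicated
    missing = 100.0 - detected                    # remainder not found in the assembly
    return {
        "detected": round(detected, 2),
        "single_copy": round(single_copy, 2),
        "missing": round(missing, 2),
    }


# Percentages taken from the text above.
print(busco_breakdown(complete_pct=99.27, duplicated_pct=1.24, fragmented_pct=0.29))
```

With the reported inputs, the detected fraction matches the 99.56% quoted above, and the derived single-copy and missing fractions come to 98.03% and 0.44% under the stated assumptions.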