Background & Summary

The hawthorn spider mite, Amphitetranychus viennensis (Arachnida, Acari, Acariformes, Trombidiformes, Tetranychidae), is one of the most destructive pests of host plants in the Rosaceae family and is distributed mainly in Europe and Asia1,2,3. A. viennensis feeds on plant juice from leaves and young buds, causing yellow spots, leaf curling, defoliation, and ultimately a decrease in the photosynthetic capacity of the plants2. Synthetic chemicals have been used extensively to control A. viennensis, which has led to the development of resistance to nearly all commercially available acaricides4,5,6,7. Pesticide resistance, combined with pesticide residues in both food products and the environment6,8, has prompted a search for alternative pest management strategies, including the emerging biotechnology of RNAi9,10,11,12,13,14. Genomic information is helpful for managing pesticide resistance and developing control strategies against agricultural pests, and the lack of genomic resources for A. viennensis is a limiting factor for such efforts. To improve the management of pesticide resistance and to facilitate the incorporation of RNAi into existing integrated pest management strategies, we sequenced the genome and transcriptomes of A. viennensis.

In this paper, we assembled a high-quality chromosome-level genome of A. viennensis by combining Illumina, PacBio, and Hi-C sequencing data. The genome assembly consisted of 243 contigs with a total length of 141.96 Mb and a contig N50 of 1.31 Mb. In addition, 97.27% of the draft assembly was anchored to 3 chromosomes, with a scaffold N50 of 45.83 Mb. A total of 13,968 protein-coding genes were predicted, of which 94.16% were functionally annotated. We also identified 16.70 Mb of DNA transposons, accounting for 11.78% of the genome assembly. The high-quality chromosome-level genome assembly of A. viennensis will provide a genetic basis for further research on this pest mite.

Methods and Results

Sample collection

Hawthorn spider mites were collected from a crabapple tree (Malus ‘Radiant’) in Tai’an City, Shandong Province, China (117.1194°E, 36.1964°N), and were subsequently reared on fresh peach, Prunus davidiana, in climate incubators at 26 ± 0.5°C, with a relative humidity of 50% and a photoperiod of 16 L: 8 D. The mites were reared for ten generations, and the eggs of the tenth generation were collected for sequencing following the procedures described for Tetranychus urticae15.

Library construction and genome sequencing

Genomic DNA for Illumina, PacBio and Hi-C sequencing was extracted using the CTAB method. Eggs were used as the source material to prevent contamination from other individuals and microorganisms. Total RNA was extracted from a mixed sample containing mites across developmental stages (eggs, larvae, adult females, and adult males) using the TRIzol reagent, followed by transcriptome library preparation and sequencing. The purity of the genomic DNA and RNA was checked with a NanoDrop 2000C spectrophotometer (Thermo, Wilmington, DE, USA), and their integrity was further assessed with a fragment analyzer and 1.5% agarose gel electrophoresis, respectively.

For next-generation sequencing, the high-quality genomic DNA was fragmented to a target size of 350 bp by ultrasonication, and these fragments were used to construct short-read sequencing libraries. After removing adapter sequences and low-quality reads, a total of 32.47 million clean reads (Table 1) were obtained for subsequent analyses. For PacBio sequencing, a PCR-free Single-Molecule Real Time (SMRT) library was constructed following the manufacturer’s standard instructions and subsequently sequenced on a PacBio Sequel II platform (Berry Genomics Company, Beijing, China)16. After quality control, 1,834,436 reads were obtained in total, with an average read length of 10.083 kb (Table 1). The Hi-C library was constructed following standard instructions and sequenced on the Illumina NovaSeq 6000 platform by Berry Genomics Company (Beijing, China), and 69.29 million 150-bp paired-end clean reads were obtained (Table 1). RNA-seq libraries for transcriptome sequencing were constructed by Biomarker Technologies (Beijing, China) and then sequenced on the Illumina HiSeq 2000 platform, yielding 10.32 Gb of 150-bp paired-end reads.
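As a rough cross-check of the PacBio figures above, the total long-read yield and a nominal sequencing depth can be estimated from the read count and the mean read length. The sketch below is illustrative only: the function names are hypothetical, and using the 141.96 Mb assembly size reported in this paper as the denominator is an assumption, not a step described by the authors.

```python
def total_yield_bp(read_count: int, mean_read_length_bp: float) -> float:
    """Approximate total bases sequenced = number of reads x mean read length."""
    return read_count * mean_read_length_bp


def nominal_depth(yield_bp: float, genome_size_bp: float) -> float:
    """Nominal depth = total bases sequenced / genome (or assembly) size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return yield_bp / genome_size_bp


# Values taken from the text: 1,834,436 reads, mean length 10.083 kb.
pacbio_yield = total_yield_bp(1_834_436, 10_083)

# Assumption: depth is expressed relative to the 141.96 Mb assembly.
pacbio_depth = nominal_depth(pacbio_yield, 141.96e6)

print(f"PacBio yield: {pacbio_yield / 1e9:.2f} Gb")
print(f"Nominal depth: {pacbio_depth:.0f}x")
```

Under these assumptions the long-read data amount to about 18.5 Gb, or roughly 130-fold coverage of the assembly.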

Table 1 Statistics of sequencing data of A. viennensis genome.

Genome survey and assembly

Before assembly, the main genome characteristics, including genome size, repetitive sequence content and heterozygosity, were estimated. K-mer (k = 19) frequencies were counted from the Illumina clean short reads using Jellyfish v1.1.11 (ref. 17). The estimated genome size of A. viennensis was 157.65 Mb, with a heterozygosity of 0.317% (Fig. 1A).

The draft genome was then assembled from PacBio SMRT raw reads using Falcon (length_cutoff = 8000, length_cutoff_pr = 12000)18,19. To further improve the quality and accuracy of the genome assembly, we polished it with long reads using Arrow (SMRT Link v5.0.1) and with short reads using Pilon v1.16, both with default parameters20,21. Redundant heterozygous contigs were then removed with Redundans v0.14a using default parameters22. As a result, we generated a 141.96 Mb genome assembly consisting of 243 contigs with a contig N50 of 1.31 Mb.

For chromosome-level assembly, 69,288,927 Hi-C clean reads were obtained after filtering with default parameters to remove reads containing ≥3 unidentified nucleotides (Ns), adapter contamination, or ≥20% low-quality bases (Phred score ≤ 5). To verify the absence of exogenous contamination, 10,000 randomly selected clean reads were aligned against the NCBI non-redundant nucleotide database (NT, 2023 release) using BLASTN v2.12.0 (E-value cutoff: 1e−5). The full set of Hi-C clean reads was then aligned to the draft genome assembly using Juicer v1.6.2 (refs. 23,24). Chromosomal scaffolding was performed with 3D-DNA v180922 using default settings, and manual corrections were made by inspecting Hi-C contact heatmaps in Juicebox v1.9.8 to resolve potential misassemblies25,26. Finally, 47 scaffolds were anchored to 3 chromosomes (Fig. 2A,B) with a scaffold N50 of 45.83 Mb, covering a span of 141.96 Mb and representing 97.27% of the draft genome assembly.
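The contig and scaffold N50 values reported above follow the standard definition: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. A minimal sketch of that calculation is given below; the function name and the toy lengths are illustrative and are not taken from the actual assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of positive sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to >= 50% of the
        # total length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be positive")


# Toy example (hypothetical contig lengths, not real data).
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy lengths the total is 150, so the N50 is the length of the contig at which the running sum first reaches 75.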

Fig. 1

Genome scope profiles of 19-mer analysis and GC content and depth distribution of A. viennensis genome. (A) Genome scope profiles of 19-mer analysis; (B) GC content and depth distribution of A. viennensis genome.

Fig. 2

Genome assembly of A. viennensis. (A) Heatmap of genome-wide Hi-C data. The heatmap shows all interactions among the 3 chromosomes; the interaction frequency is represented by color, ranging from yellow (low) to red (high). (B) Overview of the genomic landscape of A. viennensis. Blocks on the outermost circle represent the 3 chromosomes; from outer to inner circles, the peak plots in green, red, and purple represent gene distribution, repeat element distribution, and GC content, respectively, and the innermost links show inter-chromosomal collinearity.

Assembly completeness was assessed using BUSCO v3.0.2b (ref. 27). The results showed that 91.6% of BUSCO genes were detected as complete, of which 82.1% were single-copy and 9.5% were duplicated (Table 2).

Table 2 Completeness Assessment of A. viennensis genome assembly.

Genomic repeat annotation

Repeat elements were predicted using a combination of de novo and homology-based methods28,29,30. The detailed workflow was as follows: (1) MITE-Hunter v1.0 was used to perform de novo prediction of miniature inverted-repeat transposable elements (MITEs) in the A. viennensis genome assembly to construct a MITE library31. (2) LTRharvest and LTR Finder v1.07 were used for de novo detection of long terminal repeat (LTR) sequences, and LTR_retriever v2.9.0 was employed to integrate the results and build an LTR library32,33,34. (3) RepeatMasker v4.1.1 was used to identify conserved repetitive elements by performing a homology-based search against the RepBase database (release 20181026)35,36. (4) The known repeats identified in step (3) were merged with the MITE and LTR libraries to create a custom repeat library, which was then used with RepeatMasker v4.1.1 to mask repetitive regions in the genome. (5) Finally, RepeatModeler v2.0.2a was employed to identify additional repetitive sequences de novo in the genome after the initial masking37. Ultimately, we identified 47.81 Mb of interspersed repeats and 3.16 Mb of tandem repeats. Among the classified interspersed repeats, DNA transposons were the most abundant, with a total length of 16.70 Mb (Table 3).
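When hits from several repeat libraries are combined, as in the workflow above, overlapping hits would be double-counted if their lengths were simply summed. One common way to avoid this is to merge overlapping intervals first and then total the merged spans. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the function names and coordinates are hypothetical, and half-open 0-based intervals on a single sequence are assumed.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, on one sequence


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_length(intervals: Iterable[Interval]) -> int:
    """Total number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical repeat hits from different tools on the same scaffold.
hits = [(100, 200), (150, 300), (400, 450), (300, 320)]
print(merge_intervals(hits))
print(covered_length(hits))
```

Dividing the covered length by the assembly length then gives the proportion of the genome occupied by a repeat class.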

Table 3 Statistics of repeat elements of A. viennensis genome.

Gene prediction and functional annotation

Three approaches (de novo prediction, homology-based prediction and transcriptome-based prediction) were combined to predict genes in the A. viennensis genome after masking repeat sequences. De novo gene models were predicted with AUGUSTUS v3.5.0, SNAP (http://snap.stanford.edu/snap/download.html), GlimmerHMM v3.0.4 and GeneMark-ET v4.32 on the repeat-masked genome38,39,40,41. Homology-based gene prediction was conducted using GeMoMa v1.9 with the protein sequences of Tetranychus urticae (GCF_000239435.1), Ixodes scapularis (GCF_016920785.2), Galendromus occidentalis (GCF_000255335.2), and Sarcoptes scabiei (GCA_020844145.1) downloaded from GenBank, and exon and intron boundary information was obtained by comparing the transcripts with the genome42. For transcriptome-based gene prediction, HISAT2 v2.2.1 was used to align RNA-Seq reads to the genome sequence, and Cufflinks v2.2.1 was used to assemble the alignments into full-length transcripts43,44. PASA v2.5.2 was used to predict open reading frames from the assembled full-length transcripts45. EvidenceModeler (https://sourceforge.net/projects/evidencemodeler/) was used to integrate the above prediction results, and untranslated regions (UTRs) and alternative splicing variants were annotated with PASA v2.5.2 (ref. 46). As a result, 13,968 protein-coding genes with a mean coding sequence length of 1613 bp were identified in the A. viennensis genome (Table 4). All protein-coding genes were aligned to three integrated protein sequence databases: NR, SwissProt, and eggNOG. Protein domains were annotated with InterProScan v5.66-98.0, and the Gene Ontology (GO) terms for each gene were obtained from the corresponding eggNOG-mapper annotation entry47. The pathways in which the genes might be involved were assigned by BLAST v2.12.0 (ref. 48) searches against the KEGG database. The functional annotation results from the above methods were then merged. Finally, 94.16% (13,152/13,968) of protein-coding genes were annotated (Table 5). Furthermore, 8617 genes were assigned GO terms and 8158 genes were mapped to at least one KEGG pathway.
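Merging functional annotations across databases amounts to taking the union of the genes that have a hit in at least one database. A minimal sketch of that step is shown below, assuming each database's results have already been reduced to a set of gene identifiers; the gene IDs and the function name are invented for illustration and do not come from the actual annotation files.

```python
from typing import Dict, Set, Tuple


def annotation_summary(all_genes: Set[str],
                       hits_by_db: Dict[str, Set[str]]) -> Tuple[int, float]:
    """Return (number of annotated genes, percentage of all genes).

    A gene counts as annotated if it has a hit in at least one database.
    """
    if not all_genes:
        raise ValueError("all_genes must not be empty")
    annotated = set().union(*hits_by_db.values()) & all_genes
    return len(annotated), 100 * len(annotated) / len(all_genes)


# Hypothetical toy input (not real gene IDs).
genes = {"g1", "g2", "g3", "g4"}
hits = {
    "NR": {"g1", "g2"},
    "SwissProt": {"g2"},
    "eggNOG": {"g3"},
}
count, percent = annotation_summary(genes, hits)
print(count, percent)

# The percentage reported in the text follows from the same ratio.
reported_percent = round(100 * 13152 / 13968, 2)
print(reported_percent)
```

Applying the same ratio to the counts given in the text (13,152 of 13,968 genes) reproduces the reported 94.16%.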

Table 4 Gene prediction of A. viennensis genome.
Table 5 Statistics of functional annotation in A. viennensis genome.

For the annotation of non-coding RNA, tRNAs were predicted with tRNAscan-SE v2.0.11, and other types of ncRNA were annotated by searching the Rfam database with BLAST v2.12.0.

Data Records

The Illumina, PacBio and Hi-C sequencing data that were used for the genome assembly and annotation have been deposited in the NCBI Sequence Read Archive under accession number SRP500186 (ref. 49). The final chromosome assembly has been deposited at GenBank under accession number GCA_050437165.1 (ref. 50). The genome annotation file and RNA-seq data are available at the Figshare database51.

Technical Validation

We evaluated the quality of the genome assembly in terms of the quality of the sequencing data, the completeness of the assembly, and the correctness of the assembly: 1) when clean reads from genomic sequencing were aligned to the genome assembly with BWA, a mapping rate of 98.99% was achieved, indicating that the assembly accounts for the large majority of the sequencing reads; 2) no scattered clusters appeared in the GC content and depth distribution map (Fig. 1B), indicating that there was no contamination in the assembly; 3) the BUSCO evaluation showed that 91.6% of BUSCO genes (single-copy: 82.1%, duplicated: 9.5%) were successfully identified in the genome assembly, indicating a high level of completeness; 4) the Hi-C heatmap revealed a well-organized interaction contact pattern along the diagonals within and around the chromosome inversion region, which indirectly confirmed the accuracy of the chromosome assembly.
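The GC-depth check in point 2 pairs the GC content of each genomic window with its read depth, so that contaminant sequences with a distinct base composition show up as separate clusters. As an illustration of the GC half of that plot only, the sketch below computes GC content in fixed, non-overlapping windows. The window size, function names and toy sequence are hypothetical; the paper does not state the window size used.

```python
from typing import List, Optional


def gc_fraction(seq: str) -> Optional[float]:
    """GC fraction over unambiguous bases; None if the window has no A/C/G/T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None  # e.g. a window consisting only of Ns
    return gc / (gc + at)


def windowed_gc(seq: str, window: int) -> List[Optional[float]]:
    """GC fraction for consecutive non-overlapping windows of a sequence."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence (hypothetical), window of 4 bases.
toy_seq = "GGCCAATTNNNNGCGC"
print(windowed_gc(toy_seq, 4))
```

In practice each window's GC value would be plotted against the mean read depth of the same window, with depth taken from the read alignments.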