Background & Summary

Eriophyoid mites (Acariformes, Eriophyoidea) form one of the largest superfamilies in the Arachnida, comprising over 5,000 named species1,2 with a worldwide distribution3. These tiny mites (~200 μm in length, among the smallest arthropods) are vermiform to fusiform, have only two pairs of legs, and are strictly phytophagous with high host-plant specificity4,5; some cause massive economic losses in agriculture and forestry6.

Despite the need to understand the ecology and evolution of eriophyoid mites, no chromosome-level genome assembly is yet available for the group. A near-chromosome-level genome assembly has been published for the tomato russet mite Aculops lycopersici7, but the lack of high-quality chromosome-level genome resources has limited comparative genomic analyses among eriophyoid mites.

In this study, we assembled a chromosome-level genome for Setoptus koraiensis (Eriophyoidea, Phytoptidae) using PacBio long-read sequencing, Illumina short-read sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. The final assembly is 47 Mb across two chromosomes, with a scaffold N50 of 24.53 Mb (Table 1). This is the first chromosome-level genome for an eriophyoid mite and provides a new data resource for understanding the Eriophyoidea.

Table 1 Statistics of Setoptus koraiensis genome assembly.

Methods

Sample collection

At least 100,000 wild S. koraiensis individuals, including eggs, juveniles and adults, were collected from Pinus koraiensis Siebold & Zucc. (Pinaceae) in Lishui, Nanjing city, Jiangsu province, China (31.3921°N, 118.5417°E). Samples were identified by morphological characteristics, supported by molecular evidence (mitochondrial COI). Vouchers were deposited in the Arthropod/Mite Collection of the Department of Entomology, Nanjing Agricultural University, Jiangsu Province, China.

Genome sequencing

Genomic DNA was extracted from more than 100,000 individuals using the MagAttract HMW DNA Kit. A PacBio 30 kb SMRTbell library was prepared from more than 5 μg of gDNA using the SMRTbell™ Prep Kit 2.0 (Pacific Biosciences) and sequenced in Continuous Long Read (CLR) mode on the Sequel II platform. For Illumina whole-genome sequencing, a 350 bp-insert library was prepared with the TruSeq DNA PCR-free Kit and sequenced (150 bp paired-end) on an Illumina NovaSeq 6000 platform. High-throughput chromosome conformation capture (Hi-C) library preparation included cross-linking, HindIII restriction enzyme digestion, end repair, DNA cyclization, purification and capture. The Hi-C library, with an insert size of 300–700 bp, was sequenced on the NovaSeq 6000 platform. In total, we generated 24.25 Gb (~496×) of PacBio long reads, 9.5 Gb (~194×) of Illumina short reads, and 9 Gb (~184×) of Hi-C reads for genome assembly.
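The fold-coverage figures quoted above follow from dividing the sequencing yield by the genome size. The sketch below reproduces that arithmetic. The text does not state which genome size was used for the depth calculation; the value of 48.9 Mb is an assumption back-calculated from the reported depths, and the function name is illustrative only.

```python
def coverage_depth(yield_gb: float, genome_size_mb: float) -> float:
    """Mean sequencing depth (fold coverage) = total sequenced bases / genome size."""
    return (yield_gb * 1_000) / genome_size_mb  # 1 Gb = 1,000 Mb

# ASSUMPTION: genome size implied by the reported depths, not stated in the text.
ASSUMED_GENOME_MB = 48.9

# Yields as reported in the Genome sequencing section.
yields_gb = {"PacBio CLR": 24.25, "Illumina": 9.5, "Hi-C": 9.0}

depths = {name: coverage_depth(gb, ASSUMED_GENOME_MB) for name, gb in yields_gb.items()}
for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under this assumed genome size, the three computed depths round to the reported ~496×, ~194× and ~184×.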

Genome survey

Duplicate and low-quality Illumina raw reads (base quality < Q20, length < 15 bp, poly-A/G/C > 10 bp) were trimmed and removed using the BBTools package v38.828. The 21-mer depth distribution was counted using the ‘khist.sh’ script of BBTools. GenomeScope v2.09 was used to estimate the genome size and heterozygosity of S. koraiensis with the maximum k-mer coverage set to 1,000×. Based on the distribution of k-mer coverage and frequency, the estimated genome size of S. koraiensis was 45.72 Mb, with a heterozygosity rate of around 1.13% and a repeat content of approximately 3.3% (Fig. 1).
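The principle behind a k-mer based genome survey can be illustrated with a deliberately naive estimator: the number of k-mers above an error cutoff divided by the depth of the main histogram peak. This is not the mixture model that GenomeScope fits, and it ignores heterozygosity peaks and reverse-complement canonicalization. The function names, the cutoff and the toy histogram are assumptions made for illustration.

```python
from collections import Counter

def kmer_histogram(reads, k=21):
    """Count each k-mer in the reads, then tally how many distinct k-mers occur at each depth."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())

def estimate_genome_size(hist, min_depth=5):
    """Naive estimate: total k-mers at or above the error cutoff / depth of the main peak."""
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    peak_depth = max(solid, key=solid.get)               # most frequent depth
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth

# Toy histogram (hypothetical): low-depth error k-mers plus one peak near 50x.
toy_hist = {1: 5000, 2: 400, 49: 200, 50: 600, 51: 200}
toy_size = estimate_genome_size(toy_hist)
print(toy_size)
```

In the toy histogram the k-mers at depths 1 and 2 are treated as sequencing errors and excluded, which is why a cutoff is needed before locating the peak.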

Fig. 1

GenomeScope genome size estimates for Setoptus koraiensis.

Genome assembly

The CLR reads were assembled with Flye v2.610, including one round of Flye's built-in long-read polishing. Short reads were then used in two rounds of NextPolish v1.4.111 to polish the primary assembly and fill gaps. Haplotigs and duplications caused by haplotype divergence were removed with Purge_dups v1.2.512, using Minimap2 v2.2813 as the aligner. Hi-C reads were aligned to the purged assembly using BWA v0.7.1814 and Juicer v1.615, and contigs were anchored, ordered and oriented into a chromosome-level assembly following the 3D-DNA16 pipeline. We then manually reviewed and corrected assembly errors using Juicebox v2.1717. Contaminant sequences were identified by searching against the UniVec and NCBI nucleotide databases using BLAST+ v2.11.018 and MMseqs2 v1619, and were removed. The completeness of the genome assembly was evaluated by BUSCO version 5.2.220 using the eukaryota_odb10 dataset (creation date 2020-09-10). The whole-genome sequencing reads were aligned back to the genome assembly to assess the mapping rate. After de novo assembly, polishing and contaminant removal, the S. koraiensis assembly was 49.9 Mb in 565 scaffolds with an N50 length of 24.53 Mb; 94.2% of the assembly was anchored to two chromosomes (Fig. 2), resulting in a final genome size of 47 Mb (Table 1).
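For readers unfamiliar with the N50 statistic, the sketch below shows the standard definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. It also reproduces the anchoring arithmetic from the reported values (94.2% of 49.9 Mb). The scaffold lengths in the toy example are hypothetical and are not the S. koraiensis scaffolds.

```python
def n50(lengths):
    """Return the length L such that sequences >= L cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Toy scaffold lengths (hypothetical): total 28, so half is 14.
toy_n50 = n50([10, 8, 5, 3, 2])   # 10 alone is < 14; 10 + 8 = 18 >= 14
print(toy_n50)

# Anchoring arithmetic from the reported values.
anchored_mb = 49.9 * 0.942
print(f"{anchored_mb:.1f} Mb anchored")
```

The anchored length computed this way agrees with the reported final genome size of 47 Mb.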

Fig. 2

Genome-wide chromosomal heatmap of Setoptus koraiensis. Blue boxes indicate super-scaffolds.

Genome annotation

Repetitive elements were identified using RepeatModeler v2.0.521 with the ‘-LTRstruct’ pipeline to discover complete long terminal repeat (LTR) elements. RepeatMasker v4.1.622 was then run against a custom repeat library of Dfam 3.823 and Repbase v2018102624 with the options ‘-no_is -norna -xsmall -q’ to soft-mask repeats in the genome assembly.
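Soft masking (the ‘-xsmall’ option) writes repeat regions in lowercase rather than replacing them with N, so the masked fraction of an assembly can be recovered by counting lowercase bases. The following is a minimal sketch of that count; the function name and the toy FASTA record are assumptions for illustration, not part of the published pipeline.

```python
import io

def masked_fraction(fasta_handle):
    """Return (masked_bases, total_bases) for a soft-masked FASTA stream."""
    masked = total = 0
    for line in fasta_handle:
        if line.startswith(">"):      # skip sequence headers
            continue
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())  # lowercase = soft-masked
    return masked, total

# Toy record (hypothetical): 6 lowercase bases out of 20.
toy = io.StringIO(">chr_toy\nACGTacgtAC\nGTacNNNNGT\n")
masked, total = masked_fraction(toy)
print(f"{masked}/{total} bases masked ({100 * masked / total:.1f}%)")
```

For a real assembly, an open file handle to the soft-masked FASTA would be passed in place of the `io.StringIO` object.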

For gene structure annotation, we used a pipeline integrating ab initio and homology-based methods. BRAKER v2.1.525 was used to obtain ab initio gene predictions, employing GeneMark-ES/ET/EP v4.3326 and Augustus v3.4.027 with reference proteins from the OrthoDB v11 database28. GeMoMa v1.929 was used for homology-based prediction with the parameters “GeMoMa.c = 0.4 GeMoMa.p = 10”, and the protein sequences of six species (Aculops lycopersici (GCA_015350385.1), Tetranychus urticae (GCA_039701765.1), Tetranychus piercei (GCA_036759885.1), Panonychus citri (GCA_014898815.1), Pyemotes zhonghuajia (GCA_025170145.1), Blomia tropicalis (GCA_029204025.1)) were provided to assist gene prediction. The results from BRAKER and GeMoMa were combined and provided to MAKER v3.01.0330. Predicted protein sequences were functionally annotated by searching against the UniProt, InterProScan and eggNOG databases. Diamond v2.1.1031 in ‘very sensitive’ mode was used to assign gene functions from the best hits in the UniProt database. Gene Ontology (GO) terms and KEGG pathways were annotated using InterProScan v5.7232 and eggnog-mapper v2.1.1233 against the Pfam34, SMART35, Superfamily36, CDD37, and EggNOG 5.0.2 databases38.

Data Records

The raw reads and genome assembly have been deposited in the NCBI databases under BioProject PRJNA1196018. The PacBio, Illumina, and Hi-C data are available under accession numbers SRR32458739-SRR3245874139. The final chromosome assembly has been deposited in GenBank under the accession number GCA_048013815.140. The mitochondrial COI sequence has been deposited in GenBank under the accession number PV16383341. The genome assembly and annotation files are available in Figshare42.

Technical Validation

We mapped the Illumina sequencing data to the final assembly with BWA v0.7.18, and the mapping rate was 92.9%. We assessed the completeness of the genome assembly using BUSCO v5.4.2 with the ‘eukaryota_odb10’ database. Complete BUSCOs accounted for 89% (83.9% single-copy and 5.1% duplicated), with 5.5% fragmented and 5.5% missing, which is higher than the completeness of A. lycopersici (86.3%). We masked 6.42 Mb (13.82%) of the S. koraiensis genome as repetitive regions. Among them, short interspersed elements (SINEs) accounted for 0.2%, long interspersed elements (LINEs) for 1.29%, long terminal repeats (LTRs) for 0.92%, DNA transposons for 1.61%, and unclassified repeats for 5.14% (Fig. 3). We identified 5,954 protein-coding genes, of which 4,770 could be functionally annotated. The BUSCO completeness of the protein sequences is 77.3% (71.4% single-copy and 5.9% duplicated), with 3.9% fragmented and 18.8% missing, using the ‘eukaryota_odb10’ database. Together, these results support the completeness and accuracy of the S. koraiensis genome assembly.
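The BUSCO categories reported above are internally related: complete equals single-copy plus duplicated, and complete, fragmented and missing sum to 100%. The sketch below checks the reported percentages against those identities and tallies the itemized repeat classes. The helper name and the rounding tolerance are assumptions, and all input values are taken from the text.

```python
def busco_consistent(single, duplicated, fragmented, missing, complete, tol=0.15):
    """Check C = S + D and C + F + M = 100 within a rounding tolerance."""
    return (
        abs((single + duplicated) - complete) <= tol
        and abs((complete + fragmented + missing) - 100.0) <= tol
    )

# Percentages as reported in Technical Validation.
genome = dict(single=83.9, duplicated=5.1, fragmented=5.5, missing=5.5, complete=89.0)
proteins = dict(single=71.4, duplicated=5.9, fragmented=3.9, missing=18.8, complete=77.3)

genome_ok = busco_consistent(**genome)
proteins_ok = busco_consistent(**proteins)
print(genome_ok, proteins_ok)

# Repeat classes itemized in the text versus the total masked fraction (13.82%).
itemized = {"SINE": 0.2, "LINE": 1.29, "LTR": 0.92, "DNA": 1.61, "Unclassified": 5.14}
not_itemized = 13.82 - sum(itemized.values())
print(f"{not_itemized:.2f}% of masked sequence is not itemized in the text")
```

The remainder presumably corresponds to repeat categories not listed in the sentence, such as the simple repeats shown in Fig. 3, although the text does not say so explicitly.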

Fig. 3

Circular karyotype representation of the chromosomes of Setoptus koraiensis. Tracks from inside to outside are GC content (GC), density of protein-coding genes (GENE), DNA transposons (DNA), LTR/LINE/SINE retrotransposons (LTR, LINE, SINE), and simple repeats (Simple).