Background & Summary

Phytoseiulus persimilis Athias-Henriot (Acari, Phytoseiidae) stands out as one of the most successful predatory mites in the world. Despite its extremely small body size, it is one of the mainstays of integrated pest management for the control of pest mites in greenhouses. P. persimilis is a specialist predator of Tetranychus1,2,3, with strong predation ability4 against T. lintearius5 and T. turkestani6. Although it has neither single nor compound eyes, it can locate prey rapidly and accurately using chemical cues emitted by the prey or by plants in response to attack by phytophagous species7,8. Development from egg to adult takes less than a week, and limb regeneration occurs during the early stages of development, from three pairs of legs in the larval stage to four pairs after the protonymph stage. In addition, P. persimilis has a distinctive sex regulatory mechanism in which the first three offspring always follow a strict sex sequence: male, female and female9. A full understanding of these biological traits requires wide-scale exploration of model phytoseiid genomes. However, the lack of high-quality genomic information impedes advanced research on P. persimilis and other predatory mites10,11,12. In the current study, we focus on the assembly and annotation of a high-quality reference genome of P. persimilis to enrich the genomic resources of Phytoseiidae. This resource can pave the way to understanding the genetic basis of prey recognition, limb regeneration and sex determination in phytoseiid species.

In this paper, we present a new chromosome-level genome of P. persimilis assembled from PacBio high-fidelity (HiFi) reads (Fig. 1). The final assembly spans 214.23 Mb, with a scaffold N50 of 57.95 Mb and 98.3% completeness based on BUSCO evaluation. Annotation identified 27.59% repeat sequences and 15,847 protein-coding genes in the genome. This comprehensive genomic dataset serves as a valuable resource for further research on P. persimilis.

Fig. 1

The workflow of this study. The panels in green, orange, light blue and blue represent the processes of genome assembly, genome size estimation, transcript assembly and genome annotation, respectively.

Methods

Sample collection and genomic DNA sequencing

The colony of P. persimilis has been maintained for over a decade at the Laboratory of Predatory Mites, Institute of Plant Protection, Chinese Academy of Agricultural Sciences in Beijing, China. P. persimilis was reared in a tritrophic system consisting of bean seedlings (Phaseolus vulgaris L.), T. urticae and P. persimilis. All were maintained at 25 ± 1 °C, with a humidity level of 70 ± 5% and a photoperiod of 16 h light:8 h dark (L16:D8)13.

Total genomic DNA was extracted from a pooled sample of 800 male and female adult specimens using the Qiagen DNeasy Blood & Tissue Kit (Germany). Prior to extraction, the specimens were washed with sodium hypochlorite for 3 min to eliminate surface contaminants. For genome survey analysis, we prepared Illumina short-read DNA libraries (150 bp paired-end, 19.2 Gb, ~90X). A long-read DNA library was constructed from over 5 μg of DNA and sequenced on the PacBio Sequel II platform at GrandOmics, Beijing, China (44 Gb, ~205X). To aid protein-coding gene prediction, total RNA was extracted from several developmental stages of P. persimilis: egg, larva, protonymph, deutonymph, female adult, and male adult. Short-read RNA libraries were then prepared and sequenced on the Illumina platform (150 bp paired-end, 68.95 Gb).
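The reported sequencing depths can be reproduced with simple arithmetic. The sketch below assumes that depth is computed as total sequenced bases divided by the 214.23 Mb final assembly size; the text does not state which denominator was used, so this is an assumption, and the function name is illustrative.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return approximate fold coverage: sequenced bases / genome size."""
    return (total_bases_gb * 1000.0) / genome_size_mb


# Assumption: the 214.23 Mb final assembly size is the denominator.
GENOME_MB = 214.23

illumina_depth = sequencing_depth(19.2, GENOME_MB)  # Illumina survey reads
hifi_depth = sequencing_depth(44.0, GENOME_MB)      # PacBio HiFi reads

print(f"Illumina: ~{illumina_depth:.0f}X, HiFi: ~{hifi_depth:.0f}X")
```

Under this assumption the values round to ~90X and ~205X, matching the figures quoted above.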

Genome assembly

The genome size of P. persimilis was estimated through k-mer analyses of raw Illumina short reads and raw PacBio HiFi reads. A k-mer distribution (k = 21) was generated using Jellyfish v2.3.114, and the genome size was estimated using GenomeScope v1.0.015. The 21-mer depth analysis yielded an estimated genome size of approximately 190 Mb with 1.3% heterozygosity (Fig. 2). The P. persimilis genome was assembled from PacBio HiFi reads with Hifiasm v0.19.5-r59316 under default settings, resulting in a draft genome of 228.49 Mb. It comprised 520 contigs, with an N50 of 59.68 Mb and a largest contig of 63.92 Mb (Table 1). To refine the primary assembly and eliminate redundancy, we employed purge_dups v1.2.617. We then used blobtools v1.1.118 to identify and remove bacterial contaminant sequences. The accuracy and completeness of the assembly were evaluated using (i) QUAST v5.0.219 and (ii) BUSCO v5.4.720, based on the arachnida_odb10 lineage dataset (https://busco-data.ezlab.org/v5/data/lineages/).
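The idea behind k-mer-based genome size estimation can be illustrated with a deliberately simplified calculation: the total number of k-mer occurrences divided by the depth of the main peak. This is not the model-fitting procedure of the tool cited above. The sketch assumes that the tallest non-error peak is the homozygous peak, which may not hold for a heterozygous genome, and the histogram values and `min_depth` cutoff are hypothetical.

```python
def estimate_genome_size(histogram: dict, min_depth: int = 5) -> float:
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (a hypothetical cutoff chosen for illustration).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Assumption: the tallest remaining peak is the homozygous peak.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error spike at depth 1 and a main peak at depth 20.
toy_histogram = {1: 500_000, 18: 100, 19: 200, 20: 400, 21: 200, 22: 100}
toy_size = estimate_genome_size(toy_histogram)
print(toy_size)
```

In practice the histogram would come from a k-mer counter's output, and a fitted model would also account for heterozygosity and repeats.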

Fig. 2

Estimation of P. persimilis genome size with 21-mer based on raw Illumina short-reads (a) and raw PacBio HiFi reads (b).

Table 1 Statistics and comparison of P. persimilis genome assembly.

The assembly size was close to the estimated genome size of approximately 190 Mb derived from k-mer analysis. The assembled genome exhibited a high level of completeness, reaching 98.3%. Of the 2,934 genes in the Arachnida BUSCO database, 94.6% were identified as complete and single-copy, 3.7% were complete but duplicated, 0.6% were fragmented, and 1.1% were missing (Table 2).

Table 2 The BUSCO assessment of the P. persimilis genome assembly compared with GCA_037576195.1, based on the 2,934 genes in arachnida_odb10. GCA_037576195.1 is the GenBank accession of a P. persimilis assembly.

Hi-C scaffolding

We used the Hi-C reads (accession number: SRR2341037721) and performed quality control by removing low-quality reads and adapters with Trimmomatic v0.3922 (settings: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36). The clean reads were then aligned to the contig assembly using HiC-Pro v3.1.023 to identify invalid pairs, and YaHS v1.124 was used to scaffold the contigs from the Hi-C interactions. The resulting assembly was manually reviewed using Juicebox v1.11.0825.
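To make the quality-trimming settings concrete, the following is a simplified illustration of what the LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps do to the quality scores of a single read. It is a sketch written for this explanation, not the trimming tool's actual implementation: adapter clipping (ILLUMINACLIP) is omitted, edge cases may be handled differently by the real tool, and the function name and example scores are hypothetical.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=20, min_len=36):
    """Return (start, end) of the retained slice, or None if the read is dropped."""
    start, end = 0, len(quals)
    # LEADING / TRAILING: strip bases at either end below the quality threshold.
    while start < end and quals[start] < leading:
        start += 1
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: scan 5'->3' and cut at the first window whose
    # mean quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # MINLEN: discard reads that are too short after trimming.
    return (start, end) if end - start >= min_len else None


# Hypothetical read: two poor leading bases, 40 good bases, a low-quality tail.
example_quals = [2, 2] + [35] * 40 + [10] * 6 + [2]
kept = trim_read(example_quals)
print(kept)
```

For the example read, the retained slice starts after the two poor leading bases and ends where the windowed mean first drops below 20; a uniformly good 30-base read would be dropped by the length filter.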

In summary, the chromosome-level genome had a total size of 214.23 Mb with a scaffold N50 of 57.95 Mb (Table 1). Approximately 88.91% (190.48 Mb) of the bases were successfully anchored onto four pseudo-chromosomes (Fig. 3), consistent with the results of karyotype studies26.
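The two summary statistics used here, N50 and anchoring rate, are straightforward to compute. The sketch below uses hypothetical scaffold lengths for the N50 example, since the individual scaffold lengths are not listed in the text, and takes the 214.23 Mb total assembly size reported in the Background & Summary as the denominator for the anchoring rate.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty length list")


# Hypothetical scaffold lengths (Mb), for illustration only.
toy_lengths = [60, 55, 45, 30, 5, 3, 2]
toy_n50 = n50(toy_lengths)

# Anchoring rate: anchored bases / total assembly size (values from the text).
anchored_mb = 190.48
total_mb = 214.23
anchoring_pct = round(100 * anchored_mb / total_mb, 2)

print(toy_n50, anchoring_pct)
```

With these inputs the anchoring rate evaluates to 88.91%, in agreement with the reported value.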

Fig. 3

Genome-wide Hi-C heatmap and circular chromosome representation of the P. persimilis genome assembly. (a) Heatmap of chromosome interactions. Colors indicate the frequency of Hi-C links, ranging from yellow (low) to red (high). Each black box represents a chromosome. (b) Circos plot of the distribution of genomic elements. The tracks indicate (i) chromosome length, (ii) gene density, with colors ranging from orange (low) to red (high), (iii) GC density, with colors ranging from light green (low) to dark green (high), and (iv) density distribution of transposable elements (TEs). The densities of genes, GC, and TEs were calculated in 100 kb windows. Center: photo of a P. persimilis female adult.

Repetitive elements and protein-coding genes annotation

A de novo repeat library was constructed using RepeatModeler v1.0.1127. Subsequently, RepeatMasker v4.1.228 was employed with the filtered de novo repeat library to identify and soft-mask repeats in the draft assembly before annotation. In total, 59.10 Mb of repetitive sequences, representing 27.59% of the assembled genome, were detected. The major repeat classes were SINEs (0.01%), LINEs (2.41%), LTR elements (2.11%), DNA elements (4.65%), rolling-circles (0.45%), satellites (0.10%), and unclassified repeats (16.70%) (Table 3).

Table 3 Summary of repetitive elements identified in the P. persimilis genome assembly.

Following the masking of repeat sequences, protein-coding genes were annotated with a pipeline combining three strategies: transcriptome-based prediction, homology-based prediction, and ab initio prediction. For transcriptome-based prediction, Trinity v2.15.129,30 was used to assemble the transcriptome, and protein-coding genes were then predicted with PASA v2.5.331,32,33,34. Homology-based prediction employed miniprot v0.1335, comparing gene structures with the Arthropoda protein dataset from OrthoDB v11 (Bioinformatics Web Server - University of Greifswald, uni-greifswald.de). Ab initio prediction was performed using BRAKER v3.0336,37,38,39,40,41,42,43,44,45,46,47 to generate gene models based on the short-read RNA-seq transcriptome data. Subsequently, EVidenceModeler v2.1.032,34 integrated the results of the three strategies into a non-redundant gene set, with weights assigned as follows: PROTEIN: 3; TRANSCRIPT: 10; ABINITIO_PREDICTION: 1. Overall, 15,905 genes were obtained, of which 15,847 were protein-coding (Table 4). The average gene length was 15,868.01 bp. The mean numbers of exons, introns, and CDS per gene were 11.5, 9.8, and 10.6, respectively. The average lengths of exons, introns, and CDS were 472.33 bp, 877.58 bp, and 255.34 bp, respectively.
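The evidence weights listed above are typically supplied to EVidenceModeler as a three-column weights file (evidence class, source label, weight). The sketch below generates such a file from the stated weights. The source labels in the second column (`miniprot`, `PASA`, `BRAKER`) and the output file name are hypothetical placeholders: in a real run they must match the source column of the corresponding GFF3 evidence files, which the text does not specify.

```python
# Evidence class and weight come from the text; the source labels are
# hypothetical and must match column 2 of the GFF3 evidence files in practice.
WEIGHTS = [
    ("PROTEIN", "miniprot", 3),
    ("TRANSCRIPT", "PASA", 10),
    ("ABINITIO_PREDICTION", "BRAKER", 1),
]


def format_weights(rows):
    """Render rows as tab-separated lines: class, source, weight."""
    return "".join(f"{cls}\t{src}\t{weight}\n" for cls, src, weight in rows)


weights_text = format_weights(WEIGHTS)

# Hypothetical output path for illustration.
with open("weights.txt", "w") as handle:
    handle.write(weights_text)

print(weights_text, end="")
```

Giving transcript evidence the highest weight means that, where the three sources disagree, the consensus gene model follows the RNA-seq-supported structure first.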

Table 4 Feature annotations of P. persimilis.

Gene functional annotation was carried out by aligning protein sequences against the Non-Redundant protein (NR), Universal Protein (UniProt), and Protein Families Analysis and Modeling (Pfam) databases. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations were obtained through PANNZER2 (helsinki.fi)49 and GhostKOALA (kegg.jp)48, respectively. In total, 12,344 protein-coding genes were functionally annotated (Table 5).

Table 5 Statistics of P. persimilis functional annotation.

Data Records

The WGS, RNA-seq, and PacBio HiFi data for the P. persimilis genome are available at NCBI under the accession numbers SRR2967425150, SRR29686341-SRR2968634651,52,53,54,55,56, and SRR2968644157, respectively, within BioProject accession number PRJNA1128535. The genome assembly is also available at NCBI (accession number: JBKEIU00000000058) and Figshare59.

Technical Validation

We assessed the quality of the chromosome-level genome of P. persimilis in terms of three key aspects: continuity, consistency, and completeness. For continuity, the scaffold N50 of the P. persimilis assembly is 57.95 Mb (Table 1). For consistency, we aligned the Illumina short reads to the reference genome using BWA 0.7.17-r118860, and 98.3% of the short reads were successfully aligned. For completeness, we used BUSCO v5.4.720,61 with the 2,934 genes in arachnida_odb10. The results showed a high level of completeness, with 98.3%, 98.4% (Table 2), and 96.1% of complete genes identified in the contig-level genome, the chromosome-level genome, and the protein set, respectively. These assessments collectively demonstrate the robust quality of the chromosome-level genome of P. persimilis.