Background & Summary

Neoseiulus longispinosus is an important predatory natural enemy of spider mites (family Tetranychidae) in agricultural ecosystems1 and is distributed mainly in tropical and subtropical regions of Asia. It preys on pestiferous spider mites such as Eotetranychus sexmaculatus, Tetranychus cinnabarinus, Oligonychus coffeae, Panonychus citri, Tetranychus neocaledonicus, and Tetranychus truncatus2,3,4,5. The mites are small and yellow or red in colour. Adults show strong sexual dimorphism: females have round tails and 1.5 times the body size of males, whereas males have flat tails. Applying this mite in agricultural production can reduce the use of chemical pesticides and help ensure crop quality and safety. N. longispinosus has been used to control Tetranychus urticae in papaya orchards in south Florida, USA3, and Panonychus citri in citrus orchards in Southeast Asia4. This species is one of the most effective predators of spider mites in tropical and subtropical habitats and may become an integral part of biological pest control6. Research on N. longispinosus has thus far focused on laboratory evaluation of its pest-suppression ability, and no genomic information has been available for the species. This gap hinders exploration of the mechanisms of population adaptation and the molecular regulation of pest‒predator interactions, and limits efforts to strengthen its use. Here, we sequenced and analysed the genome of N. longispinosus to support the study and controlled application of this predatory mite.

We combined PacBio7, Illumina, RNA-seq, and Hi-C sequencing technologies to generate a high-quality chromosome-level genome assembly of N. longispinosus (Table 1). The final genome assembly spanned 199.21 Mb, 196.39 Mb of which was anchored on four pseudochromosomes, indicating effective chromosome-level scaffolding. The assembly had a scaffold N50 of 54.56 Mb and a contig N50 of 2.35 Mb, with a QV score of 38.69 and a k-mer-based completeness of 90.68%, indicating high continuity and quality. Repetitive elements accounted for 22.38% of the genome. The annotation identified 12,890 protein-coding genes, with an average protein length of 493.2 amino acids. These results provide a foundational genomic resource for N. longispinosus research. This genome information can facilitate the study of the molecular mechanisms of population adaptation and support the future development of large-scale biological control applications.

Table 1 Raw sequencing data.

Methods

Sampling

Neoseiulus longispinosus was collected from soybean leaves in a greenhouse at the Chinese Academy of Tropical Agricultural Sciences, Danzhou, Hainan Province, China. A colony that had been reared continuously for at least 50 generations with T. cinnabarinus as prey was used as the source material for genome sequencing. For sampling, eggs of N. longispinosus were gently removed from soybean leaves using a fine hairbrush, cleaned of impurities, and transferred into sterile centrifuge tubes for sequencing. 2000 eggs were collected and placed into the tubes, and five tubes were collected in total.

DNA isolation and sequencing

Genomic DNA was isolated from N. longispinosus eggs using the CTAB method8. DNA concentration was quantified with a Qubit 2.0 fluorometer (Thermo Fisher, USA) and verified with a Nanodrop 2000c spectrophotometer (Thermo Fisher, USA). The quality and integrity of the genomic DNA were assessed by 1% agarose gel electrophoresis. Sample purity was determined using spectrophotometric ratios (260/280 and 260/230). Qualified DNA was sheared into 15-kb fragments using a Megaruptor instrument (Diagenode B06010001). The resulting fragments were used for library construction with a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Cat. #PN 101-853-100; Menlo Park, CA, USA). Finally, the library was sequenced on a PacBio Sequel II platform. For Illumina sequencing, a 350-bp insert library was constructed using a TruSeq DNA PCR-Free Library Preparation Kit (Illumina, Cat. #20015962; San Diego, CA, USA), followed by paired-end sequencing on an Illumina NovaSeq 6000 platform. All library preparation and sequencing operations were performed by Berry Genomics Corporation (Beijing, China).

Genome size estimation

Illumina paired-end sequencing yielded 11.35 Gb of raw data (Table 2). Quality control and trimming were performed using BBTools (v38.82)9. Adapters were first trimmed with bbduk.sh, and duplicate reads were then removed with clumpify.sh to ensure high-quality sequencing data. In addition, bbduk.sh was used to filter out low-quality bases (Phred score < 20), remove reads shorter than 15 bp, trim homopolymeric tails (A/G/C) exceeding 10 bp in length, and resolve overlaps in paired-end reads through base correction.

Table 2 Raw Illumina sequencing data for N. longispinosus.
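The quality and length thresholds described above can be illustrated with a minimal Python sketch. It does not reproduce the trimming logic of bbduk.sh; it only shows the idea of removing 3′ bases below Phred 20 and discarding reads shorter than 15 bp. The function names, the toy read, and the Phred+33 encoding are assumptions made for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_and_filter(seq, qual, min_q=20, min_len=15):
    """Trim low-quality bases from the 3' end, then apply a length filter.

    Returns the trimmed (seq, qual) pair, or None if the read is too short.
    """
    scores = phred_scores(qual)
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


# Hypothetical read: 20 good bases (Q40, 'I') followed by 5 poor bases (Q2, '#').
read_seq = "ACGTACGTACGTACGTACGT" + "NNNNN"
read_qual = "I" * 20 + "#" * 5
trimmed = trim_and_filter(read_seq, read_qual)
```

Here the five trailing low-quality bases are removed and the remaining 20-bp read passes the length filter, whereas a 4-bp read would be discarded.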

Initially, we generated a k-mer frequency distribution from the quality-controlled Illumina reads utilising khist.sh (BBTools) with a k-mer length of 21. Further k-mer analysis was performed using GenomeScope (v2.0)10 with the following specified parameters: -k 21 -p 2 -m 1000. This analysis predicted a genome size of 210.09 Mb (Table 3). It also estimated a heterozygosity rate of 0.78%, which was visualised as a distinct peak to the left of the main coverage peak in the distribution, and repetitive sequences accounted for 20.33 Mb (9.68%) (Fig. 1).

Table 3 Survey Analysis Results.
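The principle behind k-mer-based size estimation can be sketched in a few lines of Python. GenomeScope fits a mixture model to the full k-mer spectrum, so the code below is not its method; it only illustrates the simpler relationship that genome size is approximately the total number of k-mers divided by the depth of the main coverage peak. The histogram values and the error cut-off (`min_depth`) are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate haploid genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # Take the most populated depth as the main coverage peak. In a highly
    # heterozygous genome the half-depth peak could be taller, so this
    # shortcut is only appropriate for a simple illustration.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Hypothetical toy spectrum: an error tail at depths 1-2 and a peak at 25.
toy_histogram = {1: 5000, 2: 800, 24: 100, 25: 200, 26: 100}
size_estimate, peak = estimate_genome_size(toy_histogram)
```

With this toy spectrum, the 10,000 usable k-mer occurrences divided by the peak depth of 25 give an estimate of 400 positions.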
Fig. 1 GenomeScope size estimates for Neoseiulus longispinosus.

Genome assembly

PacBio HiFi sequencing generated 17.03 Gb of high-quality reads, which were first assembled de novo using Hifiasm (v0.16.1)11 to generate an initial set of contigs. To refine the assembly and remove heterozygous duplications, we first aligned the HiFi reads back to the assembled contigs using Minimap2 (v2.4) to generate a SAM file, which was then converted to BAM format using SAMtools (v1.15)12 for subsequent processing. This alignment file was processed with Purge_Dups (v1.2.5)13,14 to detect and remove heterozygous duplicated sequences. We then filtered the refined assembly by discarding any contigs with a sequencing depth below 10×, which were most likely contaminants or assembly errors.
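The depth filter in the last step can be sketched as follows. The text does not state how per-contig depth was calculated, so this example assumes mean depth is the number of mapped bases divided by contig length; the contig names and numbers are hypothetical.

```python
def filter_contigs_by_depth(contig_lengths, mapped_bases, min_depth=10.0):
    """Keep contigs whose mean depth (mapped bases / length) is >= min_depth.

    Returns a dict of contig name -> mean depth for the retained contigs.
    """
    kept = {}
    for name, length in contig_lengths.items():
        # Contigs with no mapped reads get a depth of zero.
        depth = mapped_bases.get(name, 0) / length
        if depth >= min_depth:
            kept[name] = round(depth, 2)
    return kept


# Hypothetical contigs: one well supported, one shallow, one with no reads.
lengths = {"ctg1": 1_000_000, "ctg2": 50_000, "ctg3": 20_000}
bases = {"ctg1": 85_000_000, "ctg2": 300_000}
retained = filter_contigs_by_depth(lengths, bases)
```

Only `ctg1` (mean depth 85×) survives; `ctg2` (6×) and `ctg3` (0×) fall below the 10× threshold.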

A Hi-C library was constructed using an EpiTect Hi-C Kit (QIAGEN, Cat. #59971, Hilden, Germany). The target cells were cross-linked with 1% paraformaldehyde at room temperature for 30 min and then quenched with 0.125 mol/L glycine for 5 min. After cell lysis, the cross-linked chromatin was released and digested using HindIII; the digested fragments were subjected to end repair and biotin labelling and then ligated using Hi-C Ligation Buffer for intramolecular ligation. After reverse cross-linking and DNA purification by proteinase K digestion, the products were mechanically sheared to a main peak of 350 bp, and the biotin-labelled fragments were enriched using streptavidin magnetic beads. Illumina adapters were added to complete the library construction. After quality control, the library was sequenced on an Illumina NovaSeq platform (PE150), generating 12.44 Gb of raw data. Chromosome-level assembly of the genome was achieved by integrating the Hi-C data. The Hi-C reads were first processed using Juicer (v1.6.2)15. The deduplicated assembly was then scaffolded using 3D-DNA (v180922)16 with default parameters to generate the draft assembly.

To assess potential contamination in the genome assembly, a blastn-like search was performed using MMseqs2 (v13)17 with alignments against the NCBI nt and UniVec databases. Genome completeness was evaluated via BUSCO (v5.4.4)18 using the arachnida_odb10 reference dataset. The draft genome assembly was visualised and manually curated in Juicebox (v1.11.08) to fix assembly errors. The updated scaffolds were reintroduced into 3D-DNA for final anchoring and ordering to achieve the chromosome-level genome assembly (Fig. 2). To assess the quality of the de novo assembly, Merqury (v1.3)19 was used to calculate the QV score and k-mer completeness.
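For readers unfamiliar with the QV metric, the sketch below follows the relationship described in the Merqury publication: the per-base error rate is inferred from the fraction of assembly k-mers that are absent from the read set, and QV is the Phred-scaled error rate. The k-mer size used in this study is not stated, so `k=21` and the k-mer counts in the example are assumptions.

```python
import math


def merqury_style_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the reads.
    total_asm_kmers: all k-mers in the assembly.
    """
    if asm_only_kmers == 0:
        return math.inf
    # Probability that a single base is correct, assuming a k-mer is
    # error-free only when all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


def qv_to_error_rate(qv):
    """Convert a QV back to an expected per-base error rate."""
    return 10 ** (-qv / 10)


# Hypothetical counts: 0.21% of assembly k-mers are unsupported by reads.
example_qv = merqury_style_qv(2_100, 1_000_000)
```

With these hypothetical counts the result is close to QV 40, which corresponds to roughly one error per 10,000 bases.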

Fig. 2 Hi-C intrachromosomal contact map of the Neoseiulus longispinosus genome assembly.

The final genome assembly of N. longispinosus was 199.21 Mb, with a GC content of 50.24%, a QV value of 49.4, and a k-mer completeness of 90.68%. This genome assembly contained 220 contigs, which were divided into 73 scaffolds. The contig N50 was 2.35 Mb, and the scaffold N50 was 54.56 Mb. The longest scaffold length was 60.60 Mb. Overall, 98.59% (196.39 Mb) of the assembly was anchored to four pseudochromosomes (Table 4).

Table 4 Genome assembly statistics.
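Because contig and scaffold N50 values are central to the statistics above, a short Python sketch of the N50 calculation may be helpful. The lengths in the example are hypothetical and are not the actual scaffold lengths of this assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in Mb: four large scaffolds plus two small ones.
toy_lengths = [60, 55, 45, 36, 2, 1]
toy_n50 = n50(toy_lengths)
```

In this toy set the two longest sequences (60 + 55) already exceed half of the 199 total, so the N50 is 55.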

Genome assessment

To evaluate the completeness and base-level accuracy of the assembly, Minimap2 was used to realign the PacBio HiFi sequencing data to the final genome assembly. SAMtools12 was used to process and analyse the alignments. The average sequencing depth across the whole genome was 85.51× (Table 5). In addition, 99.69% of the assembly was covered by at least one read, indicating a high completion rate. The deep and near-total coverage confirms that the assembly is well supported by the raw sequencing data.

Table 5 Genome Assembly Results and PacBio HiFi Data Coverage Statistics.
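The two coverage statistics reported above (mean depth and the fraction of positions covered by at least one read) can be computed from per-base depth values, as in the minimal sketch below. The exact commands used in this study are not given; per-base depths could, for example, be exported with `samtools depth -a`, and the toy depth values here are hypothetical.

```python
def coverage_summary(per_base_depth):
    """Return (mean depth, percentage of positions covered by >= 1 read)."""
    total = len(per_base_depth)
    mean_depth = sum(per_base_depth) / total
    covered = sum(1 for depth in per_base_depth if depth >= 1)
    return mean_depth, 100 * covered / total


# Toy example: five positions, one of them uncovered.
toy_mean, toy_breadth = coverage_summary([0, 80, 90, 85, 95])

# Rough cross-check using figures reported in the text:
# 17.03 Gb of HiFi data over a 199.21 Mb assembly.
expected_depth = 17.03e9 / 199.21e6
```

The cross-check gives roughly 85×, in line with the reported mean depth of 85.51×.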

The completeness of the genome assembly was quantitatively evaluated using BUSCO (v5.4.4) with the arachnida_odb10 dataset containing 2,934 conserved single-copy orthologues of Arachnida. BUSCO analysis was performed in genome mode for the assembled genome sequences and protein mode for the predicted protein sequences. The target sequences were aligned against the conserved orthologue set, and the matches were classified into complete (single-copy and duplicated), fragmented, and missing categories on the basis of sequence alignment coverage and homology thresholds.

In genome mode, BUSCO20 reported C: 96.4% [S: 94.0%, D: 2.4%], F: 0.6%, M: 3.0%, and n: 2934 for the assembly (Table 6).

Table 6 BUSCO Evaluation Results of the Genome Assembly of N. longispinosus.
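The BUSCO shorthand used above can be parsed and sanity-checked with a few lines of Python. This is an illustrative sketch rather than part of the BUSCO software; the function names and the tolerance for rounding are assumptions.

```python
import re


def parse_busco_summary(summary):
    """Parse a BUSCO one-line summary into a dict (percentages as floats)."""
    pattern = re.compile(r"([CSDFMn])\s*:\s*([\d.]+)")
    values = {key: float(num) for key, num in pattern.findall(summary)}
    values["n"] = int(values["n"])
    return values


def is_consistent(values, tol=0.11):
    """Check that S + D = C and C + F + M = 100, allowing for rounding."""
    complete_ok = abs(values["S"] + values["D"] - values["C"]) <= tol
    total_ok = abs(values["C"] + values["F"] + values["M"] - 100) <= tol
    return complete_ok and total_ok


# Summary string as reported for the genome assembly.
genome = parse_busco_summary(
    "C:96.4% [S: 94.0%, D: 2.4%], F: 0.6%, M: 3.0%, n: 2934"
)
genome_ok = is_consistent(genome)
```

The reported categories are internally consistent: single-copy plus duplicated equals the complete fraction, and complete, fragmented, and missing sum to 100%.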

To evaluate the integrity of the assembly and confirm that the generated sequencing data were effectively used, we realigned reads from the three sequencing datasets to the final assembly. The remapping percentages of Illumina short reads, RNA-seq reads and PacBio HiFi long reads were 94.74%, 92.88%, and 98.78%, respectively (Table 7). The uniformly high mapping rates indicate that most of the raw read data are represented in the final assembly, supporting its high degree of integrity.

Table 7 Mapping rates of multiplatform raw reads to the reference genome.

Together, the high BUSCO score, the high remapping rates across multiple data types, and the chromosome-level continuity (scaffold N50 = 54.56 Mb) indicate that the genome assembly has high accuracy and integrity.

Genomic repeat annotation

De novo identification of repetitive elements was initiated using RepeatModeler (v2.0.4) with an additional long terminal repeat (LTR) structure prediction module (-LTRStruct)21, enabling the detection of LTR retrotransposons based on their characteristic structural features. This analysis generated a species-specific repeat database derived from genome sequence homology analysis and ab initio structural motif prediction. To increase the coverage of known repetitive elements, we focused on Arachnida-specific curated repeats that matched the taxonomic identity of N. longispinosus and expanded the de novo library by incorporating sequences from the Dfam 3.522 and RepBase-2018102623 databases, generating a custom repeat library. This custom repeat library was then used for genome-wide repeat masking via RepeatMasker (v4.1.4)24. A total of 245,257 repetitive sequences (44,582,183 bp) were obtained from N. longispinosus, with a repetitive sequence proportion of 22.38%. The five most common types of repeats were unknown (14.27%), DNA transposons (2.95%), LTRs (1.74%), LINEs (1.35%), and simple repeats (0.64%). The genomic structure and annotation were visualised as a circular map with TBTools (v1.098769)25 (Fig. 3).
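The repeat proportions quoted above can be cross-checked with simple arithmetic, sketched here in Python. All numbers are taken from the text; the variable names are illustrative.

```python
ASSEMBLY_BP = 199.21e6   # final assembly size reported in the text
REPEAT_BP = 44_582_183   # total length of repetitive sequences

# Repeat proportion of the assembly, in percent.
repeat_pct = 100 * REPEAT_BP / ASSEMBLY_BP

# Percentages of the five most common repeat classes, as reported.
top_five = {
    "unknown": 14.27,
    "DNA transposons": 2.95,
    "LTRs": 1.74,
    "LINEs": 1.35,
    "simple repeats": 0.64,
}
# Share of the genome occupied by all remaining repeat classes.
other_pct = 22.38 - sum(top_five.values())
```

The total length reproduces the reported 22.38%, and the five listed classes account for 20.95% of the genome, leaving about 1.43% for the remaining repeat classes.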

Fig. 3 Circular visualisation of the genomic features of Neoseiulus longispinosus. Note: From outside to inside, chromosome length (Chr), GC content (GC), gene density (GENE), DNA transposons (DNA), short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), long terminal repeats (LTR) and simple repeats (Simple) are represented.

Prediction of protein-coding genes

Protein-coding gene structure annotation was performed using MAKER (v3.01.03)26, which integrates three types of evidence to increase prediction accuracy. The evidence was obtained as follows: (1) Ab initio gene prediction: Gene models were predicted using BRAKER (v2.1.6)27 and GeMoMa (v1.8)28, incorporating both transcriptomic and proteomic evidence to broaden the pool of potential coding gene candidates, and the predictions from both tools were merged to serve as the ab initio input for MAKER. (2) Transcript alignment-based gene structure prediction: RNA-seq reads were aligned to the reference genome using HISAT2 (v2.2.0)29, and StringTie (v2.1.6)30 was employed for reference-guided assembly of the second-generation transcriptomic data. (3) Homology-based gene prediction: We downloaded the well-assembled and annotated protein sequences of closely related species from NCBI for homology comparisons: Dermacentor silvarum (GCF_013339745.2)31, Galendromus occidentalis (GCF_000255335.2)32, Tetranychus urticae (GCF_000239435.1)33, Drosophila melanogaster (GCF_000001215.4)34, and Varroa destructor (GCF_002443255.1)35.

MAKER predicted a total of 12,890 protein-coding genes in the N. longispinosus genome, with an average protein length of 493.2 amino acids. The completeness of the predicted protein-coding genes was assessed using BUSCO, with 96.7% complete genes [93.5% single-copy (S) and 3.2% duplicated (D)], 0.2% fragmented (F), and 3.1% missing (M), based on a set of 2,934 conserved orthologues.

Functional annotation of protein-encoding genes

To infer gene functions, protein sequences were compared against the UniProtKB database using the DIAMOND alignment tool (v2.0.11.149)36 with high-sensitivity settings (--very-sensitive -e 1e-5). Functional characterisation was further enhanced by integrating results from multiple comprehensive databases to identify conserved domains, sequence motifs, and biological pathways. Protein domain architecture was analysed using InterProScan (v5.53-87.0)37, which includes key databases, such as Pfam38, SMART39, Superfamily40, and CDD41, to predict structural and functional domains. Additionally, gene functional classification and pathway enrichment were performed using eggNOG-mapper (v2.1.5) against the eggNOG (v5.0) database42,43 for orthology-based annotation.
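As an illustration of how homology search results of this kind are typically summarised, the Python sketch below selects the best hit per query from BLAST-style 12-column tabular output (the default tabular layout of DIAMOND, in which columns 11 and 12 hold the E-value and bit score). It is not the authors' script; the gene and protein identifiers are hypothetical, and only the 1e-5 E-value cut-off comes from the text.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return the best-scoring hit per query from 12-column tabular output.

    The result maps query ID -> (subject ID, bit score).
    """
    best = {}
    for line in tabular_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical hits: gene1 has two significant hits, gene2 has none.
toy_hits = [
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350.0",
    "gene1\tprotB\t60.0\t180\t72\t2\t5\t184\t3\t182\t1e-20\t150.0",
    "gene2\tprotC\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25.0",
]
annotated = best_hits(toy_hits)
```

Counting the keys of the resulting dictionary gives the number of genes with a significant match, which is the kind of figure reported as the proportion of genes matching UniProtKB.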

Finally, 11,478 genes (89.05%) in N. longispinosus matched the UniProtKB database. InterProScan identified 10,190 genes. InterProScan and eggNOG-mapper identified 9,218 GO terms and 4,813 KEGG terms.

Data Records

The raw sequencing reads and genome assembly have been deposited in the NCBI database under BioProject accession number PRJNA1245101 (ref. 44). The WGS (SRR35646020), Hi-C (SRR35646023), HiFi (SRR35646022), and RNA-seq (SRR35646021) data are publicly available under accession number SRP628852 (ref. 45). The complete genome assembly is accessible through NCBI with the accession number GCA_052756005.1 (ref. 46). Additionally, the genome assembly and annotation files have been made publicly available via Figshare47.

Technical Validation

Genome assembly evaluation

The completeness of the N. longispinosus genome assembly was evaluated using BUSCO against the arachnida_odb10 database. The assembly showed high completeness (96.4% complete BUSCOs), with 94.0% single-copy genes and 2.4% duplicated genes. This finding indicates that the majority of conserved core genes are present and correctly represented, reflecting high gene content integrity.

The continuity of the genome assembly was assessed using two metrics. First, the Hi-C anchoring rate reached 98.59%, indicating that nearly all the assembled sequences were successfully anchored to chromosome-level scaffolds. Second, the scaffold N50 value was 54.56 Mb, reflecting excellent contiguity, with long genomic regions assembled into highly continuous scaffolds.

The accuracy of the assembly was supported by the high remapping rates of multiple sequencing datasets. HiFi, Illumina, and RNA-seq reads all exhibited high alignment rates back to the assembled genome, demonstrating strong consistency between the assembly and the original sequencing data. This finding supports the reliability and precision of the final genome sequence.

In summary, the N. longispinosus genome assembly demonstrates high levels of completeness, continuity, and accuracy, resulting in a high-quality, chromosome-level genome resource suitable for downstream evolutionary and functional studies.