Background & Summary

The spotted wing drosophila (Drosophila suzukii Matsumura) is an invasive pest characterized by rapid reproduction and a wide dispersal range1,2. Its distinctive serrated ovipositor is large and robust, enabling the female to pierce the skin of ripening and ripe soft-skinned fruits and lay eggs inside them3,4. Larval development occurs entirely within the fruit, which makes infestation difficult to detect because no external signs are visible. Larval feeding causes internal damage and often leads to secondary bacterial infection, accelerating fruit decay and severely compromising quality, yield, and market value2,5. Owing to its high invasiveness, D. suzukii has emerged as a significant pest with devastating consequences for global fruit production, infesting the ripe fruits of nearly 200 plant species, including economically critical stone fruits and berries6.

D. suzukii was first described in a strawberry field in Japan in 19167, was recorded in China in 19378, and was later detected in Hawaii in 19809. Early reports indicated that D. suzukii did not cause significant economic losses, and it therefore attracted little attention at the time. However, D. suzukii was first reported in Europe and North America in 200810,11, and by the following year it had caused substantial damage and drawn significant concern. In North America, D. suzukii has been detected across the stone fruit-growing regions spanning from Southern California to British Columbia. It is estimated to cause a 50% yield loss in California raspberries, reducing total revenue by 37%, while processed strawberries could see a 20% drop in total income12. In 2009, it was also reported in several Mediterranean countries, including Spain, France, and Italy13, and it spread rapidly across multiple European countries in subsequent years. In 2010, D. suzukii caused losses of up to 80% in strawberry crops in certain areas of southern France. In Italy, losses in blueberries, blackberries, and raspberries ranged from 30% to 40%14. Since 2012, the pest has spread widely in South America and Africa13,14,15,16,17. It was first detected in Brazil and Uruguay in 2013 and in Argentina in 201415,18,19. In 2017, it was also reported in Morocco, where approximately 15% of small berries were infested20. Countries in Oceania, such as French Polynesia, have also reported infestations21. Currently, D. suzukii is reported as an invasive pest in over 50 countries worldwide21. Environmental suitability models for D. suzukii predict that the species may further invade several regions, including Australia22.

D. suzukii has high fecundity, a wide host range, and high dispersal potential, resulting in significant economic losses in fruit production1,11. High-quality genomic resources are crucial for advancing research to develop novel pest control strategies. A chromosome-level genome sequence of D. suzukii from Japan (NCBI assembly: GCF_043229965.1) was recently released. However, the analysis methods and data descriptions have not yet been fully detailed in a publication. Therefore, considering the ecological niche, geographic distribution, and invasion strategy of D. suzukii, it is important to resequence and analyze the genome with clearly documented methods.

Here, a chromosome-level genome of D. suzukii was constructed by integrating PacBio HiFi, Illumina, and Hi-C data. Non-coding RNAs, protein-coding genes, and repetitive elements were annotated. Furthermore, we investigated interspecific chromosomal variation between D. suzukii and D. melanogaster. This high-quality genome of D. suzukii serves as a valuable resource for advancing the study of the evolution and ecological roles of Drosophila.

Methods

Sample preparation and sequencing

D. suzukii adults were collected in June 2018 from a cherry orchard in Yantai (37.41°N, 121.82°E), Shandong Province, China. Male D. suzukii were collected after washing with PBS to avoid microbial contamination from body surfaces. Library construction and sequencing were performed at Berry Genomics. Genomic DNA was extracted following the CTAB method23. PacBio HiFi libraries with a 15 kb insert size were constructed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, California, USA) and sequenced on the PacBio Sequel II platform. For Illumina sequencing, libraries with a 350 bp insert size were prepared using the TruSeq DNA PCR-free kit and sequenced on the Illumina NovaSeq 6000 platform. Total RNA was extracted using the TRIzol™ Reagent, and the RNA library was constructed using the TruSeq RNA v2 kit. The Hi-C library was constructed by Berry Genomics Corporation. We collected a total of 78.56 Gb of sequencing data, comprising 37.22 Gb (236.54×) of Illumina reads, 12.28 Gb (70.05×) of PacBio HiFi reads, 16.31 Gb (103.67×) of Hi-C data, and 12.75 Gb of transcriptome data (Table 1). The PacBio HiFi reads had an N50 of 7.85 kb and an average length of 7.74 kb (Table S1).
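The fold-coverage values quoted above follow from dividing the number of sequenced bases by the genome size. The sketch below illustrates that arithmetic for the Illumina data set only. The function name is hypothetical, and using the 157.35 Mb assembly size as the denominator is an assumption, since the text does not state which genome size was used.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    # Convert Gb to Mb so numerator and denominator share a unit.
    return total_bases_gb * 1000.0 / genome_size_mb


# Assumption: the 157.35 Mb assembly size is the denominator.
illumina_depth = sequencing_depth(37.22, 157.35)
print(f"Illumina depth: {illumina_depth:.2f}x")
```

Under that assumption, the Illumina figure comes out close to the depth reported in the text.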

Table 1 Statistics of the sequencing data used for genome assembly.

Genome assembly

Initially, the Illumina sequencing data were processed for quality control using BBTools v38.8224. Two scripts, “clumpify.sh” and “bbduk.sh”, were used to remove duplicate reads and perform quality control, including removing bases with quality scores below 20, filtering sequences shorter than 15 bp, and trimming poly-A/G/C tails longer than 10 bp. Subsequently, we conducted a genome k-mer analysis using GenomeScope v2.025 with “-k 21 -p 2 -m 1000”. The genome size was estimated at 157.83 Mb, with a heterozygosity rate of 2.10% and 16.08 Mb (10.19%) of repetitive sequences. The red peaks indicated potential contamination within the data (Fig. 1).

Fig. 1

Genome survey of Drosophila suzukii based on 21-mers, estimated by GenomeScope. The vertical dotted lines mark the coverage peaks for heterozygous, homozygous, and duplicated sequences, respectively.
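The idea behind a k-mer genome survey can be illustrated with a much simpler estimate than the one GenomeScope computes. The sketch below counts k-mers, builds the multiplicity histogram, and divides the total number of non-error k-mers by the depth of the main peak. It is not GenomeScope's mixture model. The function names and the `min_depth` error cutoff are assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k=21):
    """Count every k-mer occurring in a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Peak-based estimate: total solid k-mers divided by the main peak depth.

    Multiplicities below min_depth are treated as sequencing errors
    (hypothetical cutoff, chosen for illustration).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(solid, key=lambda depth: solid[depth])
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: 500 error k-mers, a main peak at 20x, a repeat peak at 40x.
toy_histogram = Counter({1: 500, 20: 100, 40: 10})
toy_size = estimate_genome_size(toy_histogram)
```

In the toy histogram, the k-mers seen at twice the peak depth contribute twice to the estimate, which is how repeated sequence is accounted for in this simplified view.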

The initial assembly of PacBio HiFi reads was performed using Hifiasm v0.16.126 with the parameter “-l 2”, which applies an “aggressive” approach to removing redundant sequences. To minimize contamination and errors, only contigs with sequencing depths exceeding 10× were retained. To further improve assembly accuracy, Purge_Dups v1.2.527 was employed to remove redundant heterozygous sequences, using alignments generated by Minimap2 v2.2428. After quality control of the Hi-C reads with Juicer v1.6.229, the contigs were scaffolded using 3D-DNA v18092230 with default parameters, followed by manual correction of the assembly with Juicebox v1.11.0829. Similarity searches against the NCBI nucleotide and UniVec databases were performed using MMseqs2 v1331, with a sequence identity threshold of 0.8, to identify potential contaminants.

We obtained a chromosome-level genome assembly of D. suzukii that is 157.35 Mb in size and comprises 9 scaffolds and 209 contigs. The longest scaffold measured 33.65 Mb, while the longest contig was 7.75 Mb. The N50 values were 25.66 Mb for scaffolds and 1.74 Mb for contigs (Table 2). The assembly exhibited a GC content of 40.67%. Of the assembled sequence, 136.24 Mb (86.59%) was successfully anchored to 6 pseudo-chromosomes (Fig. 2). BUSCO analysis indicated a completeness score of 98.1%, including 97.4% single-copy, 0.7% duplicated, 0.1% fragmented, and 1.8% missing BUSCOs. The mapping rate was 95.15% for Illumina reads and 99.21% for PacBio reads.

Table 2 Genome assembly statistics for Drosophila suzukii.
Fig. 2

Genomic heatmap. Chromosome-level Hi-C heatmap of Drosophila suzukii, with each chromosome outlined in blue.
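The scaffold and contig N50 values reported above follow the standard definition: the length L such that sequences of length L or longer account for at least half of the total assembly. A minimal sketch of that calculation is given below. The function name and the toy lengths are illustrative and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be a non-empty list of positive numbers")


# Toy example with hypothetical contig lengths (arbitrary units).
toy_n50 = n50([2, 3, 4, 5, 6])
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs, scaffolds, or reads.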

Genome annotation

First, RepeatModeler v2.0.432 was run with the LTR search step enabled to create a de novo repeat library for D. suzukii based on the structural features of repetitive sequences and ab initio prediction. The resulting repeat library was combined with Dfam 3.533 and RepBase-2018102634 to generate a comprehensive repeat database. RepeatMasker v4.1.435 was then used to predict repeat sequences based on this database. The results revealed 147,215 repetitive sequences (27.39 Mb), representing 17.41% of the genome. The most prevalent categories were LTR transposons (5.41%), LINE elements (3.15%), simple repeats (2.03%), DNA transposons (1.81%), satellites (1.21%), and unclassified repeats (0.51%) (Table 3).

Table 3 Genome assembly and annotation statistics of Drosophila suzukii.

Protein-coding genes were annotated using MAKER v3.01.0336, which combined ab initio predictions, transcriptome data, and homologous proteins. Predictions generated by BRAKER v2.1.637 and GeMoMa v1.838, which integrate transcriptomic and protein evidence, were used as input for MAKER. Transcriptome reads were aligned to the genome using HISAT2 v2.2.039, generating BAM alignment files. BRAKER trained two ab initio tools, Augustus v3.3.440 and GeneMark-ES/ET/EP 4.68_3.60_lic3441, and integrated arthropod protein sequences from the OrthoDB10 v142 database to improve prediction accuracy. For transcript-based prediction of gene structures, StringTie v2.1.643 was used for reference-based transcriptome assembly from the HISAT2 BAM files. Homology-based gene prediction was performed using GeMoMa v1.8, which leverages protein homology and intron position information, with the parameters set as “GeMoMa.c = 0.4, GeMoMa.p = 10”. These predictions used protein sequences and annotation files from related dipteran species, including Anopheles gambiae (GCF_000005575.2), Aedes aegypti (GCF_002204515.2), Drosophila melanogaster (GCF_000001215.4), Drosophila simulans (GCF_016746395.2), and D. suzukii (GCF_037355615.1). We identified 14,742 protein-coding genes, with an average gene length of 5,323.2 bp (Table 4). Each gene contained an average of 4.4 exons (mean length: 474.8 bp), 3.4 introns (mean length: 1,161.5 bp), and 4.2 coding sequences (mean length: 390.2 bp). BUSCO analysis of the predicted protein-coding genes, using the insecta_odb10 dataset (n = 1,367), indicated a completeness score of 98.0%.

Table 4 Protein-coding genes annotation for Drosophila suzukii.
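Summary statistics such as the mean number of exons and the mean exon length can be derived directly from a GFF3 annotation file. The sketch below shows one way to do so under stated assumptions: exon features carry a single `Parent` attribute, and statistics are grouped by that parent, which may be a transcript rather than a gene. The function name and the toy records are hypothetical, and this is not necessarily how the values in Table 4 were computed.

```python
from collections import defaultdict


def exon_summary(gff_lines):
    """Return (mean exons per parent feature, mean exon length in bp)."""
    exons_by_parent = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and GFF3 directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        exons_by_parent[parent].append(int(cols[4]) - int(cols[3]) + 1)
    if not exons_by_parent:
        raise ValueError("no exon features found")
    all_lengths = [n for lengths in exons_by_parent.values() for n in lengths]
    mean_exons = len(all_lengths) / len(exons_by_parent)
    mean_length = sum(all_lengths) / len(all_lengths)
    return mean_exons, mean_length


# Toy annotation: transcript t1 has two 100 bp exons, t2 has one 50 bp exon.
toy_gff = [
    "##gff-version 3",
    "chr1\ttoy\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\ttoy\texon\t201\t300\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\ttoy\texon\t501\t550\t.\t-\t.\tID=e3;Parent=t2",
]
mean_exons, mean_length = exon_summary(toy_gff)
```

Intron statistics could be obtained the same way by taking the gaps between consecutive exons of each parent after sorting by start coordinate.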

Gene functions were annotated by searching the UniProtKB (SwissProt + TrEMBL) database using Diamond v2.0.11.14944 (‘--very-sensitive -e 1e-5’). Signaling pathways (KEGG, Reactome), Gene Ontology (GO) terms, protein domains, and other functional annotations were identified using InterProScan 5.53-87.045 to search the Pfam46, SMART47, Superfamily48, and CDD49 databases. EggNOG-mapper v2.1.550 was used to query the eggNOG v5.051 database. The final functional annotations were produced by integrating the results from all analyses. Of the 14,742 genes, 14,236 (96.57%) matched records in the UniProtKB database, and protein domains were identified for 11,559 genes using InterProScan. The combined InterProScan and eggNOG-mapper results identified 11,018 GO terms and 4,816 KEGG pathways, providing annotations for 10,239 genes with GO terms, 7,816 with KEGG KO terms, 2,730 with enzyme codes, 4,816 with KEGG pathways, and 11,949 with COG functional categories (Table 4). Genomic characteristics were visualized using TBtools v1.09876952 (Fig. 3).

Fig. 3

Genomic features of Drosophila suzukii. The circles, from inside to outside, represent simple repeats (Simple), long terminal repeats (LTR), long (LINE) and short (SINE) interspersed nuclear elements, DNA transposon density (DNA), GC content (GC), gene density (GENE), and chromosome length (Chr).

Non-coding RNA sequences were identified by searching against the Rfam v14.10 database53 using Infernal v1.1.454. Transfer RNAs were identified using tRNAscan-SE v2.0.955, and low-confidence tRNAs were filtered out using the “EukHighConfidenceFilter” script. A total of 726 non-coding RNAs were identified, including 112 ribosomal RNAs, 104 microRNAs, 125 small nuclear RNAs, 327 transfer RNAs, 2 ribozymes, and others. Small nuclear RNAs included 34 spliceosomal RNAs, 3 minor spliceosomal RNAs, 8 C/D box snoRNAs, and 30 H/ACA box snoRNAs (Table S2).

Chromosome synteny

To investigate interspecific chromosomal evolution, sequence similarity searches between D. suzukii and D. melanogaster were performed using MMseqs2 with an e-value threshold of 1e-5. Syntenic blocks were identified using MCScanX56 with the parameter ‘-s 5’, which requires a minimum of five genes per block to define collinearity. The syntenic relationships were visualized using TBtools. Comparative genomic analysis revealed extensive collinearity between the chromosomes of D. suzukii and D. melanogaster, with most chromosomes sharing a high degree of synteny. However, instances of chromosomal fusion and fragmentation were also observed, as were inversion events, primarily involving Dmel4 and Dsuz4 (Fig. 4).

Fig. 4

Chromosomal synteny blocks between Drosophila suzukii and Drosophila melanogaster.
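The notion of a collinear block with a minimum of five genes can be illustrated with a simplified greedy chaining of homologous gene pairs. This is not MCScanX's algorithm, which scores chains by dynamic programming. The sketch only shows what it means for anchors to be collinear in a consistent direction. The function name, the gene-order index representation, and the `max_gap` value are assumptions made for illustration.

```python
def collinear_blocks(anchors, min_genes=5, max_gap=25):
    """Greedily chain homologous gene pairs into collinear blocks.

    anchors: (pos_a, pos_b) gene-order indices for one chromosome pair.
    A chain grows while the next anchor lies within max_gap genes on both
    chromosomes and keeps the same orientation (forward or inverted).
    Chains with fewer than min_genes anchors are discarded.
    """
    blocks, current, direction = [], [], 0
    for a, b in sorted(anchors):
        if current:
            prev_a, prev_b = current[-1]
            step = b - prev_b
            sign = (step > 0) - (step < 0)
            close = 0 < a - prev_a <= max_gap and 0 < abs(step) <= max_gap
            if close and (direction == 0 or sign == direction):
                direction = sign
                current.append((a, b))
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        # Start a new chain at this anchor.
        current, direction = [(a, b)], 0
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Toy anchors: a forward block of six genes, one stray hit,
# and an inverted block of five genes.
toy_anchors = (
    [(i, i + 100) for i in range(6)]
    + [(50, 10)]
    + [(60 + i, 40 - i) for i in range(5)]
)
toy_blocks = collinear_blocks(toy_anchors)
```

In this toy input, the inverted block is the kind of signal that would appear as an inversion in a synteny plot, while the stray hit is dropped because it cannot be chained into a block of five genes.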

Compared with the previous genome version (GenBank: GCA_043229965.1), the genome we obtained exhibits a higher chromosome anchoring rate (Table 5). Chromosome synteny analysis revealed chromosome fusions in the previous version (Fig. S1), which were likely caused by assembly errors.

Table 5 Comparison of genome assembly and annotation results between the two versions of the Drosophila suzukii genome.

Data Records

Raw PacBio, Illumina, Hi-C, and transcriptome sequencing data of D. suzukii have been deposited in the National Center for Biotechnology Information (NCBI) under accession number SRP55537957. The assembled genome has been submitted to NCBI Assembly under accession number JBNGIB000000000.158. The annotation results for repetitive sequences, gene structure, and functional prediction are available in Figshare59.

Technical Validation

Two methods were used to evaluate the quality of the genome assembly. First, completeness was assessed using BUSCO v5.0.460 with the insect reference gene set (insecta_odb10, n = 1,367). The final assembly achieved a BUSCO completeness score of 98.1%. Second, assembly accuracy was evaluated by calculating read mapping rates. The mapping rate was 99.21% for PacBio reads and 95.15% for Illumina reads.
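The BUSCO completeness score is the sum of the single-copy and duplicated categories, and the four categories together should account for all searched genes. The short sketch below checks that arithmetic against the percentages reported for the assembly. The function name and the rounding tolerance are assumptions made for illustration.

```python
def busco_completeness(single, duplicated, fragmented, missing, tolerance=0.2):
    """Return the complete BUSCO percentage (single-copy + duplicated).

    Raises ValueError if the four categories do not sum to roughly 100%,
    allowing a small tolerance for values rounded to one decimal place.
    """
    total = single + duplicated + fragmented + missing
    if abs(total - 100.0) > tolerance:
        raise ValueError(f"BUSCO categories sum to {total:.1f}%, expected 100%")
    return round(single + duplicated, 1)


# Percentages reported for the genome assembly.
complete = busco_completeness(97.4, 0.7, 0.1, 1.8)
print(f"Complete BUSCOs: {complete}%")
```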