Background & Summary

Liriomyza spp. (Diptera: Agromyzidae) are economically important polyphagous insects that infest plants in both field and greenhouse conditions1. Originally from the Americas, Liriomyza has spread worldwide. The larvae tunnel through leaves, and adult females puncture leaf tissue for oviposition. These activities decrease photosynthesis and stimulate leaf drop, which reduces crop quality and yield2,3,4 (Fig. 1).

Fig. 1
Development cycle and damage of L. trifolii. (A–C) Developmental stages of L. trifolii: (A) larva; (B) pupa; (C) adult. (D) Damage symptoms caused by L. trifolii.

With the recent expansion of facility agriculture, the damage caused by Liriomyza spp. has become a serious problem. The three polyphagous species, L. trifolii, L. sativae, and L. huidobrensis, are invasive in China5, and recent ecological and molecular studies have shown that L. trifolii is the most competitive of the three species6,7,8. L. trifolii has continued to spread since its initial discovery in China9, but the underlying molecular mechanisms for its dominance among Liriomyza spp. remain unclear. The prevailing control strategy for managing L. trifolii is the use of insecticides10,11,12, which has led to interspecific competition, pesticide resistance and a growing need for more effective control methods8,11. Although genetic approaches for control are promising, high-quality genomic data are greatly needed to understand L. trifolii invasiveness.

This study describes the construction of a high-quality chromosome-level genome of L. trifolii by integrating PacBio high-fidelity (HiFi) and Illumina short reads with high-throughput chromosome conformation capture (Hi-C) data. The assembled genome comprised 166 contigs with a combined size of 122.64 Mb and a contig N50 of 1.66 Mb. Of this, 118.04 Mb was anchored to five chromosomes, giving a scaffold N50 of 23.84 Mb. A total of 11,883 protein-coding genes were predicted, and 95.78% of these were functionally annotated. Furthermore, we detected 20.60 Mb of repetitive sequences, accounting for 16.80% of the genome assembly. The high-quality genome assembly of L. trifolii described here provides crucial data for further research on this invasive insect pest.

Methods

Insect samples

The L. trifolii strain used in this study was derived from inbred laboratory strains and was reared on kidney beans under controlled conditions of 26 °C with a 16:8 h (light:dark) photoperiod13. To minimize sequence polymorphisms and achieve a high-quality genome assembly, samples were obtained from a single mating pair, and only newly emerged adults were selected for sequencing.

Genome sequencing

The QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) was used to obtain genomic DNA from a single surface-sterilized, newly emerged L. trifolii adult, and this DNA was used for both Hi-C and PacBio HiFi sequencing. Total RNA was extracted from L. trifolii with a TRIzol kit. The purity and integrity of the nucleic acids were assessed by spectrophotometry and agarose gel electrophoresis, respectively.

Paired-end libraries with 350-bp fragments were prepared and sequenced on the Illumina NovaSeq 6000 platform as recommended by the manufacturer. Low-quality reads and adapter sequences were removed using High-Throughput Quality Control (HTQC) software (version 1.92.310)14. For PacBio HiFi sequencing, genomic DNA was randomly sheared into ~15 kb fragments using Covaris g-TUBEs (Woburn, MA, USA) and purified with 0.45× AMPure® PB magnetic beads (Beckman Coulter, Brea, CA, USA). DNA fractions of 15–18 kb were recovered using the Sage ELF electrophoresis system (Sage Science, Beverly, MA, USA). Primers were annealed to the SMRTbell adapters on the DNA template, Sequel II DNA polymerase was bound, and sequencing was performed on 8M SMRT Cells with the Sequel II System (Biomarker Technologies Co., LTD, Beijing, China). This process yielded 5.87 Gb of circular consensus sequence (CCS) reads with a mean length of 14.5 kb, corresponding to 53× coverage of the L. trifolii genome. Hi-C libraries were constructed using standard protocols15 and sequenced on the Illumina NovaSeq 6000 platform, yielding 11.60 Gb of 150 bp paired-end clean reads.

Assembly of genome and survey of characteristics

A survey of genome characteristics is critical for assessing genome size and heterozygosity. Frequencies of k-mers (k = 19) were obtained and surveyed from Illumina short reads using Jellyfish v. 2.2.10 and GenomeScope v. 2.0, respectively16,17. Using this approach, the predicted size of the L. trifolii genome was 108.87 Mb, with a 30.11% repeat ratio, a 1.44% heterozygosity rate and a 31.27% GC content (Fig. S1).
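
The principle behind a k-mer survey can be illustrated with a minimal Python sketch. This is not the Jellyfish/GenomeScope pipeline used in the study: the function names, the toy histogram and the minimum-depth cutoff for discarding error k-mers are all assumptions made for illustration. The sketch approximates genome size as the total number of k-mers divided by the depth of the main peak of the k-mer frequency histogram.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer of length k across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_histogram(counts):
    """Map k-mer depth -> number of distinct k-mers observed at that depth."""
    return Counter(counts.values())

def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Depths below min_depth are treated as sequencing errors and ignored;
    this cutoff is an assumption of the sketch, not a value from the study.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth

# Toy histogram with hypothetical values (not data from this study).
toy_hist = {1: 500, 9: 40, 10: 100, 11: 40, 20: 5}
print(estimate_genome_size(toy_hist))
```

Model-based tools such as GenomeScope additionally fit heterozygous and repeat peaks, which is how heterozygosity and repeat ratios are obtained; the simple ratio above ignores both.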

An initial assembly of the L. trifolii genome was generated from PacBio long reads with WTDBG2 v. 2.518 using default parameters. After the assembly was polished with Illumina short reads using Pilon v. 1.2319, the L. trifolii genome comprised 166 contigs with a combined length of 122.64 Mb and a contig N50 of 1.66 Mb (Table S1). For Hi-C scaffolding, adapter sequences and low-quality reads were removed, and the resulting 11.60 Gb of clean Hi-C data were mapped to the preliminary L. trifolii genome using the Burrows-Wheeler Aligner (BWA) v. 0.7.1020 with default settings. Uniquely aligned pairs were further processed with HiC-Pro v. 2.10.021, which removes invalid read pairs, including dumped pairs, dangling ends and self-circles. A total of 19,398,203 valid interaction pairs were used with LACHESIS v. 2e27abb22 (default settings) to correct scaffolds and to order and orient contigs on chromosomes. A total of 127 sequences were anchored to five chromosomes with a scaffold N50 of 23.84 Mb; these sequences encompassed 118.04 Mb, or 96.25% of the draft genome (Fig. 2; Table S1). Sizes of the five chromosomes ranged from 16.26 to 39.54 Mb (Fig. 2). Among the sequences mapped to the chromosomes, those with a determined order and orientation spanned 117.60 Mb and accounted for 99.63% of the total mapped chromosomal sequences (Table S2).
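
For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows one common way to compute it. The function name and the toy contig lengths are hypothetical. The last lines recompute the anchoring rate from the two totals given in the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Toy contig lengths (hypothetical, not from this study).
toy = [10, 8, 6, 4, 2]
print(n50(toy))

# Anchoring rate from the totals reported in the text (Mb).
anchored_mb, assembly_mb = 118.04, 122.64
anchored_pct = round(100 * anchored_mb / assembly_mb, 2)
print(anchored_pct)
```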

Fig. 2
Hi-C interaction heatmap (A) and circular genome landscape (B) of L. trifolii. Color indicates the intensity of the interaction signal; the darker the color, the higher the intensity.

Annotation of repeat sequences

Repeat sequences in genomes primarily consist of tandem and interspersed repeats, with transposable elements (TEs) making up most of the latter. TEs in the L. trifolii genome were annotated with de novo and homology-based approaches. First, RepeatModeler v. 2.0.2a23 and LTR_retriever v. 2.824 were used with default settings to build a de novo repeat library. The predicted repeats were then classified with the PASTE Classifier v. 1.025 and integrated with the Dfam database v. 3.226 to generate a species-specific, non-redundant TE library. TEs were then detected by homology searching against this library with RepeatMasker v. 4.1023. Using this approach, 20.60 Mb of TE sequences were identified, representing 16.80% of the assembled genome (Table S3). Among retrotransposons, long terminal repeat (LTR) elements were the most abundant group and accounted for 6.92% of the genome, followed by long interspersed nuclear elements (LINEs) at 1.70% and short interspersed nuclear elements (SINEs) at approximately 0.03%; transposons accounted for 8.14% of the genome (Table S3). Additionally, 14.90 Mb (12.15%) of tandem repeats were detected with MISA v. 2.127 and Tandem Repeats Finder28 (Table S3).
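
Repeat totals of this kind are usually derived from annotation intervals. The Python sketch below shows one plausible calculation, assuming half-open (start, end) coordinates and assuming overlapping annotations are merged so that no base is counted twice. Whether the study merged overlaps in this way is not stated; the function names and toy intervals are hypothetical. The final lines recompute the reported TE percentage from the totals in the text.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Start a new merged block and bank the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered

def percent_of_genome(covered_bases, genome_length):
    """Covered bases as a percentage of genome length, to 2 decimals."""
    return round(100 * covered_bases / genome_length, 2)

# Toy annotations on a hypothetical 1,000-bp sequence.
toy = [(0, 100), (50, 150), (200, 250)]
print(percent_of_genome(merged_length(toy), 1000))

# TE share from the totals reported in the text (Mb).
te_pct = round(100 * 20.60 / 122.64, 2)
print(te_pct)
```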

Gene prediction and functional annotation

Three strategies were combined to predict protein-coding genes: ab initio prediction with Augustus v. 2.429 and SNAP30, homology-based prediction with GeMoMa v. 1.3.131, and unigene prediction based on transcriptome assemblies with PASA v. 2.0.232. Homology-based gene prediction used protein sequences from four insect species (Bactrocera cucurbitae, B. dorsalis, Drosophila melanogaster and D. suzukii), which were downloaded from InsectBase 2.033. EVidenceModeler v. 1.1.134 was then used to integrate the predictions into a unified gene set. A total of 11,883 protein-coding genes were annotated in the L. trifolii genome. For functional annotation, the predicted genes were searched against multiple databases, including KOG (EuKaryotic Orthologous Groups), NR (Non-Redundant), TrEMBL and KEGG (Kyoto Encyclopedia of Genes and Genomes), using BLAST v. 2.2.3135 with an E-value threshold of 1e−5. A total of 11,382 genes, representing 95.78% of the predicted protein-coding genes, were annotated in one or more databases (Table S4). Additionally, 9,671 protein-coding genes were assigned gene ontology (GO) terms and 9,236 were mapped to one or more KEGG pathways (Table S4).
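
The E-value cutoff can be made concrete with a short Python sketch. It assumes BLAST tabular output (-outfmt 6), in which the first column is the query ID and the eleventh column is the E-value. The function name, the gene IDs and the toy hits are hypothetical, and treating the threshold as inclusive is an assumption. The last lines recompute the annotation rate from the gene counts given in the text.

```python
def annotated_queries(blast_tab_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit whose E-value
    is at or below max_evalue.

    Expects BLAST tabular output (-outfmt 6): column 1 is the query ID
    and column 11 is the E-value.
    """
    hits = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits

# Hypothetical gene IDs and hits, for illustration only.
toy = [
    "gene1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene2\tsubjB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
    "gene3\tsubjC\t60.0\t80\t32\t0\t1\t80\t1\t80\t1e-5\t60",
]
found = annotated_queries(toy)
print(sorted(found))

# Annotation rate from the gene counts reported in the text.
annotation_pct = round(100 * 11382 / 11883, 2)
print(annotation_pct)
```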

Data Records

The Hi-C, raw Illumina and PacBio HiFi sequencing data for the L. trifolii genome have been deposited in the NCBI Sequence Read Archive (SRA) database under accession number SRP51001036. The final chromosome assembly is available in GenBank under accession no. JBHGZK00000000037. The genome annotation for L. trifolii has been uploaded to figshare (https://figshare.com/) with the identifier 2612243238.

Technical Validation

Validation of the genome assembly

Three independent methods were employed to evaluate the completeness and accuracy of the L. trifolii genome assembly. First, clean reads from Illumina sequencing were aligned to the genome assembly using the Burrows-Wheeler Aligner (BWA)20, and this analysis showed that 98.68% of the Illumina reads were correctly aligned with the genome assembly. Next, the Core Eukaryotic Genes Mapping Approach (CEGMA) database, which consists of 458 conserved eukaryotic genes, was used to assess the genome, and 100% (n = 458) of the genes were identified in the L. trifolii genome. Finally, genome assembly completeness was evaluated using BUSCO v. 2.516 with the insecta.odb10 database, and the results showed that 95.28% (3130/3285) of the conserved BUSCO genes were present as complete genes in the L. trifolii genome. Of the 3285 BUSCO genes, 70.05% were complete and single-copy, 25.24% were complete and duplicated, 0.33% were fragmented, and 4.38% were not detected.
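
The BUSCO percentages above follow from simple category counts, as the Python sketch below illustrates. Only the complete total (3130) and the dataset size (3285) are given in the text; the per-category counts used here (2301, 829, 11 and 144) are back-calculated from the reported percentages and should be read as assumptions, not reported data. The function name is hypothetical.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages (2 decimals)."""
    total = single + duplicated + fragmented + missing

    def pct(n):
        return round(100 * n / total, 2)

    return {
        "total": total,
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }

# Counts back-calculated from the reported percentages (assumed values).
summary = busco_summary(single=2301, duplicated=829, fragmented=11, missing=144)
print(summary)
```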

The quality of the chromosome assembly was further assessed by dividing the genome into 50 kb bins and using the intensity of interaction pairs between bins to generate heatmaps. The Hi-C heatmap showed greater interaction intensity along the diagonal than at off-diagonal positions for each of the five chromosomes (Fig. 2). These results indicate that the quality of the L. trifolii genome assembly is high.
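
The binning step can be sketched in Python as follows. This is a minimal illustration rather than the code used to draw Fig. 2: it handles a single chromosome, takes each valid pair as two base positions, and summarizes diagonal enrichment as the fraction of contacts within one bin of the diagonal. The function names, the toy pairs and the one-bin offset are assumptions.

```python
def contact_matrix(pairs, chrom_length, bin_size=50_000):
    """Build a symmetric contact matrix for one chromosome.

    pairs: iterable of (pos1, pos2) base positions of valid Hi-C pairs,
    each position in the range [0, chrom_length).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix

def near_diagonal_fraction(matrix, max_offset=1):
    """Fraction of contacts whose two bins are at most max_offset apart."""
    total = near = 0
    n = len(matrix)
    for i in range(n):
        for j in range(i, n):  # upper triangle counts each pair once
            total += matrix[i][j]
            if j - i <= max_offset:
                near += matrix[i][j]
    return near / total if total else 0.0

# Toy pairs on a hypothetical 200-kb chromosome.
toy_pairs = [(10_000, 20_000), (60_000, 110_000),
             (120_000, 130_000), (5_000, 190_000)]
m = contact_matrix(toy_pairs, chrom_length=200_000)
print(near_diagonal_fraction(m))
```

In a well-ordered assembly most contacts fall near the diagonal because interaction frequency decays with genomic distance, which is the pattern described for Fig. 2.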