Background & Summary

Water covers most of the Earth's surface, and many insects have adapted to aquatic ecosystems. Aquatic insects live in water during one or more phases of their life cycle; dragonflies and mayflies, for example, lay their eggs in water and move onto land only after they have developed and matured1,2. These aquatic insects play important ecological roles as primary consumers, detritivores, predators, and pollinators in both aquatic and terrestrial ecosystems. Aquatic insect groups evolved from terrestrial ancestors through adaptation to freshwater ecosystems3. Freshwater ecosystems exhibit a remarkable diversity of habitats, including ponds, lakes, and ditches. Colonization of these environments required significant evolutionary adaptations in various physiological and behavioral mechanisms, including thermo- and osmoregulation, respiration, and feeding strategies. However, research on insect aquatic adaptation remains limited, and the mechanisms driving physiological, morphological, and behavioral changes during the transition to aquatic environments are poorly understood.

The Chrysomelidae, commonly known as “leaf beetles”, comprise approximately 38,000 described species4. The majority of living leaf beetles feed on angiosperms, and many are considered agricultural pests, such as the Colorado potato beetle (Leptinotarsa decemlineata), western corn rootworm (Diabrotica virgifera), striped flea beetle (Phyllotreta striolata) and coconut leaf beetle (Brontispa longissima). D. provosti (Fairmaire, 1885) (Coleoptera: Chrysomelidae) is a damaging pest of aquatic crops, first recorded in Beijing, China, in 1885. Its distribution includes Russia, South Korea, Japan and China, and the species is spreading rapidly worldwide. In China, it is found from Hainan to Heilongjiang, with significant infestations in Hubei and Jiangsu. D. provosti primarily feeds on lotus and rice, causing notches, holes, and epidermal damage to lotus leaves. The larvae damage lotus stems and roots, resulting in dark brown spots, decay, and stunted growth, and may make the plant susceptible to fungal infections. In the 2000s, it caused 15%–20% losses in lotus root production in China5. Current control relies on costly chemical pesticides that pollute the water and pose a threat to human and livestock health. There is an urgent need for environmentally friendly management strategies to suppress this pest.

As sequencing technologies have advanced, they have been applied increasingly in entomological research, and several leaf beetle genomes have now been sequenced. This study presents the first whole-genome sequencing of D. provosti, yielding a high-quality chromosome-level reference genome that provides a valuable resource for future investigations into its ecological adaptations and for the development of pest control measures.

Methods

Insect collection and genomic sequencing

Approximately 200 D. provosti individuals were collected from Enshi City (30.25°N, 109.05°E), Hubei Province, and starved in the laboratory for 24 hours to minimize contamination from gut contents. The samples were then washed with double-distilled water (ddH2O) and ethanol to remove external contaminants, immediately flash-frozen in liquid nitrogen, and transferred to a −80 °C freezer for long-term storage. Genome sequencing of a female adult was performed on the PacBio Revio system with the SMRTbell Express Library Prep Kit, generating ~170 Gb of HiFi reads with a read N50 of ~14 Kb. For Illumina genome sequencing, three short paired-end DNA libraries with a 400-bp insert size were constructed using the TruSeq DNA PCR-Free Library Prep Kit (Illumina) according to the manufacturer’s instructions and sequenced on an Illumina NovaSeq X Plus platform. Total RNA was extracted from three adults, and three short paired-end libraries with a 400-bp insert size were constructed and sequenced on an Illumina NovaSeq X Plus platform. All sequencing work was performed at Berry Genomics Corporation.

Genome assembly and quality assessment

Genome size was estimated by k-mer frequency analysis of PacBio HiFi reads using Jellyfish (v2.1.3)6. K-mers were counted with the command jellyfish count -m 17 -C -s 100M -t 20, and the analysis gave an estimated genome size of 1.7 Gb.
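The step from k-mer counts to a genome size is not described in detail, so the following is only a minimal sketch of one common approach: divide the total number of k-mer occurrences by the depth of the main peak of the k-mer histogram. The function names, the error cutoff (min_depth) and the toy histogram are illustrative assumptions, not the authors' procedure; the input is assumed to be two-column "depth count" text such as that written by jellyfish histo.

```python
def parse_histo(text):
    """Parse two-column 'depth count' text into a {depth: count} dict."""
    hist = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff; it should be chosen from the data)
    """
    # Discard low-depth k-mers, which mostly come from sequencing errors.
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    # Take the main peak as the depth with the most distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers, not the D. provosti data).
toy = parse_histo("1 1000\n2 200\n9 10\n10 100\n11 10\n")
print(estimate_genome_size(toy))
```

In a highly heterozygous genome the tallest peak may be the heterozygous one, in which case this simple rule underestimates the depth and overestimates the size; dedicated model-fitting tools handle that case more carefully.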

The PacBio reads were assembled using Hifiasm (v0.20.0-r639)7 with the parameter -l 3, generating an initial assembly comprising 4,990 contigs with a total length of approximately 2.2 Gb and a contig N50 of 21 Mb. To eliminate redundant sequences, the assembly was further processed using purge_dups (v1.2.3)8 with parameters set to -2 -a 50. This purging step yielded a refined primary assembly with a total length of 1.76 Gb and an improved contig N50 of 27 Mb (Table 1).
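For readers unfamiliar with the N50 statistic reported here, a minimal sketch of its standard definition follows: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. The function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    # Sort from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that brings the running sum to half the total
        # defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not the D. provosti assembly).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```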

Table 1 Major indicators of the D. provosti genome.

Genome scaffolding

We sequenced an adult sample, generating approximately 150 Gb of Hi-C paired-end reads. Raw reads underwent quality control using fastp (v0.23.1)9, resulting in a Q30 score of 92.3%. Subsequently, these reads were aligned to the contig assembly using Bowtie2 (v2.4.1)10. Valid chromatin interactions were identified and filtered (removing multiple hits and singletons) using HiC-Pro (v2.11.0)11. Finally, contigs were anchored and oriented into 15 scaffolds (Fig. 1a,b) with YaHS (v1.2.2)12, corresponding to the 15 chromosomes of the species13. This chromosome-level genome assembly achieved a scaffold N50 of 71 Mb, with the longest scaffold measuring 127 Mb and the shortest 55 Mb. The Circos plot was drawn with TBtools (v2.326)14.

Fig. 1

Heatmap of genome-wide Hi-C data and circular representation of the chromosomes of D. provosti. (a) Heatmap of chromosome interactions in D. provosti, with densities calculated in 500 Kb windows. The frequency of Hi-C interaction links is represented by colours ranging from yellow (low) to red (high); (b) Circular representation of the chromosomes. The two tracks show the distribution of gene density (line plot) and TE density (bar plot), with densities calculated in 10 Kb windows; overlapping TEs in the RepeatMasker results were removed. Both gene density and TE density range from 0 to 100%.

Genome annotation

Transposable elements (TEs) were annotated with a repeat analysis pipeline. Initially, a de novo repeat library was constructed using RepeatModeler (v2.0.1)15 with the NCBI BLAST engine as the search algorithm (-engine ncbi). Subsequently, RepeatMasker (v4.0.5)16 was employed to identify TEs by integrating both the de novo repeat library and a TE database (RepBase 20170127). In total, 1.35 Gb of repeat sequence (76.6% of the 1.76 Gb genome) was identified (Table 1). Automated annotation methods can be limited in precision, particularly for genomes with high TE densities; together with the lack of curated insect TE databases, this can introduce biases in TE annotation.

For ab initio prediction of coding genes, AUGUSTUS (v2.7)17 was run on the repeat-masked genome sequence. For homology-based prediction, protein sequences of Chrysomelidae species were retrieved from the NCBI and UniProt databases. These sequences were mapped to the genome using exonerate (v2.4.0)18, and incomplete gene models lacking both start and stop codons were removed. Quality-controlled reads from the RNA libraries were mapped to the genome using Bowtie2 (v2.4.1)10, and StringTie (v2.1.1)19 was then employed to construct gene models based on these alignments. Finally, the gene predictions generated through these three approaches were integrated using EVidenceModeler (v1.1.1)20. The transcriptomic evidence was weighted by a factor of 10, ab initio evidence by a factor of 4, and homologous annotation evidence by a factor of 1. A total of 20,130 protein-coding gene models were predicted (Table 1). Functional annotation was performed by aligning all genes to four databases (NCBI-NR, KEGG, InterPro, and GO) with DIAMOND (v0.9.19)21 and InterProScan (v5.1)22.
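The evidence weights stated above can be written as an EVidenceModeler weights file, which has three tab-separated columns: evidence class, evidence source and weight. The sketch below generates such a file from the reported weights. The source labels (StringTie, Augustus, exonerate) and the choice of evidence class for each source are assumptions; the labels must match the source column of the GFF3 files actually supplied, which is not given in the text.

```python
def evm_weights(entries):
    """Render (evidence_class, source, weight) triples as weights-file text."""
    return "".join(f"{cls}\t{src}\t{weight}\n" for cls, src, weight in entries)


# Weights as reported in the text; source labels are hypothetical.
WEIGHTS = [
    ("TRANSCRIPT", "StringTie", 10),         # transcriptomic evidence
    ("ABINITIO_PREDICTION", "Augustus", 4),  # ab initio evidence
    ("PROTEIN", "exonerate", 1),             # homology-based evidence
]

print(evm_weights(WEIGHTS), end="")
```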

Phylogenetic analysis

Five additional Chrysomelidae species (Table 2), Acanthoscelides obtectus, Diabrotica virgifera, Diorhabda carinulata, Phaedon cochleariae and Phyllotreta striolata, were used to infer orthologous genes, and Tribolium castaneum was selected as the outgroup. The genome assemblies and annotations of A. obtectus23, D. virgifera24, D. carinulata25, P. cochleariae26, P. striolata27 and T. castaneum28 were downloaded from the NCBI genome database. Orthologous gene clusters were identified using the OrthoFinder algorithm. Subsequently, a species phylogenetic tree was constructed from conserved single-copy gene sequences using FastTree (v2.1.10)29 with the JTT + CAT model. Divergence times were then estimated, and gene family expansion and contraction were analyzed. All gene family-related analyses and visualizations were further facilitated by OrthoVenn330. The calibration times for divergence estimation were set according to data from TimeTree31, with a minimum of 152 million years ago (MYA) and a maximum of 236 MYA for the split between T. castaneum and A. obtectus. The results revealed that D. provosti diverged from its closest relative, A. obtectus, approximately 0.6 MYA, and from T. castaneum around 2.3 MYA (Fig. 2a). Through the analysis of gene family expansion and contraction, we identified 38 expanded gene families and 406 contracted gene families in D. provosti. GO enrichment analysis of the expanded gene families, conducted with the OmicShare tools32, revealed significant enrichment in vision-related and immune-related functions, such as response to blue light (GO:0009637) and mannan binding (GO:2001065) (Fig. 2b). GO enrichment analysis of the contracted gene families found only one significantly enriched GO term (GO:0003008, system process).
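The single-copy genes used for the species tree are those orthogroups in which every species contributes exactly one gene. As a minimal sketch of that selection, the code below filters a tab-separated gene-count table laid out like OrthoFinder's Orthogroups.GeneCount.tsv (an Orthogroup column, one column per species, and a Total column). The function name and the toy table are illustrative assumptions; OrthoFinder also reports single-copy orthogroups directly, so this only illustrates the criterion.

```python
import csv
import io


def single_copy_orthogroups(tsv_text):
    """Return IDs of orthogroups with exactly one gene in every species."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every column other than the ID and the row total is a species.
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    return [
        row["Orthogroup"]
        for row in reader
        if all(int(row[s]) == 1 for s in species)
    ]


# Toy table with three hypothetical species columns.
toy_counts = (
    "Orthogroup\tSpeciesA\tSpeciesB\tSpeciesC\tTotal\n"
    "OG0000001\t1\t1\t1\t3\n"
    "OG0000002\t2\t1\t1\t4\n"
    "OG0000003\t1\t0\t1\t2\n"
    "OG0000004\t1\t1\t1\t3\n"
)
print(single_copy_orthogroups(toy_counts))
```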

Table 2 Genomic data of five chrysomelid beetle species and T. castaneum.
Fig. 2

Phylogenetic analysis. (a) Phylogeny and orthology analyses of D. provosti and other Chrysomelidae species, with T. castaneum selected as the outgroup. The expanded (red) and contracted (blue) gene families are presented alongside the species and nodes; (b) GO enrichment of the expanded gene families of D. provosti.

Data Records

Raw sequence reads and the genome assembly are available at NCBI under BioProject PRJNA123808333. PacBio, Hi-C, Illumina and transcriptome sequencing reads have been deposited in the Sequence Read Archive (SRA) under accession number SRP57196134. The genome assembly has been deposited at NCBI under accession number JBMHBG00000000035. Datasets including the genome assembly and gene annotation are deposited on figshare36.

Technical Validation

The accuracy and completeness of the genome assembly and gene annotation were validated using several approaches. First, Illumina reads were quality-filtered using fastp (v0.23.1)9 (Q30 = 93.4%) and mapped to the assembled contigs using BWA (v0.7.5a)37; processing the resulting alignment with SAMtools (v1.19.2)38 revealed a mapping rate of 98.90% and a coverage rate of 97.9%. Second, RNA-Seq reads from three whole-body transcriptomes were quality-filtered using fastp (v0.23.1)9 (Q30 = 93–95%) and aligned to the genome assembly with HISAT2 (v2.2.1)39; the results indicated that >95% of reads aligned to coding regions. Third, BUSCO (v5.2.2)40 analysis of the genome assembly using the Insecta odb10 dataset (-l insecta_odb10 -m genome) identified 98.6% of expected insect single-copy orthologs as complete, including 1.8% duplicated, while 0.4% were fragmented. To further assess the completeness of the gene annotation, BUSCO (v5.2.2)40 was also run with the Insecta odb10 dataset in protein mode (-l insecta_odb10 -m prot), identifying 97.2% of conserved single-copy orthologs in the annotated protein set (97.2% complete, 1.7% duplicated, and 0.4% fragmented). RNA-Seq analysis of the three whole-body transcriptomes demonstrated expression of 18,320 (87%) annotated genes in at least one sample. Finally, homology searches against the NCBI-NR, KEGG, InterPro and GO databases revealed significant sequence similarity for 19,635 (98%) of the predicted gene models in at least one database.
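BUSCO percentages are related by simple arithmetic: complete = single-copy + duplicated, and complete + fragmented + missing = 100%. A minimal sketch follows that parses a one-line BUSCO summary and checks both relations. The example line is reconstructed from the genome-mode values above; the single-copy (96.8%) and missing (1.0%) values are derived here by subtraction and are not reported in the text, and n = 1367 is assumed to be the size of the insecta_odb10 dataset. The function names and the rounding tolerance are illustrative.

```python
import re

# Pattern for the one-line BUSCO summary, e.g.
# C:98.6%[S:96.8%,D:1.8%],F:0.4%,M:1.0%,n:1367
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Parse a one-line BUSCO summary into a dict of scores."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


def is_consistent(scores, tol=0.15):
    """Check C = S + D and C + F + M = 100, allowing for rounding."""
    complete_ok = abs(scores["C"] - (scores["S"] + scores["D"])) <= tol
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    return complete_ok and total_ok


# Reconstructed example (S and M derived by subtraction, see lead-in).
genome_summary = "C:98.6%[S:96.8%,D:1.8%],F:0.4%,M:1.0%,n:1367"
scores = parse_busco(genome_summary)
print(scores, is_consistent(scores))
```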