Background & Summary

Blister beetles (Coleoptera: Meloidae) are a highly diverse family within Polyphaga, with a wide global distribution1,2. Adult blister beetles are major pests of crops such as alfalfa, wheat, legumes, and nightshades, and can contaminate harvested forage, creating serious toxicity risks for horses and other livestock3,4. Although chemical insecticides remain the primary method for controlling these pests, their overuse has led to resistant populations, environmental contamination, and unintended impacts on beneficial organisms. RNA interference (RNAi)-based pest management strategies have emerged as a promising alternative.

The family Meloidae comprises approximately 130 genera and ~3,000 described species5,6. Taxonomic diversity is highly skewed, with nearly half of the species concentrated in just five genera7,8. Among these, Hycleus (Meloinae, Mylabrini) stands out as the most species-rich genus, containing over 450 described species and representing one of the most recently diverged lineages in the family3,9.

Like all members of the Meloinae, Hycleus adults feed primarily on the flowers and leaves of plants such as Desmodium spp. and sweet potato, while larvae exhibit highly specialized feeding behaviors, either feeding on locust eggs or parasitizing beehives3,10. These dietary transitions are associated with changes in sensory systems, particularly the olfactory and visual systems, which are critical for host recognition and foraging and have important implications for non-chemical, biological pest control strategies11,12. The genus Hycleus exhibits distinctive ecological and evolutionary traits, including specialized host-plant interactions and complex life-history strategies, making it a valuable model for studying speciation and insect-plant dynamics. While prior research has primarily explored the taxonomic structure, ecological niches, behavioral adaptations, and geographic ranges of these beetles, the molecular basis of these characteristics remains underexplored7,10,13,14.

Despite significant advances in sequencing technologies and the increasing availability of genomic data for non-model organisms, blister beetles (Meloidae) remain markedly underrepresented in genomic databases. Currently, only one of the ~3,000 described blister beetle species has a chromosome-level genome assembly available15. This lack of genomic resources on blister beetles hampers the study of these destructive pests. There is a critical need for high-quality genomic data to advance studies on blister beetle evolution, adaptation, and pest biology.

Here we present the first chromosome-level genome assembly and annotation of H. marcipoli. This high-resolution genome serves not only as a valuable resource for comparative genomic analyses but also as a foundational reference for future investigations into the evolutionary dynamics, functional genomics, and ecological roles of blister beetles.

Methods

Sample collection

We collected a total of 10 female and male H. marcipoli specimens from Huanjiang County (24°83’N, 108°21’E), Hechi City, Guangxi Province, China, in August 2019. Three adult female specimens were used for PacBio, Hi-C, and transcriptome sequencing. The abdomens of all specimens were removed before DNA extraction to avoid contamination from intestinal contents, and the remaining body tissues were used for genomic DNA extraction. Genomic DNA extraction and sequencing, as well as RNA sequencing, were carried out by Biomarker (Biomarker Technologies Co., LTD, Beijing, China).

DNA extraction and genome sequencing

High-quality DNA was extracted using the DNeasy Blood and Tissue Kit (QIAGEN Inc.). DNA quantity and quality were then measured using a 2100 Bioanalyzer (Agilent) and a Qubit 3.0 Fluorometer (Invitrogen), with integrity confirmed via 1% agarose gel electrophoresis. For PacBio long-read sequencing, the DNA was purified using AMPure PB beads, and the final high-quality gDNA was used for library construction. The PacBio SMRTbell library was constructed using the SMRTbell® Express Template Prep Kit 3.0. Qualified libraries were evenly loaded onto a SMRT Cell and sequenced on the Sequel II system. The Hi-C library was constructed according to the standard protocols described previously16 and sequenced on the Illumina NovaSeq 6000 platform at 183x depth. In total, 4.60 Gb of raw PacBio continuous long reads and 21.15 Gb of Hi-C data were generated (Table 1).
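The relationship between data volume and sequencing depth quoted above can be illustrated with a short calculation. The sketch below is only a back-of-the-envelope check, not part of the published workflow: the function name is hypothetical, and the "implied genome size" is an inference from the 21.15 Gb and 183x figures rather than a value reported in this section.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return mean fold coverage, defined as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures taken from the text above.
hic_bases = 21.15e9      # 21.15 Gb of Hi-C data
hic_depth = 183          # reported Hi-C depth (183x)
pacbio_bases = 4.60e9    # 4.60 Gb of raw PacBio continuous long reads

# Genome size implied by the Hi-C figures (an inference, not a reported value).
implied_genome_size = hic_bases / hic_depth

# PacBio depth under the same implied genome size (also an inference).
pacbio_depth = sequencing_depth(pacbio_bases, implied_genome_size)

print(f"Implied genome size: {implied_genome_size / 1e6:.1f} Mb")
print(f"Approximate PacBio depth: {pacbio_depth:.1f}x")
```

The same function can be reused with any other genome size estimate to see how the depth figure changes.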

Table 1 Sequencing raw data of the H. marcipoli assembly.

RNA extraction and transcriptome sequencing

Total RNA was extracted from a single adult female specimen, without biological replication, using the standard CTAB-LiCl method17. RNA integrity was then assessed on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The cDNA library was built using the TruSeq RNA Sample Prep Kit v2 and sequenced on the Illumina NovaSeq 6000 platform. A total of 6.72 Gb of RNA-seq data was generated (Table 1). Low-quality sequences and adapter contamination in the whole-genome sequence data from the above steps were filtered using Trimmomatic v.0.3918.

Genome assembly

Quality control of the raw Illumina data was performed using fastp v0.23.219 with default parameters. To estimate the genome size of H. marcipoli, we applied KmerGenie20 to the PacBio reads, which gave an estimate of approximately 128.52 Mb (Table 2). We assembled the PacBio reads using Flye v2.3.5b21 with default parameters and used Purge Haplotigs22 to identify and remove redundant contigs. The initial contig assembly was 126.63 Mb. After removing redundancy and potential contaminants, we obtained an optimized genome of 111 Mb distributed across 168 contigs, with a contig N50 of 4.65 Mb and a scaffold N50 of 10.22 Mb (Table 2). Prior to scaffolding, the high-quality Hi-C reads were aligned to the draft genome using BWA v0.7.1723 and processed with Samtools v1.1424. The draft genome was then scaffolded from the Hi-C data with HapHic25, followed by manual adjustment in Juicebox v2.1526. Finally, 92.97% of the contigs (107.29 Mb) were anchored to 11 chromosomes, with chromosome lengths ranging from 6,843,577 bp to 141,844,471 bp (Fig. 1).
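The contig and scaffold N50 values reported above follow the usual definition: the length L such that sequences of length at least L together cover at least half of the assembly. The sketch below shows one common way to compute it. The function name and the toy lengths are hypothetical and are not taken from the H. marcipoli assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length; the
    length of the sequence that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [8, 6, 5, 4, 3, 2, 2]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA index, with contig lengths giving the contig N50 and scaffold lengths the scaffold N50.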

Table 2 Genome assembly and annotation statistics of H. marcipoli.
Fig. 1

(a) Hi-C contact map showing chromosome-level assembly validation of the H. marcipoli genome. The heatmap displays interaction frequencies between genomic regions, with darker colors indicating higher contact probabilities; (b) Distribution of contigs along the 11 chromosomes of H. marcipoli.

Repeat annotation

The Earl Grey pipeline (v4.1.0)27 was used to identify repetitive elements. Approximately 30.50 Mb of the genome was identified as repetitive sequences, constituting 26.43% of the entire genome (Fig. 2). Transposable elements (TEs) occupy 7.62% of the genome, with DNA elements being the dominant TE type at 5.41%, followed by long terminal repeats (LTRs) at 1.68%, long interspersed nuclear elements (LINEs) at 0.52%, and short interspersed nuclear elements (SINEs) at 0.02%. Notably, 14.35% of the repeat sequences were unclassified (Table 3).
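The percentages above can be cross-checked with simple arithmetic. The sketch below is only an illustrative consistency check using the figures quoted in this paragraph. The variable names are hypothetical, and the implied genome length is derived from the repeat totals rather than read from the assembly itself.

```python
# Per-class TE percentages of the genome, as quoted in the text.
te_classes = {"DNA": 5.41, "LTR": 1.68, "LINE": 0.52, "SINE": 0.02}

# Sum of the four classified TE classes, for comparison with the 7.62% total.
classified_te_pct = sum(te_classes.values())

# Total repeat content quoted in the text.
repeat_mb = 30.50
repeat_pct = 26.43

# Genome length implied by those two numbers (an inference, not a reported value).
implied_genome_mb = repeat_mb / (repeat_pct / 100)

print(f"Sum of classified TE classes: {classified_te_pct:.2f}%")
print(f"Implied genome length: {implied_genome_mb:.1f} Mb")

# Convert each class percentage into an approximate length in Mb.
for name, pct in te_classes.items():
    print(f"{name}: ~{implied_genome_mb * pct / 100:.2f} Mb")
```

The summed class percentages differ from the stated 7.62% only in the second decimal place, which is the kind of gap expected when each class is rounded independently.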

Fig. 2

Schematic representation of the genomic characteristics of H. marcipoli. (1) Gene density; (2) GC content density; (3) Repeat element density.

Table 3 Annotation and percentage of repeat sequences in genome from Earl Grey.

Protein-coding gene annotation

For gene annotation, we combined three strategies: ab initio prediction, homologous gene comparison, and transcriptome-based annotation. Ab initio prediction was performed using BRAKER v2.1.528, which automatically trained Augustus v3.3.429 and utilized both transcriptome data and protein homology information. The RNA-seq alignments in BAM format were generated with HISAT2 v2.2.030, while protein sequences were retrieved from the OrthoDB10 v131 database. For transcript assembly, the mapped transcriptome data were further processed with StringTie v2.1.432. For homology-based annotation, gene sets from five annotated species in Tenebrionoidea (Tribolium madens33, Tenebrio molitor34, Zophobas morio35, Tribolium castaneum36, and Asbolus verrucosus37) were downloaded. Of these, three are the closest related species published to date. The downloaded protein sequences were aligned against the H. marcipoli genome assembly using BLASTP38, and gene structures were identified using GeneWise. Finally, we used the EVidenceModeler (EVM) pipeline v1.1.139 to integrate the results from the three strategies. We identified a total of 13,357 protein-coding genes, with an average gene length of 4,401 bp (Fig. 2). Further analysis of gene structure revealed a total cDNA length of 20.55 Mb, with the longest cDNA being 44,727 bp and the average cDNA length being 1,538 bp. The total protein length was 6.85 million amino acids, with the longest protein being 14,909 amino acids and the average protein length being 513 amino acids (Table 4). Among the 13,357 protein-coding genes, the average number of exons per gene was about 5, the average exon length was 398.87 bp, and the average total intron length per gene was approximately 2,420 bp, with a single intron averaging 620 bp (Tables 5, 6). The predicted genes were functionally annotated by alignment against the NR (non-redundant) and SwissProt databases and with InterProScan and EggNOG-mapper.
Based on gene functional annotation, 12,944 genes were annotated in at least one database, accounting for 96.91% of the total predicted genes (Table 1). BUSCO analysis (Insecta_odb10)40 recovered 98.80% of the BUSCO genes, further supporting the accuracy and completeness of the gene prediction (Fig. 4).
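The summary statistics in this section are linked by simple arithmetic: totals are approximately the gene count multiplied by the per-gene averages, and the annotation rate is the ratio of annotated to predicted genes. The sketch below recomputes them from the rounded values quoted in the text, so small differences from the reported totals are expected. All variable names are hypothetical.

```python
# Values quoted in the text.
n_genes = 13_357          # predicted protein-coding genes
avg_cdna_bp = 1_538       # average cDNA length (bp)
avg_protein_aa = 513      # average protein length (amino acids)
n_annotated = 12_944      # genes annotated in at least one database

# Totals implied by the rounded averages (approximate by construction).
total_cdna_mb = n_genes * avg_cdna_bp / 1e6
total_protein_maa = n_genes * avg_protein_aa / 1e6

# Share of predicted genes with at least one functional annotation.
annotation_rate_pct = n_annotated / n_genes * 100

print(f"Total cDNA length: ~{total_cdna_mb:.2f} Mb")
print(f"Total protein length: ~{total_protein_maa:.2f} million aa")
print(f"Functional annotation rate: {annotation_rate_pct:.2f}%")
```

Recomputing derived figures this way is a quick sanity check when annotation tables are regenerated after a pipeline change.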

Table 4 Statistics of protein-coding gene annotations of H. marcipoli genome.
Table 5 Statistics of protein-coding gene structures of H. marcipoli.
Table 6 Number of protein-coding genes in different sequences of H. marcipoli genome and percentage of gene characteristics.

Data Records

We have uploaded the raw sequencing data (including PacBio, Hi-C, and transcriptome data) to the NCBI database. The BioProject accession number is PRJNA1225931 and the BioSample accession number is SAMN46911070. The RNA-seq data are available in the NCBI Sequence Read Archive (SRA) under accession number SRR32479383 (ref. 41). The genomic PacBio sequencing data are available in the SRA under accession number SRR32489734 (ref. 42), and the Hi-C sequencing data under accession number SRR32479382 (ref. 43). The final genome assembly was deposited in GenBank under accession number GCA_051167335.1 (ref. 44). Annotations of repeat sequences and gene structures are available in the Figshare database45.

Technical Validation

To validate the accuracy of the H. marcipoli genome, we mapped our transcriptomic data to the assembly and obtained a 99.47% mapping rate, which supports the high quality of the genome. The BUSCO v5.2.2 assessment (Insecta_odb10) indicated a high completeness of 99.8% (Fig. 4). We further validated the accuracy and reliability of the H. marcipoli gene structures by comparing their length distributions with those of other annotated species; the consistent patterns across all species support the accuracy of our gene annotation (Fig. 3). Overall, the evaluation results indicate that our H. marcipoli genome assembly is complete, accurate, and of high quality.

Fig. 3

Comparison of the distributions of gene length, CDS length, exon length, and intron length between H. marcipoli and other annotated species. The x-axis represents length and the y-axis represents gene density. Hma, H. marcipoli; Tca, Tribolium castaneum; Tmo, Tenebrio molitor; Tma, Tribolium madens.

Fig. 4

BUSCO assessments of assembly and annotation.