Background & Summary

During the long-term process of coevolution, a plant–insect arms race has developed between herbivorous insects and plants1. Plants can defend themselves chemically through secondary metabolites to prevent herbivory2,3,4,5,6, while insects have evolved corresponding counterstrategies6,7,8,9. Buddleja sp. is a toxic plant, whose sap can kill or paralyze fish and contains a variety of substances, including flavonoids and terpenoids. Species of the genus Sambus Deyrolle, 1864 feed on the flowers and leaves of Buddleja sp., exhibiting strong host plant specificity. This characteristic makes Sambus an ideal model for studying the adaptive mechanisms of insects in relation to their host plants.

The genus Sambus belongs to the subfamily Agrilinae of the family Buprestidae (order Coleoptera). This genus was transferred from Coraebini to Agrilini in 200010; however, its tribal status remains controversial11,12,13,14. It comprises approximately 150 known species worldwide, most of which are distributed in Southeast Asia and East Asia. Sambus kanssuensis Ganglbauer, 1890 (Fig. 1) is widely distributed in western Sichuan and southern Gansu, China. Its host is the toxic plant Buddleja sp.15,16,17,18,19,20. This buprestid species exhibits significant sexual dimorphism: the frons is copper green in males and purple bronze in females, and females are distinctly larger than males. The genome was sequenced using adults of S. kanssuensis collected from Kangding City, Sichuan Province, China.

Fig. 1

The habitus of Sambus kanssuensis. (A) male, (B) female.

Currently, the genomes of only four species of Buprestidae have been sequenced and assembled21,22, and none of these genomes has been annotated. In the present study, we sequenced, assembled and annotated the chromosome-level genome of S. kanssuensis. The assembled genome is 312.42 Mb in size and comprises 206 scaffolds, with an N50 of 34.04 Mb. A total of 12,723 protein-coding genes (PCGs) were identified. The completeness of the genome assembly and annotation is 97.90% and 96.10%, respectively, based on Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis. This high-quality chromosome-level genome of S. kanssuensis will promote research on the taxonomy, ecology and evolution of the jewel beetles, as well as on the detoxification mechanisms of herbivorous insects.

Methods

Sample collection

Adult specimens of S. kanssuensis were collected from the plant Buddleja sp. at Paomashan Park (30.04288°N, 101.96951°E, elev. 2740 m) in Kangding City, Sichuan Province, China, on July 3, 2024. The collected specimens were temporarily stored in liquid nitrogen. After return to the laboratory, the specimens were stored in an ultra-low temperature freezer at –85 °C. To prevent genetic contamination, the abdomen of each specimen was removed. Tissues from the head and thorax were used for genomic DNA extraction.

Genome sequencing

Total genomic DNA was extracted from 35 male adults using the cetyltrimethylammonium bromide (CTAB) method23. After removing impurities, the genomic DNA was sequenced using Illumina and PacBio technologies. The quality of the extracted DNA was assessed using 0.7% agarose gel electrophoresis, and the concentration of genomic DNA was quantified using a Qubit 3.0 fluorometer (Invitrogen, USA). For PacBio HiFi sequencing, genomic DNA fragments underwent damage repair, adapter ligation, and fragment selection prior to the construction of the DNA library. The PCR-free SMRTbell (Single Molecule, Real-Time) library was constructed and sequenced on the PacBio Revio sequencing platform. Adapter sequences and low-quality reads were removed using High-Throughput Quality Control (HTQC) v1.92.31024 with default parameters. Data quality control and statistical analyses were performed using the PacBio software SMRT Link v12.0 (--min-passes 3 --min-rq 0.99), resulting in the final valid data. In total, 17.67 Gb of HiFi reads were obtained (total number: 949,468, average length: 18,607.6 bp, N50 length: 18,692 bp) and subsequently used for genome assembly. For short-read sequencing, the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, USA) was used to construct an Illumina sequencing library with an insert size of 150 bp. High-throughput sequencing was performed on the Illumina NovaSeq6000 platform (Illumina, San Diego, USA). The raw reads were filtered, resulting in 21.63 Gb of clean data.
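The reported HiFi yield can be cross-checked from the read count and the average read length, since total bases equal the number of reads multiplied by the mean length. The following is a minimal sketch of that arithmetic; the function name `yield_gb` is hypothetical and is not part of SMRT Link or any other tool named above.

```python
def yield_gb(read_count, mean_length_bp):
    """Return total sequenced bases in gigabases (1 Gb = 1e9 bp)."""
    if read_count < 0 or mean_length_bp < 0:
        raise ValueError("read count and mean length must be non-negative")
    return read_count * mean_length_bp / 1e9


# Values taken from the HiFi statistics reported in the text.
hifi_gb = yield_gb(949_468, 18_607.6)
print(f"HiFi yield: {hifi_gb:.2f} Gb")
```

With the reported values, the product rounds to 17.67 Gb, in agreement with the stated yield.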

High throughput Chromosome Conformation Capture (Hi-C) technology was employed to facilitate chromosome-level genome assembly and to capture chromatin interactions throughout the entire genome. The DpnII enzyme was used to digest the purified cell nuclei. Following this, Hi-C samples were generated through a series of procedures, including end repair, biotin labelling, blunt-end ligation, DNA purification, and random shearing into fragments ranging from 300 to 700 bp. Sequencing libraries were prepared using the Plus DNA Library Prep Kit, with insert sizes ranging from 200 to 400 bp. After passing quality control, the libraries were sequenced on the Illumina NovaSeq6000 platform, generating 150 bp paired-end reads. Ultimately, we obtained 46.87 Gb of raw Hi-C reads for assembly.

For full-length transcriptome sequencing, total RNA was extracted from the head tissue of 13 adult females using the Qiagen Kit (Qiagen Sciences, USA), following the manufacturer’s instructions. The SQK-PCS109 kit (Oxford Nanopore Technologies, UK) was used for library construction. A specific concentration and volume of the cDNA library were then added to the flow cell, which was subsequently transferred to the Oxford Nanopore PromethION sequencer for real-time single-molecule sequencing. The data were filtered to remove sequences with an average quality score of less than or equal to 7. Finally, 12.96 Gb of valid data and 13,085,109 bp of total bases were obtained. The N50 and N90 of read length were 1,139 bp and 555 bp, respectively.

Survey of genome characteristics and genome assembly

To assess the genome size and heterozygosity of S. kanssuensis, we surveyed genomic features using k-mer analysis25. The 19-mer frequency distribution (Supplementary Table 1) was calculated using Jellyfish v2.2.1026. Subsequently, the genome size and heterozygosity were estimated using GenomeScope v2.027. The predicted genome size of S. kanssuensis was 325.99 Mb, with a heterozygous ratio of 0.96, a duplication rate of 40.62%, and a GC content of 32.28% (Fig. 2).
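The idea behind a k-mer survey can be illustrated with a simple peak-based estimate: genome size is approximated as the total number of error-free k-mers divided by the depth of the main coverage peak. The sketch below is a simplified illustration only. GenomeScope fits a statistical mixture model instead of picking a single peak, and the function names, the `min_depth` cutoff and the toy histogram are assumptions made for this example. The parser assumes a two-column "depth count" text histogram, one pair per line.

```python
def parse_histogram(lines):
    """Parse 'depth count' pairs (one per line) into {depth: count}."""
    histogram = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        histogram[int(parts[0])] = int(parts[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Peak-based estimate: total solid k-mers / depth of the main peak.

    min_depth is a hypothetical cutoff that discards low-depth k-mers,
    which mostly arise from sequencing errors.
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not data from this study).
toy = ["1 1000", "2 50", "19 100", "20 300", "21 100"]
size, peak = estimate_genome_size(parse_histogram(toy))
print(size, peak)
```

This simple estimate ignores heterozygous and repeat peaks, which is why model-based tools are preferred for real data.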

Fig. 2

The GenomeScope profile of Sambus kanssuensis.

The PacBio HiFi reads were converted to FASTA format using bam2fasta integrated in SAMtools v1.928. Genome assembly was performed using hifiasm v0.19.8-r60329, which is based on the overlap-layout-consensus (OLC) method. The primary assembly was polished with NextPolish v1.1.030. Purge_dups v1.2.531 was then applied to perform haplotype separation, resulting in the final draft genome. Minimap2 v2.17-r94132 was used to map reads during the redundancy removal and short-read polishing steps. ALLHiC v0.9.833 (-e GATC) was employed to assist in chromosome-scale assembly. Two software tools, 3D-DNA v20100834 (-q 30) and Juicer v1.635 (-g matrial -s MboI -t 30 -S early), were used to anchor primary contigs into chromosomes. Juicebox v1.11.082636 (Coverage) was used to visualize Hi-C contact maps and to correct errors manually. The completeness of the genome assembly was evaluated using BUSCO v5.4.737 (-evalue 1e-05) against the insecta_odb10 database. The percentage of complete BUSCOs (C) was 97.9%, reflecting a high level of completeness. The draft chromosome-level genome was 312.42 Mb, including 206 scaffolds, with an N50 of 34.04 Mb, a largest contig size of 41.54 Mb, and a GC content of 31.69% (Table 1). The genome of S. kanssuensis is slightly smaller than that of Agrilus biguttatus (368.10 Mb)21, but larger than that of Agrilus cyanescens (292.3 Mb)22. Following Hi-C scaffolding, 98.68% of the genome was anchored to 11 pseudochromosomes, as determined from the chromatin interaction heatmap (Fig. 3). The total length of the pseudochromosomes was 309.02 Mb, with individual lengths ranging from 11.70 Mb to 41.54 Mb (Table 2). Among them, the X chromosome was 11.70 Mb in length and comprised 46 contigs.
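The N50 statistic quoted for the assembly is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below shows one common way to compute it and the related N90. The function name `nx` and the toy scaffold lengths are illustrative assumptions and are not the values in Table 2.

```python
def nx(lengths, x=50):
    """Return the Nx statistic (N50 by default) for sequence lengths.

    Sequences are sorted from longest to shortest, and the function
    returns the length at which the running total first reaches x% of
    the summed length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length
    return min(lengths)  # not reached for valid input


# Toy scaffold lengths in Mb (hypothetical).
toy_lengths = [41, 34, 30, 20, 11]
print(nx(toy_lengths, 50), nx(toy_lengths, 90))
```

For the toy input, the two longest scaffolds already cover more than half of the 136 Mb total, so the N50 equals the length of the second scaffold.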

Table 1 Summary statistics of genome assembly and annotation in Sambus kanssuensis.
Fig. 3

The chromatin interaction heatmap among 11 chromosomes (A) and circular genome landscape (B) of Sambus kanssuensis. From outside to inside, the circles represent chromosome (a), gene density (b), tandem repeat density (c), and GC content (d).

Table 2 The length and contig number of chromosomes in Sambus kanssuensis.

Genome annotation

The repetitive elements of the S. kanssuensis genome were identified using a combination of de novo and homology-based annotation methods. Tandem repeats and interspersed repeats were the predominant repeat sequences in the genome. Tandem repeat prediction was performed using TRF v4.0938 and MISA v2.139. For interspersed repeats, RepeatMasker v4.1.5 (-noLowSimple -pvalue 0.0001; http://repeatmasker.org) was used to align against public databases for homology-based annotation, while LTR_retriever v2.9.840 (-threads 16 -noanno) and RepeatModeler v2.0.541 (-database mydb -threads 16) were employed for de novo annotation. A total of 10.31 Mb of repeat sequences were identified, accounting for 48.05% of the S. kanssuensis genome, which included 36.26% interspersed nuclear elements, 10.52% long terminal repeats and other sequences (Supplementary Table 2). Among the interspersed repeats, retroelements constituted 8.16%, while DNA transposons accounted for 28.1%. Unclassified repeats comprised 3.96% of the total genome.

Both noncoding RNA genes (ncRNAs) and small nuclear RNA genes (snRNAs) were identified in the S. kanssuensis genome (Supplementary Table 3). The ncRNAs include microRNA genes (miRNAs), ribosomal RNA genes (rRNAs), and transfer RNA genes (tRNAs). In this study, miRNAs, snRNAs, and rRNAs were detected using the Rfam database (release 13.0)42 and the program Infernal v1.1.443, while tRNAs were predicted using tRNAscan-SE v2.0.12 (-E -j tRNA.gff -o tRNA.result -f tRNA.struct --thread 16). The ncRNAs were annotated using Infernal v1.1.4 (--cut_ga --rfam --nohmmonly --fmt 2) and RNAmmer v1.244 (-S euk -m tsu,lsu,ssu). The numbers of miRNAs, tRNAs, rRNAs, and snRNAs were 53, 1862, 200 and 34, respectively. The rRNAs included 172 large subunit rRNAs (5S, 5.8S and 28S rRNAs) and 28 small subunit rRNAs (18S rRNAs). The tRNAs had 21 isotypes. The snRNAs included 9 CD-box, 3 HACA-box, 21 splicing and 1 scaRNA.

The PCGs were annotated using an integrated strategy that combined transcriptome-based, ab initio and homology-based prediction. For transcriptome-based prediction, the full-length transcript data from Oxford Nanopore Technologies were processed using TransDecoder to predict coding frames. PacBio sequences were processed from subreads, and circular consensus sequence (CCS) reads were identified using CCS in SMRT Link. Then, IsoSeq v3 (https://github.com/PacificBiosciences/IsoSeq) was employed for full-length identification, error correction, and clustering. The error-corrected, non-redundant full-length sequences were aligned using pbmm2, and the results were further refined and reconstructed into transcripts using IsoSeq. For ab initio prediction, the programs AUGUSTUS v3.5.045 (--uniqueGeneId=true --noInFrameStop=true --gff3=on --strand=both), GENSCAN v1.046 and GlimmerHMM v3.0.447 (-f -g) were used for gene annotation. For homology-based prediction, miniprot v0.1348 (--gff -Iut50) was used to identify the PCGs of S. kanssuensis based on known sequences from Coccinella septempunctata, Harmonia axyridis, Tribolium castaneum, and Ulomoides dermestoides. The results from the above methods were integrated using EVidenceModeler v1.1.149 to generate the final gene set. The transcriptome annotation served as expressed sequence tag (EST) evidence, the homology prediction results provided protein homology evidence, and the combined de novo annotation results were used as input for gene prediction. The S. kanssuensis genome contains 12,723 PCGs, with 73,788 exons and 61,065 introns (Table 1). The average lengths of mRNA and coding sequences (CDS) per gene were 13,101.27 bp and 1,468.96 bp, respectively.
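The exon and intron totals can be related by a simple rule: a transcript with n exons has n − 1 introns, so when each gene is represented by one transcript, the intron total equals the exon total minus the gene total. The following sketch counts exons per parent transcript from GFF3 lines and applies that rule. The toy GFF3 records and function names are hypothetical and do not come from the deposited annotation.

```python
from collections import Counter


def exon_counts_by_parent(gff3_lines):
    """Count exon features per Parent ID in tab-separated GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attributes.get("Parent")
        if parent:
            counts[parent] += 1
    return counts


def intron_count(exon_counts):
    """A transcript with n exons contains n - 1 introns."""
    return sum(n - 1 for n in exon_counts.values())


# Toy records: transcript t1 has three exons, t2 has one.
toy_gff3 = [
    "##gff-version 3",
    "chr1\ttoy\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\ttoy\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\ttoy\texon\t300\t400\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\ttoy\texon\t800\t900\t.\t+\t.\tID=e3;Parent=t1",
    "chr1\ttoy\texon\t2000\t2500\t.\t-\t.\tID=e4;Parent=t2",
]
counts = exon_counts_by_parent(toy_gff3)
print(dict(counts), intron_count(counts))

# The reported totals follow the same rule: 73,788 - 12,723 = 61,065.
print(73_788 - 12_723)
```

The reported figures are therefore internally consistent with one transcript model per gene.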

Gene functional annotation was carried out by querying the UniProtKB (SwissProt + TrEMBL) databases using Diamond v2.1.850 (--evalue 1e-05). Protein domain and Gene Ontology (GO) annotations were retrieved through eggNOG-mapper v2.0.14551 with the eggNOG v5.0 database, as well as by running InterProScan v5.60-92.04652 against the Pfam53, Smart54, Gene3D v21.055, Superfamily, and Conserved Domains Database (CDD) collections. A total of 11,977 genes were functionally annotated, representing 94.14% of the 12,723 predicted PCGs. The annotated gene set includes 9,333 KEGG pathway terms and 9,707 GO items (Table 3). To date, the S. kanssuensis genome is the first genome in the family Buprestidae with both gene annotation and functional annotation.
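A gene is usually counted as functionally annotated when it has a hit in at least one database, so the overall count is the size of the union of per-database hit sets. The sketch below illustrates this under that assumption; the gene IDs, database labels and function name are hypothetical, and the source does not state exactly how the per-database results were merged.

```python
def annotated_fraction(all_genes, hits_by_database):
    """Return (number annotated, percentage) for genes with any hit."""
    all_genes = set(all_genes)
    if not all_genes:
        raise ValueError("gene set is empty")
    annotated = set()
    for hits in hits_by_database.values():
        # Ignore IDs that are not part of the predicted gene set.
        annotated |= set(hits) & all_genes
    return len(annotated), 100.0 * len(annotated) / len(all_genes)


# Toy example with hypothetical gene IDs.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = {
    "UniProtKB": {"g1", "g2"},
    "GO": {"g2", "g3"},
    "KEGG": {"g3", "g9"},  # g9 is not a predicted gene and is ignored
}
print(annotated_fraction(genes, hits))

# Reported values: 11,977 annotated out of 12,723 predicted PCGs.
print(round(100 * 11_977 / 12_723, 2))
```

With the reported counts, the ratio rounds to 94.14%, matching the stated annotation rate.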

Table 3 Statistics of gene functional annotation in Sambus kanssuensis.

Data Records

The raw sequencing and genome assembly data of S. kanssuensis were deposited in NCBI. The PacBio, Illumina, Hi-C, and transcriptome data can be found under accession number SRP55981656. The BioProject accession number is PRJNA121300857. The assembled genome is available in NCBI under accession number GCA_047651835.158. Additionally, the genome annotation data have been deposited in the Figshare database59.

Technical Validation

Assessment of the genome assembly and annotation

The completeness of the chromosome-level genome assembly and annotation was assessed using BUSCO (Supplementary Table 4). The complete BUSCOs (C) were 97.90% for the assembly and 96.10% for the annotation. The duplication rate of genome annotation was 48.05%. The assembled genome was also evaluated based on the mapping rate and coverage, which were calculated using BWA v0.7.1760. The mapping rate and coverage were 99.71% and 96.12%, respectively.
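BUSCO reports its result as a one-line summary in which complete BUSCOs (C) are the sum of single-copy (S) and duplicated (D) ones, and C, fragmented (F) and missing (M) together account for all n orthologs searched. The sketch below parses such a line and checks those relations. The summary string uses made-up percentages and not the values in Supplementary Table 4, and the function names and tolerance are assumptions for illustration.

```python
import re

# Pattern for a BUSCO one-line summary such as
# "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:1367".
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return a dict with percentages C, S, D, F, M and the count n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def is_consistent(scores, tolerance=0.2):
    """Check C = S + D and C + F + M = 100, allowing for rounding."""
    complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tolerance
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


# Toy summary line (hypothetical percentages).
toy = "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:1367"
scores = parse_busco_summary(toy)
print(scores["C"], scores["n"], is_consistent(scores))
```

A check of this kind can help catch transcription errors when BUSCO percentages are copied into summary tables.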