Background & Summary

The Scarabaeidae family, commonly known as scarab beetles, is one of the most diverse lineages within the order Coleoptera, comprising approximately 35,000 described species1. These beetles occur on all continents except Antarctica and occupy a wide range of ecosystems, including forests, grasslands, deserts, and agricultural landscapes2. Based on their feeding strategies, scarab beetles are broadly classified into two groups: coprophagous (dung-feeding) and phytophagous (plant-feeding) species3,4. The coprophagous group, also known as dung beetles, includes two ecologically and economically significant subfamilies: Scarabaeinae and Aphodiinae5,6. Among them, members of Scarabaeinae are considered the “true dung beetles,” primarily using fresh dung both as a food source and for reproduction7. In contrast, phytophagous scarabs, commonly known as Pleurosticti8, comprise more than 20,000 species, representing nearly 70% of the entire Scarabaeidae family4,9. These beetles feed on a wide array of plant materials, including leaves, roots, decaying wood, and partially decomposed plant litter10. Such a diet requires specialized digestive adaptations to break down complex plant cell wall components, including lignocellulose and various hemicelluloses, which are chemically resistant to degradation11.

High-quality genome assemblies are essential for uncovering the genetic mechanisms underlying feeding adaptations in Scarabaeidae beetles. A recent study generated a chromosome-scale genome assembly and comprehensive intestinal transcriptome for Trypoxylus dichotomus (Coleoptera: Scarabaeidae), offering valuable insights into its ability to digest lignocellulose-rich plant material12. As of April 2025, a total of 62 Scarabaeidae genomes have been deposited in the NCBI database. However, only 13 species have been reported at the chromosome level. The majority of the remaining genomes, assembled from short-read sequencing data, are of limited quality, with scaffold N50 values typically below 100 kb. This shortage of high-quality genomic resources continues to hinder in-depth research on environmental adaptation, dietary specialization, and the evolutionary diversification of Scarabaeidae.

To enhance our understanding of adaptive evolution and ecology in this group, we assembled a chromosome-level genome of Kibakoganea sinica Bouchard, 2005 (Coleoptera: Scarabaeidae) by combining PacBio HiFi, Illumina, and Hi-C data. We also annotated repeats, non-coding RNAs, and protein-coding genes. This high-quality genome assembly of K. sinica provides a valuable resource for studying the evolution and ecological adaptation of the Scarabaeoidea superfamily.

Methods

Sample collection and sequencing

A K. sinica pupa was collected in Guizhou, China, on December 6, 2023, and used for genome sequencing, including Illumina, PacBio, Hi-C, and RNA sequencing. To minimize contamination, the sample was carefully rinsed in phosphate-buffered saline for 10 minutes, flash-frozen in liquid nitrogen for 20 minutes, and subsequently stored at –80 °C until further processing.

Genomic DNA and RNA were isolated from the specimen using the DNeasy Blood & Tissue Kit (Qiagen) and TRIzol Reagent (Thermo Fisher Scientific), respectively, in accordance with the manufacturers’ instructions. Short-read libraries were prepared without PCR amplification using the Illumina TruSeq DNA PCR-Free Kit, generating 150 bp paired-end reads with 350 bp inserts. For Hi-C sequencing, we followed a standard protocol13, including DNA crosslinking, MboI digestion, end repair, and DNA purification. All short-read sequencing was conducted on an Illumina NovaSeq 6000 system. For long-read sequencing, we constructed a 20 kb SMRTbell library (PacBio SMRTbell Express Template Prep Kit 2.0) and sequenced it on the PacBio Sequel II system in HiFi mode. Library construction and sequencing were conducted at Berry Genomics (Beijing, China). Our sequencing generated a total of 160.95 Gb of high-quality data, including 36.70 Gb (61.02×) of PacBio HiFi reads, 56.09 Gb (93.26×) of Illumina short reads, and 58.56 Gb (97.36×) of Hi-C data (Table 1).
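The fold-coverage values quoted above follow from dividing each data yield by the genome length. The sketch below reproduces that arithmetic. It assumes the denominator is the 601.44 Mb final assembly size reported in the Genome assembly section; the text does not state which genome length was used, and the function and variable names are illustrative only.

```python
# Sequencing depth = total sequenced bases / genome length.
# Yields (Gb) are the values reported in the text. The 601.44 Mb denominator
# is an assumption: it is the final assembly size reported later in the paper.
ASSEMBLY_SIZE_MB = 601.44

YIELDS_GB = {
    "PacBio HiFi": 36.70,
    "Illumina": 56.09,
    "Hi-C": 58.56,
}


def coverage(yield_gb: float, genome_mb: float = ASSEMBLY_SIZE_MB) -> float:
    """Return fold coverage for a data yield in Gb and a genome size in Mb."""
    return yield_gb * 1000.0 / genome_mb


if __name__ == "__main__":
    for library, gb in YIELDS_GB.items():
        print(f"{library}: {coverage(gb):.2f}x")
```

Under this assumption the computed depths agree with the reported values to within rounding of the published yields.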

Table 1 Statistics of the sequencing data used for genome assembly.

Genome assembly

Raw Illumina reads were processed for quality control using BBTools v38.8214. Duplicate reads were first removed with “clumpify.sh”. Subsequently, bbduk.sh was applied to trim low-quality bases (Q < 20) and adapter sequences according to strict quality criteria. Specifically, sequences with quality scores below 20 were discarded, reads containing more than five Ns were filtered out, poly-A/G/C tails longer than 10 bp were trimmed, and overlapping paired reads were corrected. To estimate the genome size, heterozygosity, and repetitive sequence content in the K. sinica genome, a genome survey was conducted using GenomeScope v2.015. The estimated genome size ranged from 567.23 to 568.25 Mb, with repetitive sequences accounting for 31.57–31.58% of the genome. Additionally, the survey indicated a high heterozygosity rate, estimated at 2.11–2.13% (Fig. 1).
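For intuition, the principle behind a k-mer-based genome survey can be sketched in a few lines. This is a deliberately simplified first-order estimate (total k-mers divided by the homozygous peak depth), not GenomeScope's actual mixture-model fit. The function names, the `min_depth` error cutoff, and the toy histogram in the usage note are assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return a histogram mapping depth -> number of distinct k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, peak_depth, min_depth=2):
    """First-order haploid genome size estimate.

    histogram  : dict of depth -> number of distinct k-mers at that depth
    peak_depth : depth of the homozygous k-mer peak
    min_depth  : k-mers below this depth are treated as sequencing errors
                 (hypothetical cutoff chosen for this sketch)
    """
    total_kmers = sum(depth * n for depth, n in histogram.items()
                      if depth >= min_depth)
    return total_kmers / peak_depth
```

As a usage example, a toy histogram `{1: 1000, 20: 500, 40: 100}` with a homozygous peak at depth 20 yields an estimate of 700 bases; the depth-1 k-mers are excluded as likely errors.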

Fig. 1 Genome size estimation for Kibakoganea sinica based on GenomeScope.

The initial genome assembly was generated from PacBio HiFi long reads using Hifiasm v0.19.816 under default parameters. To remove redundant haplotigs arising from heterozygosity, we applied Purge_Dups v1.2.517 with a haploid cutoff of 70 (-s 70). For chromosome-scale scaffolding, Hi-C reads were first quality-filtered and then aligned to the assembly using Juicer v1.6.218. Contigs were subsequently anchored and ordered into chromosomes using 3D-DNA v.18092219. The final assembly was manually verified and corrected in Juicebox v.1.11.018 to resolve potential misjoins or orientation errors. To ensure the assembly’s purity, we screened for contaminants using MMseqs2 v1.120 against the NCBI nucleotide (nt) and UniVec databases, removing any detected foreign sequences. Potential vector contaminants were identified using BLAST+ v2.11.021 against the UniVec database, with sequences showing > 90% similarity flagged as contaminants. Additional sequences exhibiting > 80% similarity were further validated through BLASTN searches against the NCBI nucleotide database (nt). All identified bacterial and fungal contaminants were removed from the assembly scaffolds. Telomeric regions of each chromosome were identified using the TeloExplorer module in QuarTeT v1.2.122. The presence of continuous telomeric repeat motifs (TTAGG) within 10,000 bp of both ends of a chromosome was used as the criterion for confirming telomere localization (Table 3). The final chromosome-scale assembly of K. sinica spans 601.44 Mb, consisting of 23 scaffolds and 70 contigs, which is broadly consistent with the genome size estimated in the genome survey. The assembly shows high contiguity, with scaffold and contig N50 values of 60.23 Mb and 24.49 Mb, respectively (Table 2). Notably, 99.57% of the assembly (598.84 Mb) was successfully anchored to 10 chromosomes, with individual chromosome lengths ranging from 16.33 Mb to 103.53 Mb (Table 3; Fig. 2; Fig. 3). Chromosome names were assigned by sequence length, with the longest sequence labeled as chromosome 1 (Table 3). Moreover, the BUSCO assessment revealed a genome assembly completeness of 99.2% (Table 2). Collectively, these findings indicate that our genome assembly achieves high contiguity and completeness.
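The telomere criterion described above (tandem TTAGG repeats within 10,000 bp of each chromosome end) can be illustrated with a minimal sketch. This is not the TeloExplorer implementation. The minimum number of consecutive copies (`MIN_COPIES`) is a hypothetical threshold chosen for illustration, since the text does not give one, and the function names are likewise assumptions.

```python
import re

WINDOW = 10_000   # search window at each chromosome end (from the text)
MOTIF = "TTAGG"   # telomeric repeat unit (from the text)
MIN_COPIES = 4    # assumption: hypothetical minimum number of tandem copies


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(end_seq, motif=MOTIF, min_copies=MIN_COPIES):
    """True if end_seq contains a tandem run of the motif on either strand."""
    end_seq = end_seq.upper()
    for unit in (motif, revcomp(motif)):
        # Builds a pattern such as (?:TTAGG){4,}
        if re.search(f"(?:{unit}){{{min_copies},}}", end_seq):
            return True
    return False


def telomere_status(chrom_seq, window=WINDOW):
    """Return (start_has_telomere, end_has_telomere) for one chromosome."""
    return has_telomere(chrom_seq[:window]), has_telomere(chrom_seq[-window:])
```

A chromosome for which `telomere_status` returns `(True, True)` would meet the both-ends criterion under these assumed thresholds.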

Table 2 Genome assembly statistics for Kibakoganea sinica.
Table 3 Summary of telomere information for the Kibakoganea sinica genome.
Fig. 2 The chromosomal heatmap visualization of the Kibakoganea sinica genome assembly displays complete chromosomes in blue, with individual contigs demarcated by green borders.

Fig. 3 The genomic features of Kibakoganea sinica are displayed in a circular layout. Moving inward from the outermost ring, the visualization depicts (1) chromosome length, (2) GC content, (3) gene density, and (4) various repetitive elements, including transposable elements (DNA, SINEs, LINEs, and LTRs), along with simple repeat sequences.

Genome annotation

The species-specific repeat library of K. sinica was generated using RepeatModeler v2.0.423 and integrated with known repeats from RepBase-2013090924 and Dfam 3.525 to construct a comprehensive repeat database. The custom repeat database was employed as input for RepeatMasker v4.1.426 to systematically identify and mask repetitive elements throughout the genome, followed by soft-masking of these regions. The analysis revealed that repetitive sequences account for 44.43% of the K. sinica genome assembly. These elements were classified into major categories, including unclassified elements (18.09%), LINE transposons (6.81%), LTR transposons (8.53%), DNA transposons (17.00%), and other repeat types (Table 4).
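Because soft-masking encodes repeats as lowercase bases, the overall repeat fraction can be cross-checked directly from a soft-masked FASTA file. The sketch below counts lowercase characters across all sequence lines. The function name is illustrative, and the RepeatMasker summary table remains the authoritative source for the per-class figures.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of lowercase (soft-masked) bases in FASTA lines.

    fasta_lines: any iterable of text lines, e.g. an open file handle.
    Header lines (starting with '>') and blank lines are skipped.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0
```

For example, `masked_fraction(open("genome.softmasked.fa"))` (a hypothetical file name) would be expected to return a value close to the reported repeat content if the masking was applied as described.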

Table 4 Genome assembly and annotation statistics of Kibakoganea sinica.

Non-coding RNAs (ncRNAs) in K. sinica were identified using Infernal v1.1.427 with the Rfam v14.10 database28, while tRNA detection was performed with tRNAscan-SE v2.0.929. The analysis revealed a diverse ncRNA repertoire of 596 ncRNAs in total, including 312 tRNAs, 101 rRNAs, 74 microRNAs, and 69 small nuclear RNAs (Table 4).

Protein-coding gene annotation of the K. sinica genome was performed using MAKER v3.01.0330, integrating transcriptomic evidence, ab initio predictions, and protein homology data. Transcriptome sequences were aligned to the genome using HISAT2 v2.2.131, followed by genome-guided assembly with StringTie v2.1.632. For ab initio gene prediction, BRAKER v2.1.633 was employed, incorporating GeneMark-ES/ET/EP 4.68_lic34 and Augustus v3.4.035, both of which were trained using transcriptomic sequences and protein data from OrthoDB v1136. Additionally, homology-based gene prediction was conducted using GeMoMa v1.937, utilizing protein sequences from five reference species: Drosophila melanogaster (GCF_000001215.4)38, Apis mellifera (GCA_003254395.2)39, Coccinella septempunctata (GCA_907165205.1)40, Prosopocoilus inquinatus (GCA_036172665.1)41, and Tribolium castaneum (GCA_031307605.1)42 (Table 5). The annotation pipeline identified 12,940 protein-coding genes in the K. sinica genome, with an average gene length of 14,792.6 bp (Table 4). On average, each gene contained 6.3 exons, 5.3 introns, and 6.1 coding sequences (CDS). Gene structure analysis revealed mean lengths of 357.3 bp (exons), 2,500.2 bp (introns), and 272.5 bp (CDS). To assess the quality of gene predictions, we evaluated gene set completeness using BUSCO with the Insecta dataset (n = 1,367). The results showed 80.6% (1,102) single-copy, 18.5% (253) duplicated, 0.8% (10) fragmented, and 0.1% (2) missing BUSCOs, indicating a highly complete gene set.
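The gene structure summary above (exons and introns per gene, mean feature lengths) can be derived from exon coordinates alone. The sketch below shows one way to compute such statistics. It assumes a single transcript per gene and 1-based inclusive coordinates as in GFF3; the input structure and function name are hypothetical rather than part of the annotation pipeline.

```python
from statistics import mean


def gene_structure_stats(genes):
    """Summarise exon/intron structure.

    genes: dict of gene_id -> list of (start, end) exon coordinates,
           1-based inclusive, one transcript per gene (an assumption).
    """
    exon_counts, exon_lengths, intron_lengths = [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        exon_counts.append(len(exons))
        exon_lengths.extend(end - start + 1 for start, end in exons)
        # An intron is the gap between two consecutive exons.
        intron_lengths.extend(
            next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        )
    return {
        "mean_exons_per_gene": mean(exon_counts),
        "mean_introns_per_gene": mean(n - 1 for n in exon_counts),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }
```

For a single transcript, the intron count is always one less than the exon count, which matches the reported averages of 6.3 exons and 5.3 introns per gene.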

Table 5 Species taxonomic information and accession code of all samples used in this study.

Functional annotation was performed by aligning protein sequences against the UniProtKB database using DIAMOND v2.0.1143. Additionally, Gene Ontology (GO) terms, KEGG/Reactome pathways, and protein domains were annotated using eggNOG-mapper v2.0.1444 and InterProScan 5.53-87.045. The InterProScan analysis integrated data from five databases: Pfam46, SMART47, Superfamily48, Gene3D49, and CDD50. Based on the integration of InterProScan and eggNOG annotations, COG categories were assigned to 11,414 genes, GO terms to 10,333 genes, and KEGG pathways to 5,009 genes in K. sinica. Chromosomal features, including repeat elements, gene density, and GC content, were visualized using TBtools51.

Data Records

The sequencing data generated in this study are available under the following National Center for Biotechnology Information (NCBI) SRA accession numbers: transcriptome reads (SRR31019928)52, Hi-C data (SRR31019929)53, Illumina short reads (SRR31019930)54, and PacBio HiFi long reads (SRR31019931)55. The final genome assembly is available under NCBI accession GCA_043790905.156. Genome annotation data, including repetitive elements, gene structure predictions, and functional annotations, have been deposited in Figshare57.

Technical Validation

We evaluated genome assembly quality using two complementary approaches. First, assembly completeness was assessed with BUSCO v5.0.458 against the Insecta reference set (n = 1,367 conserved single-copy orthologs). The assembly exhibited a BUSCO completeness of 99.2%, with 95.2% single-copy, 4.0% duplicated, 0.2% fragmented, and 0.6% missing genes. Second, assembly accuracy was assessed by aligning PacBio, Illumina, and RNA-seq reads to the final assembly using Minimap2 v. 2.2359 and SAMtools v. 1.960 and calculating mapping rates. The assembly showed high mapping rates for PacBio (99.85%), Illumina (89.71%), and RNA-seq (93.71%) reads. Together, these analyses support the high quality of our genome assembly.
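BUSCO reports its results as a compact one-line summary, and a small parser makes it easy to check that the categories are internally consistent. The sketch below parses a summary string in BUSCO's short-summary notation. The example string in the usage note is reconstructed from the percentages reported above rather than copied from the authors' output, and the function names and rounding tolerance are assumptions.

```python
import re

# Matches BUSCO's one-line notation, e.g.
# C:99.2%[S:95.2%,D:4.0%],F:0.2%,M:0.6%,n:1367
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Parse a BUSCO summary line into a dict of percentages plus n."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    result = {key: float(value) for key, value in match.groupdict().items()
              if key != "n"}
    result["n"] = int(match.group("n"))
    return result


def is_consistent(summary, tol=0.15):
    """Check that C = S + D and C + F + M = 100, within rounding tolerance."""
    complete_ok = abs(summary["C"] - (summary["S"] + summary["D"])) <= tol
    total_ok = abs(summary["C"] + summary["F"] + summary["M"] - 100.0) <= tol
    return complete_ok and total_ok
```

Applied to `"C:99.2%[S:95.2%,D:4.0%],F:0.2%,M:0.6%,n:1367"`, the parser returns the four categories and `is_consistent` confirms that single-copy plus duplicated equals the complete fraction and that all categories sum to 100%.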