Background & Summary

Ladybird beetles, members of the Coccinellidae family within the order Coleoptera, encompass over 6,000 species globally1. Predominantly predatory, ladybird beetles are natural enemies of agricultural pests such as aphids, mealybugs, scale insects, and mites2,3. However, there are also phytophagous and mycophagous species, with the phytophagous ones having the potential to inflict substantial damage on economically significant crops4. Ladybirds are also a focal point of chemical ecology research, given that many species exhibit aposematic coloration and secrete toxic alkaloid compounds when disturbed5. Due to their diverse forms, behaviors, and ecological roles in agriculture, ladybird beetles are extensively studied as model organisms in ecology and evolutionary biology6,7.

Serangium japonicum is a predatory species with high prey intake, a brief generation cycle, an extended adult lifespan, and substantial reproductive capacity8. It effectively targets several whitefly species, including Bemisia tabaci9,10,11, Dialeurodes citri12, and Aleurocanthus camelliae13. This predatory behavior makes S. japonicum a beneficial insect in agriculture and underscores its role in integrated pest management14. Chromosome-level genome assemblies are available in the NCBI database for twelve species of Coccinellidae (accessed June 2024). However, no genome assembly has been reported for S. japonicum, and no chromosome-level genome has been reported for any Serangium species.

To better understand the genetic basis of S. japonicum’s adaptability and predatory behavior, we assembled a chromosome-level genome of S. japonicum by integrating PacBio HiFi, Illumina, and Hi-C data. We comprehensively annotated repeats, non-coding RNAs (ncRNAs), and protein-coding genes. This high-quality genome is a substantial advance for the study of Coccinellidae and provides a resource for investigating the evolutionary trajectory and ecological roles of the species.

Methods

Sample collection and sequencing

The S. japonicum samples used in this study were collected in May 2022 from Baiyun Mountain in Guangzhou, Guangdong Province, and were reared under controlled conditions for 10 generations (26 ± 1 °C, 70%–75% relative humidity, 14 L:10 D photoperiod). Adults were thoroughly rinsed with phosphate-buffered saline and then flash-frozen in liquid nitrogen.

Genomic DNA was extracted using the FastPure® Blood/Cell/Tissue/Bacteria DNA Isolation Mini Kit (Vazyme Biotech Co., Ltd, Nanjing, China), while RNA was extracted with TRIzol reagent (YiFeiXue Tech, Nanjing, China). A total of 12 adults of mixed sex were used for transcriptome sequencing. For Hi-C sequencing, 6 adults of mixed sex were used, and library construction involved formaldehyde cross-linking, chromatin digestion with the restriction enzyme MboI, end repair, DNA cyclization, and DNA purification15. All short-read libraries were sequenced on the Illumina NovaSeq6000 platform. For PacBio sequencing, DNA was extracted from 10 adults of mixed sex, and a library with an insert size of 20 kb was prepared using the SMRTbell™ Express Template Prep Kit 2.0. This library was sequenced on the PacBio Sequel II platform in HiFi mode. All library preparation and sequencing procedures were performed by Berry Genomics (Beijing, China). In total, sequencing yielded 39.97 Gb (59.81×) of PacBio HiFi reads, 110.14 Gb (164.80×) of Hi-C reads, 17.41 Gb of RNA-seq data, and 11.56 Gb of RNA-ONT data (Table 1). The PacBio HiFi reads had a read N50 of 15.08 kb and an average length of 15.17 kb.

Table 1 Statistics of the genome sequencing data for Serangium japonicum.

Estimation of genomic characteristics

We analyzed the genomic features of S. japonicum using a K-mer approach, with K-mer counts generated by BBTools v38.8216 (K = 21). This analysis estimated a genome size of approximately 422.19 Mb and revealed substantial repeat content (39.6%) and heterozygosity (1.94%), as characterized by GenomeScope v2.017 (Fig. 1).

Fig. 1

Estimated characteristics of the Serangium japonicum genome based on a 21-mer count histogram from Illumina short-read data.
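The K-mer approach described above rests on a count histogram: how many distinct K-mers occur once, twice, and so on in the reads. The sketch below is a minimal pure-Python illustration of that idea, not the BBTools implementation used in this study; the function names are hypothetical, and treating a K-mer and its reverse complement as one canonical K-mer is an assumption that matches common practice.

```python
from collections import Counter

# Translation table for reverse-complementing DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = frozenset("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map multiplicity -> number of distinct canonical k-mers with that multiplicity.

    K-mers containing characters other than A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if _VALID_BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


if __name__ == "__main__":
    # Toy reads only; real input would be the Illumina short reads.
    histogram = kmer_histogram(["ACGTAC", "ACGTAC"], k=3)
    for multiplicity in sorted(histogram):
        print(multiplicity, histogram[multiplicity])
```

The resulting two-column multiplicity/count table is the general kind of input that histogram-fitting tools work from; a dictionary-based counter like this one would not scale to a full read set and is shown only to make the procedure concrete.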

Genome assembly

We first used Hifiasm v0.16.118 with default parameters for the preliminary assembly of the PacBio HiFi long reads. Purge_dups v1.2.519 was applied to remove heterozygous regions in the assembly based on contig similarity. A haploid cutoff of 70 was set to identify contigs as haplotigs. Quality control of the Hi-C data and read alignment were performed using Juicer v1.6.220. Next, we used 3D-DNA v18092221 to anchor contigs into chromosomes. The resulting assembly was carefully examined, and any errors were manually corrected using Juicebox v1.11.0821. To detect possible contaminants, we conducted BLASTN-like searches with MMseqs2 v1322 against the NCBI nucleotide and UniVec databases. Additionally, we used blastn (BLAST + v2.11.0)23 specifically with the UniVec database to detect vector contaminants. Sequences with over 90% similarity to entries in these databases were marked as possible contaminants. For sequences with similarity exceeding 80%, further verification was performed using online BLASTN against the NCBI nucleotide database. To ensure the purity of the assembled scaffolds, sequences potentially originating from bacterial and human sources were removed.
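The two similarity thresholds used in the contamination screen can be illustrated with a small filter over BLAST-style tabular hits. This is a sketch under stated assumptions rather than the authors' script: it assumes the standard 12-column tabular layout (query ID in column 1, percent identity in column 3), treats "similarity" as percent identity, and assumes that hits between 80% and 90% are the ones sent for manual verification. The function name and sample records are hypothetical.

```python
def classify_hits(tabular_text, flag_threshold=90.0, review_threshold=80.0):
    """Split query sequences into likely contaminants and candidates for manual review.

    Uses the best percent identity observed per query (assumption: identity
    stands in for the "similarity" criterion described in the text).
    """
    best_identity = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, identity = fields[0], float(fields[2])
        best_identity[query] = max(identity, best_identity.get(query, 0.0))

    flagged = sorted(q for q, p in best_identity.items() if p > flag_threshold)
    review = sorted(
        q for q, p in best_identity.items() if review_threshold < p <= flag_threshold
    )
    return flagged, review


if __name__ == "__main__":
    # Hypothetical hits: qseqid sseqid pident length mismatch gapopen
    # qstart qend sstart send evalue bitscore
    hits = "\n".join([
        "ctg1\tsubjA\t95.50\t500\t20\t2\t1\t500\t10\t509\t1e-100\t700",
        "ctg1\tsubjB\t85.00\t300\t40\t5\t1\t300\t1\t300\t1e-50\t350",
        "ctg2\tsubjC\t84.20\t400\t60\t3\t1\t400\t1\t400\t1e-60\t400",
        "ctg3\tsubjD\t70.00\t200\t60\t0\t1\t200\t1\t200\t1e-10\t120",
    ])
    flagged, review = classify_hits(hits)
    print("possible contaminants:", flagged)
    print("verify manually:", review)
```

A real screen would usually also require a minimum alignment length or query coverage before flagging a sequence; the text does not state such a criterion, so none is applied here.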

The final S. japonicum assembly reached chromosome-level resolution, with a total size of 433.74 Mb in 17 scaffolds and 104 contigs. The longest scaffold and longest contig are 82.69 Mb and 28.54 Mb, respectively; the scaffold N50 is 42.67 Mb and the contig N50 is 11.44 Mb, indicating high assembly continuity. The GC content of the genome is 27.90% (Table 2). Contigs totaling 433.04 Mb (99.84% of the assembly) were anchored to ten chromosomes, which range in length from 21.42 Mb to 82.69 Mb (Figs. 2, 3).
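For readers less familiar with the continuity metrics reported here, the following sketch shows how N50, GC content, and the anchored fraction are conventionally computed. The helper names are hypothetical and the toy sequences are illustrative; only the 433.04 Mb and 433.74 Mb figures come from the text.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def gc_percent(sequences):
    """GC percentage over unambiguous bases (A, C, G, T); N and other symbols are ignored."""
    gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


def anchored_percent(anchored_mb, total_mb):
    """Share of the assembly placed on chromosomes, rounded to two decimals."""
    return round(100.0 * anchored_mb / total_mb, 2)


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
    print(gc_percent(["ATGCNN", "AATT"]))
    print(anchored_percent(433.04, 433.74))
```

Applied to the reported totals, `anchored_percent(433.04, 433.74)` reproduces the 99.84% anchoring rate stated above.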

Table 2 Genome assembly statistics for Serangium japonicum.

Fig. 2

The heatmap of the Serangium japonicum genome showing chromosome-level scaffolding with ten anchored chromosomes. Red indicates high intra-chromosomal contact frequencies.

Fig. 3

Circular genome map of Serangium japonicum, showing the distribution of genomic features across ten chromosomes. Rings represent chromosome length, GC content, gene density, and repeat elements (DNA transposons, SINEs, LINEs, LTRs, and simple repeats). The central image depicts an adult S. japonicum.

Genome annotation

We built a de novo repeat library for S. japonicum using RepeatModeler v2.0.424 with the ‘-LTRStruct’ option, which combines structure-based detection of LTR elements with de novo prediction of repeat sequences. This library was then merged with the Dfam 3.525 and RepBase-2018102626 databases to form a comprehensive custom repeat library. RepeatMasker v4.1.427 was subsequently run with this custom library to identify repetitive elements in the S. japonicum genome. Our analysis identified a total of 731,474 repeat sequences, covering 54.66% of the genome (237.06 Mb). The predominant repeat categories were unclassified elements (34.19%), LINEs (6.93%), DNA transposons (6.90%), LTR retrotransposons (2.28%), and simple repeats (1.42%) (Table 3).

Table 3 Genome assembly and annotation statistics of Serangium japonicum.

We used Infernal v1.1.428 with the Rfam v14.10 database29 to annotate ncRNAs in the S. japonicum genome. tRNAscan-SE v2.0.930 was used to predict tRNA sequences, and low-confidence tRNAs were then filtered out with the ‘EukHighConfidenceFilter’ script. The analysis revealed 1,270 ncRNAs in the S. japonicum genome, mainly including 3 lncRNAs, 2 ribozymes, 78 snRNAs, 55 miRNAs, 247 tRNAs, and 861 rRNAs (Table 3).

Protein-coding genes of S. japonicum were annotated using MAKER v3.01.0331, which integrates transcript evidence, ab initio gene predictions, and homologous proteins. RNA-seq reads were aligned using HISAT2 v2.2.132, and the resulting alignments were used for genome-guided transcript assembly with StringTie v2.1.133. Ab initio gene predictions were performed using BRAKER v2.1.634, which integrates GeneMark-ES/ET/EP 4.68_lic34, GeneMark-ETP35, and Augustus v3.5.036. These tools were automatically trained on the RNA-seq alignments and on reference proteins from the OrthoDB v11 database37. Homology-based gene predictions were performed using GeMoMa v1.938, incorporating protein sequences from six species (Drosophila melanogaster (GCA_029775095.1)39, Apis mellifera (GCA_019321825.1)40, Chrysoperla carnea (GCA_905475395.1)41, Tribolium castaneum (GCA_000002335.3)42, Coccinella septempunctata (GCA_907165205.1)43, and Harmonia axyridis (GCA_011033045.2)44). The MAKER pipeline predicted a total of 12,299 protein-coding genes, with an average gene length of 12,065.7 bp. Each gene had an average of 6.1 exons, with an average exon length of 305.6 bp, and an average of 5.1 introns, with an average intron length of 2,116.1 bp. Genes contained an average of 5.9 coding sequence (CDS) regions, which averaged 266.8 bp in length (Table 3). The completeness of the predicted protein set was evaluated with BUSCO v5.0.445, yielding a completeness score of 97.0% (n = 1,367). This comprised 84.9% (1,160) single-copy, 12.1% (166) duplicated, 0.2% (3) fragmented, and 2.8% (38) missing BUSCOs, indicating high-quality predictions.
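The BUSCO percentages above follow directly from the reported category counts. The short sketch below recomputes them; the function name and the returned keys (which mirror the C/S/D/F/M letters of a BUSCO summary line) are illustrative choices, while the counts are those stated in the text.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages rounded to one decimal."""
    total = single + duplicated + fragmented + missing

    def percent(count):
        return round(100.0 * count / total, 1)

    return {
        "C": percent(single + duplicated),  # complete = single-copy + duplicated
        "S": percent(single),
        "D": percent(duplicated),
        "F": percent(fragmented),
        "M": percent(missing),
        "n": total,
    }


if __name__ == "__main__":
    # Counts reported for the predicted protein set.
    print(busco_summary(single=1160, duplicated=166, fragmented=3, missing=38))
```

With the reported counts (1,160 + 166 + 3 + 38 = 1,367), the function returns 97.0% complete, 84.9% single-copy, 12.1% duplicated, 0.2% fragmented, and 2.8% missing, matching the values given above.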

We used Diamond v2.0.11.146 in very sensitive mode (‘--very-sensitive -e 1e-5’) to search the predicted genes against the UniProtKB database for functional assignment. Subsequently, InterProScan 5.58-91.047 and eggNOG-mapper v2.1.548 were used to query five databases: Pfam49, SMART50, Superfamily51, CDD52, and Gene3D53. These analyses were used to predict conserved protein sequences and domains within the gene set and to assign Gene Ontology (GO) terms and pathways (KEGG, Reactome). InterPro identified protein domains for 10,178 protein-coding genes, while InterPro and eggNOG-mapper jointly annotated GO terms for 9,074 genes and assigned KEGG pathway entries to 4,356 genes.
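The joint GO annotation count can be understood as the number of genes that receive at least one GO term from either tool. The sketch below illustrates that union; the data structures, gene identifiers, and function names are hypothetical, and the text does not state how the two result sets were actually merged.

```python
def merge_go_terms(*sources):
    """Union per-gene GO term sets from any number of annotation sources."""
    merged = {}
    for source in sources:
        for gene, terms in source.items():
            merged.setdefault(gene, set()).update(terms)
    return merged


def count_annotated(merged):
    """Number of genes carrying at least one GO term after merging."""
    return sum(1 for terms in merged.values() if terms)


if __name__ == "__main__":
    # Hypothetical per-gene results from two tools.
    interpro = {"gene1": {"GO:0005524"}, "gene2": set()}
    eggnog = {"gene1": {"GO:0005524", "GO:0016301"}, "gene3": {"GO:0003677"}}
    merged = merge_go_terms(interpro, eggnog)
    print(count_annotated(merged), "genes with GO terms")
```

Taking the union rather than the intersection keeps terms supported by only one tool, which maximizes coverage at the cost of including annotations with single-source support.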

Data Records

The Serangium japonicum genome project has been deposited at NCBI. The Hi-C, transcriptome, RNA-ONT, and PacBio HiFi datasets are accessible under the identifiers SRR2925231954, SRR2925232055, SRR2925231856, and SRR2925232257, respectively. The assembled genome has been submitted to the NCBI database under the accession number GCA_040543525.258. The annotation results have been deposited in figshare59.

Technical Validation

The completeness of the assembly was evaluated using BUSCO v5.0.445 with the Insecta database (n = 1,367). The results indicated a BUSCO completeness of 97.8%, comprising 97.3% single-copy, 0.5% duplicated, 0.1% fragmented, and 2.1% missing genes. In addition, Minimap2 and SAMtools were used to align the PacBio, Illumina, and RNA sequencing reads to the final assembly. The alignment rates of the short-read RNA-seq (RNA-sr), RNA-ONT, and HiFi data were 93.76%, 99.39%, and 99.85%, respectively. These findings support the high quality of the S. japonicum genome assembly.
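An alignment rate of this kind is the fraction of reads that map to the assembly. The sketch below computes it from SAM records using the standard FLAG bits; it is an illustration under assumptions, not the procedure used in this study. In particular, counting only primary records (skipping secondary and supplementary alignments) is one possible convention, and the sample records are made up.

```python
# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_rate(sam_lines):
    """Percentage of primary SAM records that are mapped.

    Header lines (starting with '@') are skipped; secondary and supplementary
    records are ignored so that each read segment is counted once.
    """
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    records = [
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t16\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "read1\t256\tchr1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(f"{mapping_rate(records):.2f}%")
```

Summary tools may define the denominator differently, for example by including secondary and supplementary records, so rates computed this way can differ slightly from tool-reported percentages.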