Background & Summary

The longhorn beetle Arhopalus rusticus (Linnaeus) (Coleoptera: Cerambycidae: Aseminae: Arhopalus) is a wood-boring pest of conifers, mainly pine and spruce1,2. Its native distribution includes Europe and Asia3. In recent years, this species has been introduced to Argentina, the United States, Mexico, Australia, and New Zealand1,4,5,6. A. rusticus tends to feed on weak or dead trees and shallow roots7,8, especially dead trees left standing after a fire9,10,11. The larvae bore into the trunk, creating irregular galleries, accelerating decay, and reducing the value of the wood12,13. The species not only damages host plants but also threatens timber during transport. It has been intercepted at multiple Chinese ports, and historical interception records confirm its passive dispersal via wood packaging materials, which poses a potential threat to forest ecological security. Well-assembled genomes provide invaluable resources for understanding the biology, ecology and evolution of A. rusticus14. Currently, genomes of Cerambycidae have been reported for Anoplophora glabripennis15, Monochamus alternatus16, and Monochamus saltuarius17, but there is no assembled genome for A. rusticus. Bridging this knowledge gap will greatly aid control efforts against A. rusticus.

In this study, we assembled a chromosome-level genome of A. rusticus using a combination of Oxford Nanopore long-read, Illumina short-read sequencing, and chromosome conformation capture (Hi-C) technologies to provide genomic resources for future investigations and pest management.

Methods

Sample preparation

Samples of A. rusticus were collected from the Double Island Forest Farm in Weihai, Shandong province. A single female adult was used to construct the libraries for Illumina, Oxford Nanopore Technology (ONT), and Hi-C sequencing. Specimens were starved for 24 hours, and the guts were removed to minimize contamination from gut microbes. Additionally, we collected three replicates each of larvae, pupae, and adults for transcriptome sequencing. All samples were frozen in liquid nitrogen and stored at −80 °C until use.

Genomic DNA sequencing

For short-read sequencing, genomic DNA was extracted using the CTAB method, followed by purification using the QIAGEN® Genomic DNA extraction kit (Qiagen, Hilden, Germany). A paired-end library with a target insert size of 300 bp was prepared using the VAHTS™ Universal DNA Library Prep Kit for Illumina® V3 (Vazyme, ND607, Nanjing, China) and sequenced on the Illumina X10 platform (Illumina, San Diego, CA, USA). Illumina sequencing yielded 40.36 Gb (47.6 × coverage) of short reads (Table 1).

Table 1 Sequencing data and methods used in this study to assemble the Arhopalus rusticus genome.

For long-read sequencing, high molecular weight genomic DNA was isolated using the QIAGEN® Genomic DNA extraction kit (Qiagen, Hilden, Germany) according to the manufacturer's standard operating procedure. A total of 3-4 μg of DNA was used as input material for ONT library preparation. Long DNA fragments were selected using the PippinHT system (Sage Science, USA). End repair and dA-tailing were conducted with the NEBNext Ultra II End Repair/dA-tailing Kit (Ipswich, MA, USA). The SQK-LSK109 adapter (Oxford Nanopore Technologies, UK) was used for the subsequent ligation reaction. A DNA library of 700 ng was constructed and sequenced on a Nanopore PromethION sequencer (Oxford Nanopore Technologies, UK) at GrandOmics Biosciences Co., Ltd. (Wuhan, China), resulting in a total of 68.3 Gb (53.3 × coverage) of clean data (Table 1).

Hi-C library preparation and sequencing

For Hi-C sequencing, the library was prepared according to the standard protocol described by Belton with minor modifications18. The sample was cross-linked with a 2% formaldehyde isolation buffer, and the nuclei were then digested with DpnII. The digested ends were repaired with biotinylated nucleotides. The resulting Hi-C library was sequenced on the Illumina HiSeq platform with paired-end 150-bp reads (Illumina, San Diego, CA, USA) at Annoroad Gene Technology Co., Ltd. (Beijing, China). A total of 162.1 Gb (137.3 × coverage) of clean data was generated (Table 1).

Transcriptome sequencing

For transcriptome sequencing, total RNA was extracted from each A. rusticus life stage (larva, pupa, and adult) separately using the RNAprep Pure Tissue Kit (Tiangen, China). Libraries were constructed using a TruSeq RNA sample preparation kit (Illumina, San Diego, CA, USA) and sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) with the paired-end mode at GrandOmics Biosciences Co., Ltd. (Wuhan, China), resulting in a total of 76.3 Gb clean data (Table 1).

Estimation of genomic characteristics

The Illumina raw reads were checked and filtered using Trimmomatic version 0.39-219 to discard reads with adaptors, unknown nucleotides (Ns), or >20% low-quality bases. Genome size, heterozygosity, and duplication were estimated using Jellyfish version 2.2.1020 and GenomeScope version 2.021 with default parameters. Based on 17-mer depth analysis, the genome size was estimated to be 1004 Mb, with a heterozygosity rate of 1.32% and a duplication rate of 1.06% (Table 2, Fig. 1A).
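The idea behind k-mer based size estimation can be illustrated with a minimal sketch. It is not GenomeScope's model, which fits a mixture distribution to the histogram. It only applies the basic relation "genome size ≈ total k-mer occurrences / depth of the main peak", and it assumes the tallest peak is the genome-size peak, as described for Fig. 1A. The function name, the error cutoff `min_depth`, and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate (bp) from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers at that depth.
    min_depth: depths below this are treated as sequencing errors (assumed cutoff).
    Returns (estimated_size, peak_depth).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mer depths at or above min_depth")
    # Main peak: the depth that holds the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of (depth * distinct k-mers).
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with invented numbers (not the A. rusticus data).
toy = {1: 900_000, 2: 50_000, 20: 1_000, 39: 20_000, 40: 60_000, 41: 20_000, 80: 500}
size, peak = estimate_genome_size(toy)
```

In a highly heterozygous genome the tallest peak may be the heterozygous one, so a real analysis should rely on the fitted model rather than on this shortcut.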

Table 2 Statistics for the chromosome-level genome of Arhopalus rusticus.
Fig. 1

Feature estimation and assembly of the Arhopalus rusticus genome. (A) Estimation of A. rusticus genomic features. The 17-mer distribution shows two peaks: the first peak, at a coverage of 100, indicates genome duplication, and the highest peak, at a coverage of 200, represents the genome-size peak. The A. rusticus genome size was calculated to be 1004 Mb, with a heterozygosity rate of 1.32% and a duplication rate of 1.06%. (B) Genome-wide contact matrix of the A. rusticus genome generated using Hi-C data. Each black square represents a pseudo-chromosome. The color bar indicates the interaction intensity of Hi-C contacts.

Genome assembly

We assembled a contig-level draft genome from the Nanopore long reads using NextDenovo version 1.2.5 (https://github.com/Nextomics/NextDenovo) with default parameters. Purge_dups was used to remove alternative haplotypes and redundant fragments from the contig assembly. We then used Hi-C data to anchor the assembly into chromosome-scale linkage groups. The Hi-C reads were cleaned using Fastp22 and mapped to the contigs using BWA. YaHS version 1.2a.123 and Juicertools version 1.19.0224 were used for scaffolding and manual correction. Finally, two rounds of polishing with ONT reads and Illumina reads were performed using NextPolish version 1.4.025. The resulting chromosome-level genome was 1180.40 Mb with a scaffold N50 of 125.01 Mb, a maximum scaffold length of 232.23 Mb, and a GC content of 32.32% (Table 2). A total of 98.57% of the genome was anchored to 10 pseudo-chromosomes, which were well distinguished from each other in the chromatin interaction heatmap (Fig. 1B).
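For readers unfamiliar with the summary statistics reported above, the following sketch shows how a scaffold N50 and a GC percentage are conventionally computed. It is an illustration only, not the tool used to produce Table 2. The function names and toy inputs are hypothetical, and the choice to ignore N bases in the GC calculation is an assumption.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def gc_content(sequence):
    """Return the GC percentage over unambiguous bases (A, C, G, T).
    Case is ignored and N bases are excluded (an assumed convention)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


toy_lengths = [100, 80, 60, 40, 20]  # invented scaffold lengths
toy_n50 = n50(toy_lengths)           # 100 + 80 = 180 >= 150, so N50 is 80
toy_gc = gc_content("ATGCNNatgc")    # 4 GC out of 8 unambiguous bases
```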

Genome annotation

Genes in the assembled genome were predicted using a combination of transcriptome-based, ab initio and homology-based methods. For the transcriptome-based method, short transcriptome reads were mapped to the genome using Hisat226. The aligned BAM files were then used to assemble transcripts using Stringtie version 2.1.427, and genes were predicted using PASA version 2.0.2 with default settings28. The ab initio prediction was performed using AUGUSTUS version 3.4.029 and SNAP version 2006-07-2830, which were trained on transcripts longer than 300 bp generated by PASA. Homology-based prediction used peptide and transcript sequences downloaded from other species of Coleoptera, including A. glabripennis, Tribolium castaneum31, Dendroctonus ponderosae32 and Diabrotica virgifera33. Redundant genes in the pooled gene set were removed using CD-HIT34, and the Maker version 3.01.0435 pipeline was used to perform the homology-based prediction. Finally, the evidence from these methods was combined using EvidenceModeler (EVM) version 1.1.136 to generate a high-confidence gene set.

Gene functions were annotated using eggNOG-mapper version 2.1.937. The predicted genes were searched against multiple public databases, including Gene Ontology (GO), Clusters of Orthologous Groups of Proteins (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), CAZy, and Pfam. We identified 18,377 protein-coding genes (Table 2) and 11,368 functionally annotated genes, of which most genes (97.47%) were successfully annotated in at least one public database (Table 3).

Table 3 Number of annotated protein-coding genes in different databases.

Repeats prediction and non-coding RNA annotation

Homology-based and de novo prediction methods were used to detect transposable elements (TEs). Repeat sequences were detected using RepeatMasker version 4.1.2 (-no_is -norna -xsmall -q)38 against the Repbase and Dfam databases and a species-specific repeat library built with RepeatModeler version 2.0.3. In total, 69.87% of the genome was identified as repetitive DNA. Overall, 473,405 retroelements (6,475 short interspersed nuclear elements (SINEs), 304,805 long interspersed nuclear elements (LINEs), and 162,125 long terminal repeats (LTRs)) and 272,785 DNA transposons were identified. Additionally, 487 satellites and 11,473 simple repeats were identified as tandem repeats (TRs), accounting for 0.05% of the genome (Table 4).
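The retroelement total quoted above is the sum of its three classes, which can be checked with a few lines of arithmetic. The snippet uses only the counts reported in this section. The variable names are illustrative.

```python
# Element counts as reported in the text (number of elements, not base pairs).
retroelements = {"SINE": 6_475, "LINE": 304_805, "LTR": 162_125}
dna_transposons = 272_785

total_retro = sum(retroelements.values())
total_interspersed = total_retro + dna_transposons

# Share of each retroelement class by element count.
class_share = {name: count / total_retro for name, count in retroelements.items()}
```

These shares are by element count and should not be confused with the percentage of the genome each class occupies, which is reported in Table 4.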

Table 4 Repeat element statistics for the Arhopalus rusticus genome from RepeatMasker.

Noncoding RNA (ncRNA) annotation was conducted using tRNAscan-SE version 1.3.139 and RNAmmer version 1.240,41 to predict tRNAs and rRNAs, respectively. We obtained 4,373 tRNAs and 79 rRNAs, including 50 8s_rRNA, 16 28s_rRNA, and 13 18s_rRNA, in the A. rusticus genome (Table 5).

Table 5 Statistics of non-coding RNAs in the genome of Arhopalus rusticus.

Identification of orthologs and inference of phylogenetic relationships

To identify single-copy orthologous genes, we utilized the longest protein sequence of each gene from A. rusticus and multiple other species. We reconstructed a species tree with published coleopteran genome data using OrthoFinder version 2.5.542.

Then, protein-coding genes of A. rusticus and 11 other species of Coleoptera were used for phylogenetic analysis, with Drosophila melanogaster43 as the outgroup. The Coleoptera species were M. alternatus, M. saltuarius, A. glabripennis, T. castaneum, D. virgifera, D. ponderosae, Leptinotarsa decemlineata44, Agrilus planipennis45, Photinus pyralis46, Onthophagus taurus47 and Protaetia brevitarsis48. All genome data were downloaded from NCBI (https://www.ncbi.nlm.nih.gov) and GigaDB (http://gigadb.org/dataset/100560). The E-value threshold for the all-versus-all alignment of protein sequences was set to 1e-5. A total of 50 single-copy orthologous groups identified by OrthoFinder were used for phylogenetic tree reconstruction (Table 7). MAFFT was used for multiple sequence alignment of the sequences in each group, and FastTree version 2 was used to construct the phylogenetic tree.
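The selection of single-copy orthologous groups can be sketched as a simple filter on per-species gene counts: a group qualifies only if every species contributes exactly one gene. OrthoFinder reports these groups itself, so the snippet below is only an illustration of the criterion. The function name, species labels, orthogroup IDs, and counts are hypothetical.

```python
def single_copy_orthogroups(gene_counts, species):
    """Return orthogroup IDs that have exactly one gene in every listed species.

    gene_counts: dict mapping orthogroup ID -> {species label: gene count}.
    species: the species that must each contribute exactly one gene.
    """
    return sorted(
        og
        for og, counts in gene_counts.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


# Toy table with three species labels (invented, for illustration only).
species = ["Arus", "Malt", "Dmel"]
toy_counts = {
    "OG0000001": {"Arus": 1, "Malt": 1, "Dmel": 1},  # single-copy: kept
    "OG0000002": {"Arus": 2, "Malt": 1, "Dmel": 1},  # duplicated in Arus: dropped
    "OG0000003": {"Arus": 1, "Malt": 0, "Dmel": 1},  # missing in Malt: dropped
    "OG0000004": {"Arus": 1, "Malt": 1, "Dmel": 1},  # single-copy: kept
}
selected = single_copy_orthogroups(toy_counts, species)
```

The strictness of this criterion explains why the number of usable groups shrinks as more, and more distantly related, species are included.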

Table 6 BUSCO evaluation of genome assemblies.
Table 7 Overall statistics of the orthogroups among 13 insects in this study.

A molecular clock model was fitted using r8s v1.749. Two nodes with divergence times specified in the TimeTree database (http://www.timetree.org/) were selected as calibration points50 to estimate the divergence times of the other nodes. The calibration was based on the conclusion that L. decemlineata and D. ponderosae diverged ~191 Mya (million years ago), and T. castaneum and A. planipennis diverged 262 Mya51. The divergence time between A. rusticus and M. alternatus was estimated to be 164.6 Mya (Fig. 2).

Fig. 2

Phylogenetic tree and gene orthology among 13 insect species. The maximum likelihood phylogenetic tree of Arhopalus rusticus and eleven other Coleoptera species, with the model species Drosophila melanogaster as the outgroup, was built using 50 single-copy orthologous genes with 1000 bootstrap replicates. Divergence times are labelled at internodes. A. rusticus and Monochamus alternatus diverged 164.6 Mya (million years ago). The numbers of expanded and contracted gene families are shown in red (expansion) and blue (contraction). The bar chart shows the number of orthologous genes in each species.

Gene-family expansion and contraction were estimated using CAFE version 4.252 with parameters ‘lambda -s -t’, based on maximum likelihood and reduction methods. Tree topology and branch lengths were considered when inferring the significance of changes in gene-family size on each branch. We identified 77 expanded gene families and 15 contracted gene families in A. rusticus (Fig. 2).

Synteny analysis

We conducted chromosomal collinearity analysis using MCScanX (default parameters) on the A. rusticus genome, with M. saltuarius and M. alternatus as reference genomes. The syntenic blocks among these three cerambycid species were visualized using TBtools. Comparative genomic analysis revealed limited synteny between A. rusticus and M. alternatus, with evidence of significant chromosomal rearrangements including fragmentation and fusion events. Specifically, chromosomes 4 and 9 of A. rusticus were derived from fission of chromosome 4 in M. alternatus (Fig. 3).

Fig. 3

Synteny blocks among Monochamus saltuarius, Monochamus alternatus, and Arhopalus rusticus. Msal: Monochamus saltuarius; Malt: Monochamus alternatus; Arus: Arhopalus rusticus.

Data Records

The genome project was deposited in NCBI under BioProject No. PRJNA953210. Illumina sequencing data for genome survey were deposited in the Sequence Read Archive at NCBI under accession number SRR2615123753. Nanopore sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR2615825954. Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR2627134755. RNA-seq data were deposited in the Sequence Read Archive at NCBI under accession numbers SRR26151756-SRR2615175856,57,58. The final chromosome assembly was deposited in GenBank at NCBI under accession number JBIRAU00000000059. The contaminant file, single-copy orthologous genes, gene-family expansion and contraction, gene function annotation, and repeat annotation are available in Figshare60.

Technical Validation

The Hi-C heatmap supports the accuracy of the genome assembly, with largely independent Hi-C signals observed among the 10 pseudo-chromosomes (Fig. 1B). Moreover, we assessed the accuracy of the final genome assembly by mapping Illumina short reads to the A. rusticus genome with BWA-MEM2 version 0.7.172161. The analysis showed that 96.7% of short reads were successfully mapped to the assembled A. rusticus genome.

Furthermore, we evaluated the completeness of the final genome assembly using Benchmarking Universal Single-Copy Orthologues (BUSCO version 5.2.2) with the insecta_odb10 database, which contains 1367 conserved genes62. The analysis revealed a completeness of 93.6% for the A. rusticus genome, with only 1.6% of BUSCO genes fragmented, 4.8% missing, and 0.9% duplicated (Table 6).
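The BUSCO percentages above can be cross-checked with a small parser for BUSCO's one-line summary notation. The sketch below assumes the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout. The example line is reconstructed from the figures in this section, and its single-copy value (92.7%) is derived here as complete minus duplicated rather than quoted from the text. The function name is hypothetical.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Parse a BUSCO one-line summary into a dict of percentages plus n."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    values = {key: float(val) for key, val in match.groupdict().items()}
    values["n"] = int(values["n"])  # number of BUSCO groups searched
    return values


# Reconstructed from the reported values; S is derived (93.6 - 0.9), not quoted.
summary = "C:93.6%[S:92.7%,D:0.9%],F:1.6%,M:4.8%,n:1367"
busco = parse_busco_summary(summary)

# Complete + fragmented + missing should account for all BUSCO groups.
categories_sum = busco["C"] + busco["F"] + busco["M"]
```

Small deviations from 100% are expected in real reports because each percentage is rounded independently.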