Background & Summary

Longhorned beetles (family Cerambycidae) represent one of the most species-rich families of beetles, with over 36,000 described species in 4,100 genera worldwide1,2. As a major lineage of phytophagous beetles, most longhorned beetles attack and feed on plant tissues, and their diversification is often explained by a coevolutionary radiation with diversifying angiosperms3,4,5. Among them, the Asian longhorned beetle, Anoplophora glabripennis (Motschulsky), was one of the first species to be investigated for the genetic basis of plant-feeding evolution, due to its broad host range and invasive pest status in the United States6,7,8. Recent transcriptomic and genomic studies of longhorned beetles, including a reference genome of A. glabripennis, have revealed the presence of horizontally acquired plant cell wall-degrading enzymes in the glycoside hydrolase families, which likely facilitate nutrient acquisition from nutrient-poor, recalcitrant woody tissues6,9,10,11. More recently, chromosome-level genome assemblies have also been generated for two other important xylophagous longhorned beetles, Monochamus alternatus (Hope) and M. saltuarius (Gebler, 1830)12,13—major vectors of pinewood nematodes in East Asia—providing an unparalleled opportunity to study the genomic basis of conifer-feeding and adaptive traits associated with life in temperate forests.

Monochamus longhorned beetles are distributed worldwide, except Australasia, and include a monophyletic clade of 18 conifer-feeding species restricted to temperate forests of the Holarctic region. These conifer specialists are inferred to have evolved within a predominantly angiosperm-feeding lineage of Monochamus at the Miocene-Pliocene boundary around 5 million years ago (Mya)14. In fact, most of these conifer-feeding species are known to transmit the pinewood nematode (PWN), Bursaphelenchus xylophilus (Steiner and Buhrer) Nickle (Nematoda: Aphelenchoididae), the causal agent of pine-wilt disease in the Palearctic region15,16,17 and a nematode species native to North America. While the biology and pest control measures for Palearctic Monochamus species, such as Monochamus alternatus and M. saltuarius, have been extensively studied, the genetic mechanisms underlying Monochamus-PWN interactions in their native North American range remain largely unexplored.

In this study, we present the first chromosome-level genome assembly of Monochamus scutellatus (Say), a major vector of PWN in North America18, generated from PacBio HiFi long reads, Pore-C chromatin conformation capture, and Illumina RNA-seq data. The genome spans 830.9 Mbp and comprises 10 pseudo-chromosomes (Fig. 2A,B; Table 2), consistent with previous cytological evidence19. Chromosome 10 was identified as the X chromosome based on synteny analysis, which revealed extensive conservation of the X chromosome across Coleoptera (Fig. 2C). With a genome size comparable to those of the two Palearctic congeners, M. alternatus (792.1 Mbp) and M. saltuarius (682.2 Mbp), the M. scutellatus genome demonstrates exceptional contiguity, reflected by fewer scaffolds and a higher N50 (Table 2). As the first genomic resource for a North American Monochamus species, the M. scutellatus genome provides a valuable foundation for investigating the genomic basis of Monochamus-PWN interactions in their region of origin and offers a key comparative framework for testing evolutionary hypotheses on the origin of these interactions, as well as their role in the beetles’ adaptation to utilizing the vast resource of coniferous forests across the Northern Hemisphere.

Methods

Sample collection

Adult specimens of Monochamus scutellatus were collected in July 2023 from Eastern white pine, Pinus strobus Linnaeus (Pinaceae), at Blue Hills Reservation, Milton, Massachusetts, U.S.A. (42°13.237′N, 71°7.037′W; elev. 81 m), using panel traps equipped with monochamol pheromone lures (Fig. 1). To minimize contamination from gut contents, all specimens were starved for several days, flash-frozen in liquid nitrogen, and subsequently cryo-preserved at −80 °C until used for extraction. Two female specimens were used: one for PacBio sequencing and the other for Pore-C and Illumina transcriptome sequencing. The voucher specimen for PacBio sequencing (voucher no.: MCZ-SK1313) has been deposited in the Entomology Research Collection at the Museum of Comparative Zoology, Harvard University.

Fig. 1

(A) Voucher specimen of Monochamus scutellatus (MCZ-SK1313; female) used for genome sequencing. (B) Habitat of M. scutellatus consisting of Eastern white pine (Pinus strobus), from which specimens were collected using panel traps equipped with monochamol pheromone lures (Milton, Massachusetts, USA).

Nucleic acid extraction and sequencing

High molecular weight (HMW) genomic DNA (gDNA) was extracted from the thoracic muscle tissue of an individual adult specimen using the Qiagen MagAttract HMW DNA Kit (Qiagen, Hilden, Germany). The integrity of the extracted gDNA was evaluated via gel electrophoresis on a 1% agarose gel with a lambda DNA marker, while concentration and purity were assessed using a Quantus Fluorometer (Promega, Madison, WI, USA) and a Nanodrop Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). Purified HMW gDNA was treated with the Short Read Eliminator (SRE) XL Kit (Pacific Biosciences, Menlo Park, CA, USA) to remove DNA fragments below 40 kbp, and sheared into 20 kbp fragments using the Megaruptor 2 (Diagenode, Liège, Belgium). A PacBio SMRT library was constructed using the SMRTbell Prep Kit 3.0, and sequenced on a single SMRT HiFi cell of the PacBio Sequel IIe system at the National Instrumentation Center for Environmental Management (NICEM), Seoul National University (Seoul, Republic of Korea), generating 34.0 Gbp of HiFi reads (Table 1).

Table 1 Summary statistics of raw sequencing data for Monochamus scutellatus used in genome assembly.

Chromatin conformation capture sequencing was performed on half of a longitudinally bisected specimen (voucher no.: MCZ-SK1314) following the Pore-C protocol20. Briefly, chromatin was fixed in situ within intact nuclei using formaldehyde to preserve native 3-D interactions. Following permeabilization of the nuclei, chromatin was denatured to expose accessible regions and digested with the restriction enzyme NlaIII (New England Biolabs, Ipswich, MA, USA). Proximally crosslinked DNA fragments were then ligated, and purified via phenol:chloroform extraction. The final Pore-C library was prepared using the Genomic DNA by Ligation Protocol [SQK-LSK114; Oxford Nanopore Technologies (ONT), Oxford, UK], and sequenced on a single flowcell of the PromethION system at NICEM (Seoul, Republic of Korea), yielding 22.5 Gbp of Pore-C reads with Phred Q-score ≥10 (Table 1).

Total RNA was extracted from the remaining half of the second specimen using the mirVana miRNA Isolation Kit (Invitrogen, Waltham, MA, USA). RNA concentration and integrity were evaluated using a Nanodrop Spectrophotometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). An mRNA library was constructed using the NEBNext Ultra II RNA Library Prep Kit (New England Biolabs, Ipswich, MA, USA) and sequenced on a 150-bp paired-end S4 flowcell of the NovaSeq 6000 platform (Illumina, San Diego, CA, USA) at Novogene (Sacramento, CA, USA), producing 17.2 Gbp of Illumina RNA-seq reads (Table 1).

Genome assembly and scaffolding

To assemble the genome of M. scutellatus, genome size and heterozygosity were first estimated from the raw PacBio reads using Jellyfish v2.3.021 and GenomeScope v2.022 with a k-mer size of 35, yielding an estimated genome size of 774.11 Mbp and a heterozygosity of 1.21%. Primary contigs were assembled using Hifiasm v0.16.123, and haplotypic duplications were resolved by reassigning allelic contigs using Purge Haplotigs v1.1.224. The primary contigs were then screened for potential contamination using BlobTools v1.1.125; two of the 33 primary contigs were identified as prokaryotic or as having GC content atypical for arthropods and were removed.
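The logic behind a k-mer-based genome size estimate can be illustrated with a minimal sketch. GenomeScope fits a mixture model to the k-mer spectrum; the code below is not that model but a simpler peak-based approximation that ignores heterozygosity. The function name, the error cutoff, and the toy histogram are all hypothetical and are not taken from the study's data.

```python
def estimate_genome_size(histogram, min_coverage=5):
    """Peak-based genome size estimate from a k-mer histogram.

    histogram maps k-mer coverage -> number of distinct k-mers seen at
    that coverage (the kind of two-column table a k-mer counter reports).
    Bins below min_coverage are treated as sequencing errors (assumption).
    """
    # Discard low-coverage bins, which are dominated by erroneous k-mers.
    solid = {cov: n for cov, n in histogram.items() if cov >= min_coverage}
    if not solid:
        raise ValueError("no k-mers at or above the coverage cutoff")
    # The main peak approximates the per-base sequencing depth.
    peak = max(solid, key=solid.get)
    # Total k-mer occurrences divided by depth approximates genome length.
    total_kmers = sum(cov * n for cov, n in solid.items())
    return total_kmers / peak, peak


# Toy histogram (invented values, not real data).
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 150, 20: 200, 21: 150, 22: 100}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak coverage: {peak}, estimated size: {size:.0f} bp")
```

For a genome with appreciable heterozygosity, such as the 1.21% reported here, the spectrum typically shows two peaks, which is why a model-based tool is preferable to this single-peak shortcut.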

Chromosome-level scaffolds were constructed from the primary contigs and Pore-C data using the Pore-C Snakemake workflow20 and the 3D-DNA pipeline v18099226. The Hi-C contact map was visualized and manually curated in Juicebox v2.20.0027, and scaffolds were finalized with 3D-DNA. Scaffolding was further refined with RagTag v2.1.028 using the primary contigs as reference, increasing the mean scaffold length from 16.6 Mbp to 27.7 Mbp. Error correction was performed with Inspector v1.3.129 using the original raw PacBio HiFi reads. The final genome assembly was 830.9 Mbp in total length, slightly larger than the estimated genome size, and 97.9% of it was assembled into 10 chromosome-scale scaffolds ranging from 28.4 Mbp to 152.6 Mbp (Table 2). A high-resolution Hi-C contact frequency heatmap was generated using HiGlass v0.8.030 to visualize chromosomal architecture (Fig. 2A). The mitochondrial genome was assembled using MitoHiFi v3.231, guided by the mitochondrial genome of Anoplophora glabripennis (GenBank accession: NC_008221.1) as a reference, and the final mitochondrial contig was selected based on annotations from MitoFinder v1.4.132.
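Contiguity statistics of the kind summarized in Table 2 follow standard definitions, sketched below. The function names and the toy scaffold lengths are hypothetical; the actual scaffold lengths of the assembly are not reproduced here.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty list of scaffold lengths")


def anchored_fraction(lengths, n_chromosomes):
    """Fraction of the assembly contained in the n longest scaffolds."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:n_chromosomes]) / sum(ordered)


# Toy scaffold lengths in Mbp (invented values).
toy_lengths = [50, 40, 30, 20, 10]
print("N50:", n50(toy_lengths))
print("fraction in 3 longest scaffolds:", anchored_fraction(toy_lengths, 3))
```

Applied to the real scaffold lengths with `n_chromosomes=10`, the second function would correspond to the 97.9% figure reported above, assuming the 10 pseudo-chromosomes are the 10 longest scaffolds.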

Table 2 Summary statistics for genome assemblies and annotations of eight coleopteran genomes analyzed.
Fig. 2

Genomic characteristics of the Monochamus scutellatus genome. (A) Hi-C contact frequency heatmap (Pore-C) of the M. scutellatus genome, showing chromosomal interactions and scaffold continuity across the 10 pseudo-chromosomes. (B) Circular genome plot of the M. scutellatus genome, with inner tracks depicting key genomic features. (C) Time-calibrated phylogeny and macrosynteny across the genomes of M. scutellatus and other beetles, including six Cerambycidae and Tribolium castaneum (Tenebrionidae) as the outgroup. Chromosome nomenclature for reference species corresponds to their respective NCBI genome assemblies.

Repeat elements and gene annotations

Repeat regions and transposable elements (TEs) in the M. scutellatus genome were predicted and annotated using both homology-based and de novo prediction approaches within the Earl Grey pipeline v4.4.033. RepeatMasker v4.1.534 was used to annotate repeats based on the Dfam v3.835 TE database for Coleoptera, and RepeatModeler v2.0.536 was employed to generate a species-specific de novo repeat library for M. scutellatus. De novo consensus TE sequences were curated through the “BLAST, Extract, Extend and Trim” (BEAT) process37, and long terminal repeat (LTR) retrotransposons were further annotated using LTR_Finder v1.0738. Repetitive elements accounted for 70.7% of the genome, with unknown repeats and DNA transposons comprising 41.1% and 30.4% of all repeats, respectively (Table 3).
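Because the class percentages are given relative to all repeats rather than to the genome, it can help to convert them to genome-wide terms. The following sketch does that arithmetic using only the figures quoted in the text (830.9 Mbp assembly, 70.7% repeats); it takes "of all repeats" literally, and the function name is hypothetical.

```python
GENOME_MBP = 830.9          # total assembly length reported in the text
REPEAT_FRACTION = 0.707     # repeats as a fraction of the genome
SHARE_OF_REPEATS = {        # class shares, as fractions of all repeats
    "unknown": 0.411,
    "DNA transposons": 0.304,
}


def repeat_breakdown(genome_mbp, repeat_fraction, shares):
    """Convert shares of the repeat content into genome-wide fractions."""
    repeat_mbp = genome_mbp * repeat_fraction
    per_class = {
        name: {
            "fraction_of_genome": share * repeat_fraction,
            "mbp": share * repeat_mbp,
        }
        for name, share in shares.items()
    }
    return repeat_mbp, per_class


repeat_mbp, per_class = repeat_breakdown(GENOME_MBP, REPEAT_FRACTION, SHARE_OF_REPEATS)
print(f"total repeats: {repeat_mbp:.1f} Mbp")
for name, stats in per_class.items():
    print(f"{name}: {stats['fraction_of_genome']:.1%} of genome, {stats['mbp']:.1f} Mbp")
```

Under this reading, repeats total roughly 587 Mbp, with unknown repeats and DNA transposons covering about 29% and 21% of the genome, respectively.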

Table 3 Summary statistics of repeat elements across eight coleopteran genomes analyzed.

Gene prediction was performed on the repeat-masked genome assembly using BRAKER v3.0.739, integrating species-specific transcriptomic and protein data, the Arthropoda OrthoDB, and reference protein datasets from Anoplophora glabripennis (GCF_000390285.2), Tribolium castaneum (GCF_000002335.3), Drosophila melanogaster (GCF_000001215.4) and Bombyx mori (GCF_030269925.1). A species-specific protein dataset for M. scutellatus was generated by assembling Illumina RNA-seq reads with rnaSPAdes v3.13.040, and translating RNA contigs into amino acid sequences with TransDecoder v5.7.041. Prior to the assembly, raw Illumina RNA-seq reads were adapter-trimmed and quality-filtered to a minimum Phred Q-score of 33 using Trimmomatic v0.3942. Ab initio gene prediction was conducted using AUGUSTUS v3.5.543, trained with transcriptome-based evidence from GeneMark-ET v4.7244 and protein-based evidence from GeneMark-EP+ v4.7245. Consensus gene models were generated by merging the outputs of seven gene prediction runs using TSEBRA v1.1.146, retaining only the longest isoform per gene. The final gene annotation was formatted into GFF using gFACs v1.1.247, resulting in a total of 21,110 predicted protein-coding genes (Table 2). Functional annotation of the predicted gene models was subsequently performed using eggNOG-mapper v2.1.1248 based on the Insecta eggNOG database, yielding a total of 13,684 functionally annotated genes (Tables 2, 4), with a BUSCO (Benchmarking Universal Single-copy Orthologs) protein completeness score of 97.6% (Table 5).
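The step of retaining only the longest isoform per gene can be sketched generically as follows. This is not the internal logic of TSEBRA or gFACs; the function name, the use of coding-sequence length as the criterion, and the gene and transcript identifiers are assumptions made for illustration. The last lines compute the share of predicted genes that received a functional annotation from the two counts given in the text.

```python
def longest_isoform_per_gene(transcripts):
    """Pick one transcript per gene, keeping the one with the longest CDS.

    transcripts: iterable of (gene_id, transcript_id, cds_length) tuples.
    Ties are resolved in favour of the transcript seen first.
    """
    best = {}
    for gene_id, transcript_id, cds_length in transcripts:
        if gene_id not in best or cds_length > best[gene_id][1]:
            best[gene_id] = (transcript_id, cds_length)
    return {gene_id: tx for gene_id, (tx, _) in best.items()}


# Hypothetical gene models (identifiers and lengths are invented).
toy_transcripts = [
    ("g1", "g1.t1", 900),
    ("g1", "g1.t2", 1500),
    ("g2", "g2.t1", 600),
]
print(longest_isoform_per_gene(toy_transcripts))

# Share of predicted genes with a functional annotation (counts from the text).
annotated_fraction = 13_684 / 21_110
print(f"functionally annotated: {annotated_fraction:.1%}")
```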

Table 4 Summary statistics of functional annotations for protein-coding genes in the M. scutellatus genome.
Table 5 BUSCO assessment for the genome assembly, annotated proteins, and transcriptome assembly for M. scutellatus against the Insecta OrthoDB v10 (n = 1,367).

Synteny-based identification of the X chromosome

Genome synteny was analyzed across chromosome-level genome assemblies of seven Cerambycidae species, with Tribolium castaneum (Tenebrionidae) as an outgroup. Orthologous genes were identified via reciprocal-best BLASTp49 hits among annotated proteins using DIAMOND v2.0.1350. Chromosomal homology was evaluated through Bonferroni-corrected one-sided Fisher’s exact tests implemented in odp v0.3.351, based on the reciprocal-best BLASTp results. Conserved macrosyntenic blocks were visualized as ribbon diagrams. The analysis revealed extensive conservation of the X chromosome across all seven Cerambycidae species and T. castaneum, consistent with previous reports of high X chromosome conservation in Coleoptera12, and permitted the identification of chromosome 10 in the M. scutellatus genome as the X chromosome.
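The two statistical ingredients of this analysis, reciprocal-best hits and a Bonferroni-corrected one-sided Fisher's exact test on shared orthologs, can be sketched with the standard library alone. This is a generic illustration rather than the implementation used by odp; all function names and the toy inputs are hypothetical.

```python
from math import comb


def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


def fisher_one_sided(shared, on_chr_a, on_chr_b, total):
    """P(X >= shared) under the hypergeometric null.

    total: ortholog pairs genome-wide; on_chr_a / on_chr_b: pairs whose
    member lies on the chromosome of interest in species A / species B;
    shared: pairs lying on both chromosomes at once.
    """
    upper = min(on_chr_a, on_chr_b)
    numerator = sum(
        comb(on_chr_a, i) * comb(total - on_chr_a, on_chr_b - i)
        for i in range(shared, upper + 1)
    )
    return numerator / comb(total, on_chr_b)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy best-hit tables (invented identifiers).
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(a_to_b, b_to_a))

# Toy enrichment test: 5 of 20 pairs on each chromosome, all 5 shared.
p = fisher_one_sided(shared=5, on_chr_a=5, on_chr_b=5, total=20)
print(bonferroni(p, n_tests=10 * 10))  # e.g. every chromosome pair tested
```

A chromosome pair whose adjusted p-value remains small shares more orthologs than expected by chance, which is the sense in which chromosome 10 is matched to the X chromosomes of the other species.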

Orthologous gene identification and phylogenomic inference

Orthologous genes and orthogroups across the seven Cerambycidae and T. castaneum genomes were identified using OrthoFinder v2.5.552, with DIAMOND for sequence alignment and FastTree v2.1.1153 for maximum likelihood (ML) tree inference. A total of 3,708 single-copy orthologs were identified, aligned with MAFFT v7.5.2654, and trimmed using trimAl v1.455 with the gappyout algorithm. ML gene trees were inferred using IQ-TREE v2.2.2.656 with the MFP+MERGE option. A species tree was reconstructed under the multispecies coalescent model in ASTRAL v5.7.857, with T. castaneum as the outgroup. Divergence times were estimated within a Bayesian framework in MCMCtree, implemented in PAML v4.10.758, employing the approximate likelihood calculation method and two calibration points: (1) a fossil calibration for the crown-group Cerambycidae, based on the age of Cretoprionus liutiaogouensis Wang et al. from the Lower Cretaceous circa 122.5–124.0 Mya59; and (2) a secondary calibration for the Cerambycidae-Tenebrionidae divergence at approximately 220.2 Mya [95% highest posterior density (HPD): 188.1–237.6 Mya]5. The resulting time-calibrated phylogeny supports a sister-group relationship between M. scutellatus and the Palearctic clade comprising M. alternatus and M. saltuarius, which diverged approximately 11.8 Mya (95% HPD: 8.0–16.3 Mya) (Fig. 2C), providing robust evidence for the systematic placement and divergence history of M. scutellatus within Cerambycidae.

Data Records

All raw sequencing data (PacBio HiFi, ONT Pore-C, and Illumina RNA-seq) used for genome assembly and annotation for M. scutellatus have been deposited in NCBI BioProject PRJNA1289024. PacBio HiFi sequencing data, ONT Pore-C sequencing data and Illumina RNA-seq data are available within the NCBI Sequence Read Archive (SRA) under accession numbers SRR3444437960, SRR3444437861, and SRR3444437762, respectively. The final chromosome-level genome assembly has been deposited in GenBank under accession number GCA_052862855.163. The genome assembly and annotation files are available from the Figshare Repository (https://doi.org/10.6084/m9.figshare.2957536164).

Technical Validation

The final genome assembly, constructed using PacBio HiFi and Pore-C sequencing data, along with transcriptome data, was assembled into 10 chromosome-level scaffolds. Genome completeness was assessed with BUSCO v5.8.065 against the Insecta OrthoDB v1066, which recovered 99.0% of core single-copy orthologs, with only 0.8% duplicated (Table 5), exceeding the 90% BUSCO threshold recommended for reference genomes67. To further evaluate assembly quality, raw PacBio HiFi reads were mapped back to the final genome assembly using Minimap2 v2.2168 within Inspector, resulting in a mapping rate of 99.6%, an average alignment depth of 40.7×, and an assembly quality value (QV) of 36.9 (Table 6), indicating a highly complete and contiguous assembly.
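The reported QV and depth can be put in more intuitive terms with a short calculation. The sketch assumes the standard Phred-scale definition, QV = −10·log10(error probability), and approximates depth as mapped bases divided by assembly length, treating the read-level mapping rate as a base-level one; the function names are hypothetical, and the input values are those quoted in the text.

```python
def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def mean_depth(total_bases, mapping_rate, assembly_length):
    """Approximate mean depth: mapped bases divided by assembly length."""
    return total_bases * mapping_rate / assembly_length


error_rate = qv_to_error_rate(36.9)
depth = mean_depth(
    total_bases=34.0e9,        # HiFi yield reported in the Methods
    mapping_rate=0.996,        # reported mapping rate
    assembly_length=830.9e6,   # final assembly length
)
print(f"error rate: {error_rate:.2e} (about 1 error per {1 / error_rate:,.0f} bp)")
print(f"approximate depth: {depth:.1f}x")
```

A QV of 36.9 thus corresponds to roughly one error per 4,900 bp, and the back-of-the-envelope depth of about 40.8× is close to the 40.7× reported by Inspector.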

Table 6 Summary statistics for raw long-read sequencing data mapped to the genome assembly of M. scutellatus.