Background & Summary

The geometrid moth Phthonandria atrilineata (Butler, 1881), commonly known as the mulberry looper, is a lepidopteran pest endemic to East Asia and widespread across most of China1,2. This nocturnal insect, a member of the family Geometridae, is distinguished by its characteristic inchworm-like morphology and locomotion2,3. In recent years, P. atrilineata has emerged as a serious threat to mulberry cultivation and sericulture, primarily affecting deciduous mulberry trees and causing substantial economic losses1,3,4. Beyond direct damage to mulberry leaves, P. atrilineata is a vector for various pathogens, including microsporidia, viruses, and fungi, which can be transmitted to silkworms. This disease transmission further amplifies its negative impact on sericulture and compounds the challenges faced by silk producers3,4. P. atrilineata also exhibits remarkable phenotypic plasticity and adaptability across diverse environmental conditions. It competes directly with silkworms for mulberry leaves, with far-reaching implications for silk production and agricultural economics1,2,4. This competition, coupled with the species’ ability to thrive in varying climates, makes P. atrilineata an important subject for interdisciplinary research spanning ecology, pest management, and the impacts of climate change on insect population dynamics.

Despite its ecological significance and the economic damage it causes, genetic resources and scientific literature for P. atrilineata remain scarce. A search of the NCBI GenBank database reveals only the mitochondrial genome sequence for this species1, while the Sequence Read Archive (SRA) contains no RNA-seq or genome assembly data. Furthermore, a search of the PubMed database yields only a handful of publications related to this species, all of which focus exclusively on its phylogeny1,5,6. This paucity of genetic information constrains our understanding of P. atrilineata’s biology and evolution and limits the development of effective pest management. The absence of a high-quality reference genome, particularly at the chromosome level, is a critical bottleneck for research on this species. A chromosome-level assembly would provide insights into genome structure and organization, improve the identification of genes related to pesticide resistance and adaptation, and offer a robust foundation for future comparative genomic analyses7,8,9. High-resolution genomic data could also enable the identification of molecular targets for novel, species-specific pest management strategies, potentially reducing reliance on broad-spectrum pesticides that can harm beneficial insects and ecosystems7,9.

Here, we present a high-quality chromosome-level genome assembly of P. atrilineata generated using a combination of PacBio HiFi sequencing and Hi-C techniques. We compared the genome assembly metrics with those of previously published genomes of four other Geometridae species to gain insights into the genomic evolution of this family. The assembled P. atrilineata genome had a total length of 345.32 Mb, with a contig N50 of 11.96 Mb, and achieved a complete BUSCO score of 98.5% using the lepidoptera_odb10 lineage dataset. A total of 336.55 Mb (97.46%) of the sequences were anchored to 31 pseudochromosomes. Genome annotation identified 133.66 Mb of repetitive elements and predicted 15,026 protein-coding genes. This P. atrilineata genome provides a valuable resource for future studies on the genome evolution and adaptation of geometrid moths, as well as for comparative genomic analyses of lepidopteran pests.

Methods

Sample collection and sequencing

Adult P. atrilineata specimens were collected from a mulberry plantation in Yizhou District, Hechi City, Guangxi Province, China (24°29′N, 108°36′E). For genomic analyses, wings were removed to ensure sample purity, and thoracic muscle tissue was harvested from ten adults (five males and five females). Thoracic muscles from all individuals were pooled to create a representative sample for DNA extraction. High-molecular-weight genomic DNA was isolated from the pooled thoracic muscles using the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany) according to the manufacturer’s instructions. Approximately 50 μg of high-quality genomic DNA was sheared into random fragments, and short-read libraries were prepared according to Illumina’s standard protocol. Paired-end reads (150 bp) were sequenced on an Illumina NovaSeq X Plus platform. Additionally, a 15 kb SMRTbell library was constructed from another 50 μg of DNA following the protocol for the PacBio Sequel II platform, and circular consensus sequencing (CCS) was performed. A Hi-C library was also constructed following an optimized protocol10 and sequenced on an Illumina NovaSeq X Plus platform with paired-end reads of 150 bp. For transcriptome analysis, thoracic muscles from an additional ten adults were preserved in RNAlater solution (Thermo Fisher Scientific) at −20 °C. Total RNA was isolated using TRIzol reagent (Invitrogen), and RNA-seq libraries were sequenced on the Illumina NovaSeq X Plus platform (2 × 150 bp paired-end). Raw sequencing data from all libraries underwent quality control and filtering using fastp v0.23.1211. Biological samples have been archived at the Sericulture Key Laboratory of Hechi University. Computational analyses were performed on a high-performance Linux server (2 TB RAM, 128 threads).

Genome assembly

Before assembly, we estimated the genome size and heterozygosity of P. atrilineata from the 21-mer frequency distribution using Jellyfish v2.3.0 and GenomeScope v2.012,13. We then assembled the PacBio HiFi reads into contigs using hifiasm v0.19.8 with default parameters14. To obtain clean Hi-C data, we filtered the raw Hi-C reads using HiC-Pro v3.1.02515. The clean Hi-C reads were then aligned to the assembled contigs using the Juicer pipeline v1.6 to obtain the interaction matrix16. The contigs were ordered, oriented, and anchored into scaffolds using YaHS17. Finally, we manually reviewed the Hi-C contact maps of the final assembly using Juicebox v2.17.0018.
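The k-mer step above can be illustrated with a minimal Python sketch. It is not the Jellyfish or GenomeScope implementation: it counts canonical k-mers in memory and applies the naive estimate (total k-mer count divided by the depth of the main peak), whereas GenomeScope fits a mixture model to the histogram. The function names and the toy reads are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (uppercase only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguous symbols.
            if set(kmer) <= valid:
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def naive_genome_size(histogram, peak_depth):
    """Naive estimate: total number of k-mer occurrences divided by the main-peak depth."""
    total_kmers = sum(depth * n_distinct for depth, n_distinct in histogram.items())
    return total_kmers / peak_depth


# Toy usage with k = 3 (real analyses use k = 21 on hundreds of millions of reads).
toy_hist = kmer_histogram(["ACGTA", "ACGTA"], k=3)
```

In practice the histogram is computed by a dedicated counter such as Jellyfish, because holding all 21-mers of a short-read data set in a Python dictionary would be far too slow and memory-hungry.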

We performed de novo assembly of the P. atrilineata genome at the chromosome level using 34.24 Gb (100-fold coverage) of PacBio HiFi reads, 47.35 Gb (140.69-fold coverage) of clean Illumina short reads, and 14.11 Gb (42-fold coverage) of high-throughput chromatin conformation capture (Hi-C) data (Table 3). The assembled genome size was 345.32 Mb, with 336.55 Mb anchored onto 31 pseudochromosomes (anchor rate of 97.46%) (Fig. 1A; Fig. 3; Table 1). Coverage analysis using all HiFi reads with mosdepth v0.3.319 revealed that the shortest pseudochromosome (Phat_Chr31) has a markedly lower average coverage than the other chromosomes, suggesting that it may represent a sex chromosome. Reduced coverage is expected for a hemizygous sex chromosome, so this observation supports the separation of sex chromosomes from autosomes in the assembly.
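The fold-coverage figures above follow from dividing the sequenced bases by a genome length. The short Python sketch below shows that arithmetic. Which denominator was used is not stated in the text, so using the anchored length (336.55 Mb) is an assumption; it happens to reproduce the reported 140.69-fold Illumina coverage and the rounded 42-fold Hi-C coverage. The function name is hypothetical.

```python
def fold_coverage(total_bases_gb: float, genome_size_mb: float) -> float:
    """Fold coverage = sequenced bases / genome length (both converted to Mb)."""
    return total_bases_gb * 1000.0 / genome_size_mb


# Assumption: the anchored length (336.55 Mb) is used as the denominator.
ANCHORED_MB = 336.55

illumina_cov = fold_coverage(47.35, ANCHORED_MB)  # clean Illumina short reads
hic_cov = fold_coverage(14.11, ANCHORED_MB)       # Hi-C reads

print(f"Illumina: {illumina_cov:.2f}x, Hi-C: {hic_cov:.0f}x")
```

Using the full assembly length (345.32 Mb) instead would give slightly lower values, so the choice of denominator should be stated whenever coverage is reported.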

Fig. 1

Circular visualization of the P. atrilineata genome features. (A) Ideogram showing the 31 pseudochromosomes (Chr1–Chr31), with scale bars indicating physical distance (Mb). (B) Gene density distribution plotted in 100 kb windows, where higher color intensity indicates higher gene concentration. (C–H) Distribution of various repetitive elements in 100 kb windows: (C) total repeat sequence density, (D) Long Terminal Repeat (LTR) retrotransposons, (E) DNA transposons, (F) Short Interspersed Nuclear Elements (SINEs), (G) unclassified repetitive sequences, and (H) transfer RNA (tRNA) genes. (I) GC content variation across the genome in 100 kb windows, where darker blue indicates higher GC percentage. The central image shows a dorsal view of a P. atrilineata inchworm, illustrating the species’ characteristic morphology.

Table 1 Comparative summary of genome assembly metrics between P. atrilineata and other Geometridae genomes.

The assembly is slightly larger (by 3.07 Mb, about 0.9%) than the 21-mer-based genome size estimate of 342.25 Mb, which was obtained together with an estimated heterozygosity of 1.51% (Fig. 2). Despite the relatively high heterozygosity, the accuracy of HiFi reads allowed heterozygous regions to be resolved, and the relatively small genome size further favored a robust and contiguous assembly. Using the lepidoptera_odb10 lineage dataset, the anchored genome contained 98.5% complete and 0.3% fragmented BUSCO genes. The contig N50 of our assembly reached 11.96 Mb, markedly exceeding those of other annotated Geometridae species, including Operophtera brumata8 (4.33 Mb) and Ectropis grisescens9 (2.69 Mb) (Table 1). Moreover, when compared with all 124 chromosome-level Geometridae genome assemblies catalogued in the NCBI Genomes database, our assembly ranks 40th of 125 in contiguity, surpassing the average by 9.46 Mb (Table 1).
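For readers unfamiliar with the contiguity metric used above, the following Python sketch shows how an N50 is computed from a list of contig lengths: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total defines N50.
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one contig length")


# Toy example: total = 54, half = 27; 10 + 9 + 8 = 27, so N50 = 8.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends on the total assembly length, comparisons between species are most meaningful when the genome sizes are similar.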

Fig. 2

K-mer frequency distribution analysis of the P. atrilineata genome using Illumina paired-end reads (k = 21). The x-axis represents k-mer coverage depth, and the y-axis shows the frequency at each depth. The main peak at 84× coverage indicates the average k-mer depth.

Fig. 3

Hi-C contact map showing chromosome-level organization of the P. atrilineata genome. The heat map represents interaction frequencies between genomic regions, where darker brown indicates higher contact frequency. The clear diagonal pattern demonstrates strong interactions within chromosomes, while the lack of off-diagonal signals confirms proper chromosome-level assembly. The x and y axes represent the 31 chromosomes, with scale bars in Mb.

Genome annotation

To identify and mask repetitive elements, we employed both homology-based and de novo approaches. Briefly, a de novo repeat library was constructed using RepeatModeler v2.0.520. The resulting library was then combined with the Repbase database v21.1221 to identify repetitive sequences in the P. atrilineata genome using RepeatMasker v4.1.522. For noncoding RNA prediction, tRNA genes were predicted using tRNAscan-SE v2.0.623. Protein-coding gene annotation was performed using a combination of homology-based, transcriptome-based, and ab initio prediction methods. First, we used homologs from the selected Geometridae species8,9 (Table 1) as protein-based evidence for predicting gene sets using GeneWise v2.4.124. RNA-seq reads were mapped using HISAT2 v2.2.125, and ab initio prediction was conducted using AUGUSTUS v3.5.026, trained with the transcriptome data. To generate a comprehensive protein-coding gene set, we integrated annotations from all homology-based, transcriptome-based, and ab initio predictions using the GETA pipeline (https://github.com/chenlianfu/geta). Functional annotation of the predicted gene models was performed by searching against several databases, including Nr27, eggNOG28, Pfam29, GO30, and KEGG31.

In total, we predicted 15,026 protein-coding genes using a combination of ab initio prediction, homology-based searches, and RNA-seq data, of which 14,211 (94.57%) could be functionally annotated (Fig. 1B; Table 1; Table S1). This relatively low gene count is in line with those of other Geometridae species, such as O. brumata8 (16,912 genes) and E. grisescens9 (18,746 genes), and may reflect a characteristic feature of this moth family. The similarity in gene number across Geometridae suggests evolutionary stability in gene content rather than large-scale gene losses, though the biological significance of this relatively compact gene set warrants further investigation. Functional annotations were derived from multiple databases, with the NCBI Non-Redundant (NR) database and InterProScan providing the highest numbers of annotations (13,976 and 11,664 genes, respectively). The quality of the predicted proteome was assessed using BUSCO, which identified 98.0% complete and 0.8% fragmented genes, indicating a high-quality gene set and consistent with the genome-level BUSCO assessment.

Repetitive elements were identified and classified using a combination of de novo and homology-based approaches. In total, 133.66 Mb of repetitive sequences were detected, constituting 39.72% of the P. atrilineata genome assembly (Fig. 1C; Table 2). Interspersed repeats were the predominant category, spanning 98.08 Mb and comprising various transposable element families. These included Long Terminal Repeat (LTR) retrotransposons (7.63 Mb; 2.27%), Long Interspersed Nuclear Elements (LINEs; 21.14 Mb; 6.28%), DNA transposons (4.83 Mb; 1.44%), and Short Interspersed Nuclear Elements (SINEs; 240.61 kb; 0.07%). Additionally, a substantial portion of the repetitive content (64.23 Mb; 19.08%) was unclassified, highlighting the potential for novel repeat families in this species (Fig. 1D–H; Table 2). Furthermore, our analysis revealed the presence of 6,033 transfer RNAs (tRNAs) constituting 0.14% (437.03 kb) of the P. atrilineata genome assembly (Fig. 1I; Table 2).

Table 2 Comprehensive categorization of repetitive sequences in the P. atrilineata genome.

Data Records

The chromosome-level genome assembly of P. atrilineata has been deposited in the National Center for Biotechnology Information (NCBI) GenBank database under accession number JBJYIT00000000032. The raw sequencing data for Hi-C sequencing, PacBio HiFi, Illumina NGS RNA-seq, and Illumina NGS survey reads have been submitted to the NCBI Sequence Read Archive (SRA) under accession numbers SRR30872806, SRR30872807, SRR30872808 and SRR30872809, respectively. Additionally, the gene structure annotation, gene function annotation, and transposable element (TE) annotation files have been deposited in the Zenodo database33. The extracted coding sequences (CDS) and protein sequences have also been deposited in the Zenodo database34. The NCBI BioProject accession number for the sequences reported in this paper is SRP53640135.

Technical Validation

To assess the quality of the genome assembly, Illumina genomic and RNA-seq reads were mapped to the genome using BWA v0.7.1736 and HISAT2 v2.2.125, respectively. The completeness and accuracy of the genome were evaluated using Merqury37 and BUSCO v5.7.138 with the lepidoptera_odb10 database (Table 3). The mapping ratios of the Illumina short reads, PacBio HiFi reads, and transcriptome data were 97.29%, 84.87%, and 94.34%, respectively (Table 3). The number of ambiguous bases (N) per gigabase was 93, and the consensus quality value (QV) was 20.72, corresponding to an estimated per-base accuracy of about 99.15%. Benchmarking Universal Single-Copy Orthologs (BUSCO) analyses showed that the assembled genome contained 5,093 (98.5% of 5,284) complete sets of the core orthologous genes in the lepidoptera_odb10 database, which is comparable to the values for the two previously reported Geometridae species. Furthermore, coverage analysis conducted with mosdepth v0.3.319 using all HiFi reads identified the shortest pseudochromosome (Phat_Chr31) as a potential sex chromosome because its coverage depth of 67.59× is about 17% below the average coverage of 81.79× (Table 3). Together, these metrics support the quality of the P. atrilineata genome sequence.
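Two of the metrics above can be checked with simple arithmetic, sketched here in Python. The QV is a Phred-scaled value, so it converts to an error rate as 10^(−QV/10). The relative depth of Phat_Chr31 is the ratio of its coverage to the genome-wide average. The last function is an idealized assumption that is not part of the original analysis: it gives the expected relative depth of a Z chromosome in a pool of males (ZZ) and females (ZW), assuming every individual contributes the same amount of DNA. All function names are hypothetical.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value into an expected per-base error rate."""
    return 10 ** (-qv / 10)


def relative_depth(chrom_depth: float, mean_depth: float) -> float:
    """Coverage of one chromosome relative to the genome-wide average."""
    return chrom_depth / mean_depth


def expected_z_ratio(n_males: int, n_females: int) -> float:
    """Idealized Z-to-autosome depth ratio in a pooled sample (assumption:
    ZZ males, ZW females, equal DNA contribution per individual)."""
    z_copies = 2 * n_males + n_females
    autosome_copies = 2 * (n_males + n_females)
    return z_copies / autosome_copies


error_rate = qv_to_error_rate(20.72)           # about 0.0085, i.e. ~99.15% accuracy
chr31_ratio = relative_depth(67.59, 81.79)     # about 0.83
pooled_expectation = expected_z_ratio(5, 5)    # 0.75 for five males and five females

print(f"accuracy = {1 - error_rate:.4f}, Chr31 ratio = {chr31_ratio:.3f}")
```

The observed ratio and the idealized expectation need not match exactly, since unequal DNA contributions among pooled individuals and mapping effects both shift the measured depth.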

Table 3 Mapping statistics demonstrating assembly completeness and accuracy.