Introduction

Knowledge of the evolutionary history of Lepidoptera, one of the most emblematic insect orders, has advanced rapidly as studies employing different methodologies have revealed various aspects of the phylogenetics of this order. Notably, those based on the analysis of morphological characters1 and molecular genetic data2,3,4,5, including the analysis of whole mitochondrial genomes6,7,8,9, have greatly aided the in-depth understanding of the evolution of the order. Despite extensive studies, many of which have broad taxonomic representation, the evolutionary history within certain lepidopteran superfamilies remains unclear. This uncertainty is also noticeable in the basal or so-called non-Obtectomera superfamilies of the Apoditrysia crown-clade. The phylogenetic relationships of the 14–15 superfamilies considered to belong to this group10,11 remain poorly understood and are often described as unresolved and treated as a polytomy12. Phylogenetic studies on these superfamilies often yield contradictory results: studies applying different methodologies, such as morphological data combined with a few genes3 or integrating mitochondrial and nuclear regions13, have reached conflicting conclusions. Genome-scale molecular data is a powerful resource for reconstructing and understanding the evolutionary processes of challenging groups such as the basal apoditrysian superfamilies. However, until recently, these lineages lacked sequenced genomes14, and the absence of molecular genetic data has created substantial uncertainties surrounding the basal Apoditrysia superfamilies, despite the economic, taxonomic, and conservation importance of their constituent species. While ongoing genome sequencing efforts have begun to address this gap, the availability of genome-scale molecular data remains strikingly limited in some basal apoditrysian superfamilies.

Cossoidea is one of the well-recognized superfamilies within the non-obtectomeran Apoditrysia clade, and it includes Paracossulus, a monotypic genus represented by Paracossulus thrips (Hübner, 1818) (Cossoidea: Cossidae: Cossinae)15. This understudied species is a rare component of the Eurasian fauna with high conservation importance. Its distribution area spans longitudinally from southwestern Siberia (Altai region) to Central Europe (Hungary) and extends southward to Iran and Turkey16,17. At the westernmost edge of its range, the species persists in fragmented populations in Bulgaria18,19,20, Hungary21, Romania22,23, and Serbia20, with a single historical record from southeastern Poland24. In the European Union, it is protected by the EU Council Directive (NATURA2000 code: 4028) but is not included in the IUCN Red List. The species is classified as critically endangered in Bulgaria18, endangered in Hungary25, and vulnerable in Romania23. The species is threatened primarily by habitat loss due to urbanization, agricultural activities, or other infrastructural developments19,20,22.

The knowledge of P. thrips was severely limited for a long time, with only some recent studies20,22,26 providing new insights, including indications of its ecological demands. The first study22 employing molecular markers to study this species presented the first and, so far, the only available molecular genetic data of the species: partial cytochrome c oxidase I (COI or cox1) sequences of four specimens. This study confirmed the phylogenetic position of P. thrips within the Cossoidea superfamily, consistent with the earlier morphology-based taxonomic classifications, and confirmed the monotypic nature of the genus. However, to gain a deeper understanding of the broader phylogenetic relationships and to provide effective molecular tools for conservation management for this species, generating genome-level data is essential.

In this study, we aimed to sequence and assemble the mitochondrial genome of two P. thrips individuals from the same population. We then characterized these genomes to provide valuable resources for evolutionary and conservation research of the Cossoidea superfamily—a largely understudied group within the Lepidoptera order.

Results

Whole-genome sequencing

We obtained whole-genome sequencing datasets from two individuals on an MGI DNBSEQ-G400RS platform. Whole genome sequencing yielded 150 bp long paired-end reads, totaling 43.9 gigabasepair (Gbp) (292,983,070 reads) for the male individual (Sample ID: CAT07) and 40.7 Gbp (271,272,886 reads) output for the female individual (Sample ID: CAT08). Given that this dataset consists solely of short reads, it is not suitable for de novo assembly of a high-quality nuclear genome. Therefore, in this paper, we focused only on the organellar reads derived from these datasets.

Of the raw reads, 1,234,704 (0.185 Gbp, 0.42%) from the CAT07 sample and 949,376 (0.142 Gbp, 0.35%) from the CAT08 sample were aligned to the reference mitochondrial genome of Eogystia hippophaecolus. The mean depth of coverage of the aligned reads was 7,451 for CAT07 and 6,103 for CAT08.
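The yields and mapped fractions reported above follow directly from the read counts. The short sketch below reproduces that arithmetic for illustration only; it assumes that every raw read is a full-length 150 bp read, and the function and variable names are hypothetical rather than part of the original analysis.

```python
READ_LENGTH_BP = 150  # assumption: all raw reads are full-length 150 bp reads

# Read counts as reported in the text.
SAMPLES = {
    "CAT07": {"total_reads": 292_983_070, "mapped_reads": 1_234_704},
    "CAT08": {"total_reads": 271_272_886, "mapped_reads": 949_376},
}


def yield_gbp(n_reads, read_length=READ_LENGTH_BP):
    """Return the number of sequenced bases in gigabase pairs (Gbp)."""
    return n_reads * read_length / 1e9


def mapped_percent(n_mapped, n_total):
    """Return the percentage of reads that aligned to the reference."""
    return 100.0 * n_mapped / n_total


def summarize(samples):
    """Compute total yield, mapped yield, and mapped percentage per sample."""
    summary = {}
    for sample_id, counts in samples.items():
        summary[sample_id] = {
            "total_gbp": yield_gbp(counts["total_reads"]),
            "mapped_gbp": yield_gbp(counts["mapped_reads"]),
            "mapped_pct": mapped_percent(
                counts["mapped_reads"], counts["total_reads"]
            ),
        }
    return summary


if __name__ == "__main__":
    for sample_id, row in summarize(SAMPLES).items():
        print(sample_id, row)
```

The mean depth of coverage is not recomputed here because it depends on the read lengths remaining after trimming and filtering.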

General characteristics of the assembled mitogenomes

The assembled mitochondrial genomes were complete and circular with lengths of 15,395 bp for the male specimen (CAT07) and 15,385 bp for the female specimen (CAT08). Both mitogenomes contain 13 protein-coding genes (PCGs), 22 tRNA coding regions, two rRNA coding regions, an A + T-rich non-coding control region (CR), and 19 shorter (1–62 bp) intergenic non-coding spacers (Fig. 1; Table 1). The majority of PCGs and tRNA genes, along with the CR, are located on the heavy strand, whereas the nad1, nad4, nad4l, and nad5 PCGs, the tRNA-Cys (GCA), tRNA-Gln (UUG), tRNA-His (GUG), tRNA-Leu (UAG), tRNA-Phe (GAA), tRNA-Pro (UGG), tRNA-Tyr (GUA), and tRNA-Val (UAC) tRNA genes, and both rRNA genes are located on the light strand.

Fig. 1

The Paracossulus thrips and the circular maps of the two mitochondrial genome assemblies. A: A female specimen of P. thrips (Photo: Sándor Jordán); B: Circular map of the mitochondrial genome of the male (CAT07) specimen; C: Circular map of the mitochondrial genome of the female (CAT08) specimen. The bars surrounding the maps’ backbones represent the genetic regions identified within the mitogenomes. The orientation of the arrowheads at the ends of the bars indicates the strand on which each gene is encoded and the direction of translation for the protein-coding genes.

Table 1 Annotation of the two mitochondrial genome assemblies of Paracossulus thrips.

A total of 11 mutations were identified between the two assemblies (Table S2). These included six single nucleotide polymorphisms (SNPs), two insertion/deletion (indel) mutations, and three microsatellites. Among the SNPs, five were transitions and one was a transversion. Of these, one occurred in the A + T-rich non-coding region, one in a tRNA gene, and four in the PCGs. One of the mutations in the PCGs was silent, while the remaining three led to changes in the amino acid sequences of the encoded proteins. The microsatellites were exclusively located in the intergenic regions (Table S2).

The total base composition of the assembled mitogenomes was nearly identical between samples: A: 39.9%, C: 14.5%, G: 7.8%, and T: 37.8% in the CAT07 sample, with the CAT08 sample differing only in its T content (37.7%). Based on the base content, the AT-skewness was slightly positive (0.03), while the GC-skewness had a negative value (−0.30) in both mitochondrial genome assemblies (Table S3).

Protein-coding genes

The lengths of the PCGs ranged between 165 bp (atp8) and 1,737 bp (nad5). The majority of the PCGs initiated with ATN codons. Among these, the ATG start codon was observed in seven genes (atp6, cox2, cox3, cytb, nad1, nad4, nad4l), while five genes (atp8, nad2, nad3, nad5, nad6) were initiated with the ATT codon (Table 1). As an exception, the CGA start codon was identified in the cox1 gene (Table 1), which is commonly observed in Lepidoptera species27,28. Most PCGs terminated with the TAA stop codon; however, an abbreviated stop codon (T) was detected in the cox2 and nad4 genes. Four point mutations were identified across four distinct PCGs (cox1, cox3, nad1, nad4l) between the two mitogenomes (Table S2). These mutations, comprising three transitions and one transversion, all resulted in non-synonymous codon changes, regardless of their position within the codons. The transversion in the nad4l gene affected the third codon position, replacing the ATA (Met) codon with ATT (Ile). The transitions altered the first codon positions, leading to the replacement of GCC (Ala) for ACC (Thr) in cox1, AGA (Ser) for GGA (Gly) in cox3, and CCT (Pro) for TCT (Ser) in nad1 (Table S2).
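The codon changes listed above can be checked against the invertebrate mitochondrial genetic code (NCBI translation table 5). The following minimal sketch classifies a single-base codon difference as a transition or transversion and reports the encoded amino acids. It is an illustration under stated assumptions, not the procedure used in this study: the function names are hypothetical, and the codon table is written out by hand from the standard code with the table 5 reassignments (ATA = Met, AGA/AGG = Ser, TGA = Trp).

```python
BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order, NCBI translation table 5.
AA_TABLE5 = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSSSS"
    "VVVVAAAADDEEGGGG"
)
CODON_TO_AA = {
    a + b + c: AA_TABLE5[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def substitution_type(ref, alt):
    """Purine<->purine or pyrimidine<->pyrimidine is a transition."""
    if ref == alt:
        raise ValueError("bases are identical")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def describe_change(codon_a, codon_b):
    """Describe a codon pair that differs at exactly one position."""
    diffs = [i for i in range(3) if codon_a[i] != codon_b[i]]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    pos = diffs[0]
    aa_a, aa_b = CODON_TO_AA[codon_a], CODON_TO_AA[codon_b]
    return {
        "codon_position": pos + 1,
        "type": substitution_type(codon_a[pos], codon_b[pos]),
        "amino_acids": (aa_a, aa_b),
        "synonymous": aa_a == aa_b,
    }


if __name__ == "__main__":
    pairs = [("ATA", "ATT"), ("GCC", "ACC"), ("AGA", "GGA"), ("CCT", "TCT")]
    for pair in pairs:
        print(pair, describe_change(*pair))
```

Under table 5, each of the four codon pairs named in the text translates to two different amino acids, which is what the sketch reports.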

All PCGs were characterized by a generally higher proportion of A and T bases, with their content ranging from 68.1 to 88.5% (average = 76.7%). The AT-skewness had an average negative value (−0.13), ranging from −0.29 to −0.01, indicating a higher abundance of T over A in the protein-coding genes. GC-skewness varied between −0.68 and 0.53 (average = −0.09), reflecting a generally higher prevalence of C over G in the PCGs, except for a few genes (nad1, nad4, nad4l, and nad5) where G was more abundant (Table S3).

Analysis of amino acid frequency (Fig. S1) and relative synonymous codon usage (RSCU) (Fig. S2) showed that leucine (L), isoleucine (I), phenylalanine (F), and serine (S) were the four most frequently encoded amino acids, while cysteine (C) was the least common. Analysis of the RSCU also revealed a higher occurrence of codons with A or T, a commonly observed pattern in Lepidoptera mitogenomes29.
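For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below illustrates this calculation with the invertebrate mitochondrial code. It is a simplified, hypothetical implementation and does not reproduce the EZcodon tool: stop codons and codons containing ambiguous bases are ignored, and the codon table is hard-coded from NCBI translation table 5.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order, NCBI translation table 5.
AA_TABLE5 = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSSSS"
    "VVVVAAAADDEEGGGG"
)
CODON_TO_AA = {
    a + b + c: AA_TABLE5[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TO_AA)

    # Group synonymous codons by amino acid, skipping stop codons.
    families = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed / (total / family size)
            values[codon] = counts[codon] * len(family) / total
    return values


if __name__ == "__main__":
    # Toy sequence with four leucine codons: TTA, TTA, TTG, CTT.
    for codon, value in sorted(rscu("TTATTATTGCTT").items()):
        print(codon, round(value, 2))
```

An RSCU above 1 indicates a codon used more often than expected under equal usage; in an A + T-rich genome, this is typically seen for codons ending in A or T.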

Transfer RNA and ribosomal RNA genes

The lengths of the tRNA genes ranged from 64 bp to 72 bp. Based on the predicted secondary structures (Fig. S3), most tRNA genes exhibited a cloverleaf structure. An exception was observed in tRNA-Ser (UCU), which lacked the dihydrouridine arm, as in several other arthropod species7,9,28,30,31,32.

The 12S rRNA gene was 776 bp in both assemblies. In contrast, two length polymorphisms were detected in the 16S rRNA gene (Table S2), with lengths of 1,336 bp in the CAT07 and 1,334 bp in the CAT08 sample.

Consistent with the higher A and T base content in the mitogenomes, the tRNA and rRNA genes also exhibited higher A + T levels, ranging from 72.7 to 92.8% in the tRNA genes and from 82.0 to 85.2% in the rRNA genes. The AT-skewness values ranged from −0.10 to 0.15 (average = 0.02) in the tRNA genes, and from −0.04 to 0.03 in the rRNA genes, indicating an almost balanced ratio of A and T bases in these regions. The values of GC-skewness for the tRNA genes varied between −0.33 and 0.50 (average = 0.15). In contrast, the rRNA genes had consistently positive and closely similar GC-skewness values ranging from 0.36 to 0.44. This suggests a higher proportion of G bases compared to C bases in the ribosomal RNA-coding genes (Table S3).

Non-coding and overlapping regions

Several non-coding regions were identified in the assembled mitogenomes. The longest of these is a 375 bp non-coding region located between the 12S rRNA and tRNA-Met (CAU) genes (Table 1), often referred to as the control region. This region exhibited a characteristically high A + T content in both assemblies, at 93.3% in CAT07 and 93.6% in CAT08. The AT-skewness value was slightly negative (−0.06), while the GC-skewness was distinctly negative with an average of −0.39. These values indicate a nearly balanced ratio of A and T bases, and a higher proportion of C bases compared to the G bases (Table S3). This region contains several conserved motifs typically found in Lepidoptera mitogenomes. Among these, the ‘ATAGA’ motif and the subsequent poly-T run ((T)18) are located upstream of the 12S rRNA gene and are associated with the initiation of mitogenome replication33. Additionally, several (A)n (n = 2–8) and (T)n (n = 2–5) polynucleotide runs were identified throughout this region, including a shorter poly-A motif (4 bp long) at the end of this region, towards the tRNA-Met (CAU) gene, which is another frequently occurring feature in lepidopteran mitogenomes28.

The shorter intergenic spacers often consist of only a single base pair, while the longer ones may include repetitive structures (Table S2). The 17 bp long spacer found between the tRNA-Ser (UGA) and nad1 genes contains the ‘ATACTAA’ motif, which is conserved among Lepidoptera species and associated with transcription termination27,34.

Shorter overlaps (1–8 bp) were identified between several genes (Table 1). Among these, the ‘ATGATAA’ sequence found in the overlap between the atp8 and atp6 genes is conserved across the majority of the Lepidoptera order28.

Phylogenetic reconstruction

We conducted phylogenetic reconstructions using mitochondrial protein-coding genes to gain insight into the evolutionary relationships of the Cossoidea superfamily and to determine the phylogenomic position of Paracossulus thrips based on the currently available mitogenomes. The phylogenetic analyses yielded a consistent topology across all partitioning schemes (see Materials and Methods) and substitution models applied: all phylogenetic relationships were fully supported, excluding a single node at the basal position of Zeuzerinae (see below). Support for this basal node varied across analyses but remained low (Fig. S4). Since partitioning scheme (I) yielded the highest support for this node, it was used for our primary result (Fig. 2). The reconstructed trees demonstrated high or full statistical support for nearly all branches (Fig. 2, Fig. S4). The ingroup (Cossoidea: Cossidae) formed a fully supported monophyletic unit. The diversification of two major clades was revealed within the ingroup, corresponding to the Cossinae and Zeuzerinae subfamilies of the Cossidae family2,15. Our Paracossulus thrips specimens were placed as sister to Eogystia hippophaecolus on branches with full statistical support. Notably, these two species are the only representatives of the Cossinae subfamily, which formed a monophyletic group with full statistical support in all analyses (Fig. 2).

Fig. 2

The reconstructed phylogenetic tree of Cossoidea. The present tree was reconstructed using the ML approach based on the merged nucleotide sequences of 13 mitochondrial PCGs with nucleotide substitution models. The phylogenetic tree shown here was reconstructed with all genes treated as distinct partitions. Branch support values represent the SH-like approximate likelihood ratio test (SH-aLRT) and ultrafast bootstrap (UFBoot) with 10,000 replicates. Branches with full statistical support (100/100) are marked with an asterisk (*). Bars on the right-hand side denote classification at the (1) subfamily, (2) family, and (3) superfamily level. Results from alternative partitioning schemes or substitution models are provided in Supplementary Fig. S4 A–C.

The remaining ingroup species formed another fully supported clade corresponding to the Zeuzerinae subfamily. However, the basal placement of Phragmataecia castaneae within Zeuzerinae remains equivocal (Fig. 2, Fig. S4). Despite the uncertain basal relationships within Zeuzerinae, subsequent nodes were strongly supported. First, Chalcidica minea was found on a fully supported branch, followed by all Zeuzera species forming the crown clade of the Zeuzerinae subfamily. However, our results revealed taxonomic inconsistencies within the Zeuzera samples. Specifically, the two Z. multistrigata samples and the two Z. pyrina samples were not recovered as sister groups (Fig. 2, Fig. S4). It is important to note that the Z. coffeae sample (KJ508046.1), which was positioned next to one of the Z. multistrigata specimens in our analyses, had only a fragment of the mitochondrial genome, consisting of only five genes. This resulted in missing data for most genes in this sample. Despite this shortcoming, all branches within the Zeuzera genus were strongly supported. The observed polyphyly within Zeuzera makes it unlikely that missing data in the Z. coffeae sample caused the polyphyletic placement of other species’ samples. An alternative explanation could be the misidentification of some samples that provided the mitogenomes involved in these analyses. Still, hybridization between closely related species or incomplete lineage sorting could also be invoked. However, tracing the reasons behind the taxonomic inconsistency found between our Zeuzera samples is beyond the scope of this study.

Discussion

Paracossulus thrips is a rare and understudied moth species of the Eurasian steppe fauna. The accumulation of molecular data is highly desired for the conservation management of endangered species, underscoring the importance of genomic resources in this context35.

Herein, we present the first complete mitochondrial genomes of P. thrips, making it the second species within the Cossinae subfamily with a fully characterized mitogenome sequence. Given that this species belongs to the relatively understudied Cossoidea superfamily, these assemblies represent a valuable genetic resource for further research on this taxon.

The mitogenomes contained 13 protein-coding genes, 22 tRNA genes, two rRNA genes, a long A + T-rich non-coding region, and several short intergenic spacers ranging from 1 to 62 bp. The gene order follows the typical organizational pattern observed in most Lepidoptera mitogenomes28. The location of the A + T-rich non-coding region between the 12S rRNA and tRNA-Met (CAU) genes and the arrangement of the tRNA-Met (CAU), tRNA-Ile (GAU), and tRNA-Gln (UUG) genes (Fig. 1; Table 1) are consistent with features shared by Lepidoptera mitochondrial genomes28,31. Furthermore, numerous conserved nucleotide motifs, characteristic of Lepidoptera mitogenomes, were also identified in the assembled mitogenomes.

Our phylogenetic reconstruction, although limited by poor taxonomic representation owing to the scarcity of available mitochondrial genomes for the Cossoidea superfamily, still allowed us to draw informative taxonomic conclusions. The additional mitochondrial genomes included in the analyses were all from members of the Cossidae family, as other Cossoidea families lacked mitochondrial genome representation in databases at the time of this study. Our results showed taxonomic consistency, with two major clades corresponding to the Cossinae and Zeuzerinae subfamilies. The examined Paracossulus thrips samples were placed sister to Eogystia hippophaecolus, forming a monophyletic group representing the Cossinae subfamily. The close phylogenetic relationship between P. thrips and E. hippophaecolus is concordant with previous results22 based on the barcoding region of the cox1 gene.

The complete mitochondrial genome sequences of Paracossulus thrips provide a valuable genomic resource for the overlooked and underrepresented Cossinae subfamily within the Lepidoptera. In addition, these mitogenomes can be crucial resources for developing species-specific molecular markers that can be used for conservation management and biodiversity monitoring. Such markers enable non-invasive detection of the species’ persistence and distribution through environmental sample and biological remnant barcoding, facilitating taxonomic identification with basic laboratory equipment and skills36,37. Furthermore, these markers can be adapted for environmental DNA (eDNA) analyses, providing a powerful tool for species detection, particularly in difficult-to-survey habitats. Finally, these assembled mitogenomes enhance analytical accuracy by enriching reference databases38.

Materials and methods

Sample collection and DNA isolation

Two individuals, one male (Sample ID: CAT07) and one female (Sample ID: CAT08), were collected in accordance with the Hungarian nature conservation legislation (Act “1996. évi LIII. törvény a természet védelméről” 38. § (1) a) from one of the largest stable populations in Hungary near the settlement Vécs (location: N 47.7796° E 20.1532°), in late August 2020. The collection was conducted using an ultraviolet light trap and supervised by the zoological referee of the competent authority, the Bükk National Park Directorate. The sampling was carried out at the end of the species’ mating season. The female specimen was also checked to verify that she had laid all her eggs prior to collection. The final two specimens selected for data generation were euthanized in accordance with the guidelines of the American Veterinary Medical Association (AVMA). Collected individuals were preserved in silica gel and stored at +4 °C until DNA extraction to prevent DNA degradation. Genomic DNA was extracted from two legs of both individuals following the protocol of Bereczki et al. (2014)39. The DNA isolates were purified using AMPure XP magnetic beads (Beckman Coulter, Inc., Brea, CA, USA) following the manufacturer’s instructions. After the successful DNA extraction, the two specimens were preserved in the lepidopterological collection of the Department of Evolutionary Zoology, University of Debrecen (Hungary).

Whole genome sequencing

A randomly sheared genomic library was prepared from 500 ng of genomic DNA using the MGIEasy Universal DNA Library Prep Set Kit v1.0 (MGI Tech Co., Ltd., Shenzhen, China), following the manufacturer’s instructions. The library was sequenced on an MGI DNBSEQ-G400RS instrument using the FCL PE150 sequencing set.

Preprocessing of sequencing reads

To reduce the computational burden of de novo assembly, the reads were first aligned to the most closely related mitochondrial reference genome, Eogystia hippophaecolus40 (GenBank accession number NC_023936.1), the only member of the Cossinae subfamily with a published mitochondrial genome. Alignment was performed using BWA v0.7.1741, and then the aligned read pairs were identified using SAMtools v1.1042 for downstream analyses.
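Identifying aligned read pairs relies on the bitwise FLAG field of each alignment record in the SAM format. The sketch below shows how such a decision can be made from the FLAG value alone. The bit values are those defined in the SAM specification, but the selection criterion (both mates mapped, or at least one) is an assumption made here for illustration, since the exact filter is not stated above; the function name is hypothetical.

```python
# FLAG bits as defined in the SAM format specification.
FLAG_PAIRED = 0x1         # template has multiple segments (paired read)
FLAG_UNMAPPED = 0x4       # this read is unmapped
FLAG_MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def pair_is_aligned(flag, require_both=True):
    """Decide from a SAM FLAG whether a read pair counts as aligned.

    require_both=True keeps a pair only if both mates are mapped;
    require_both=False keeps it if at least one mate is mapped.
    Which rule applies is an assumption of this sketch.
    """
    if not flag & FLAG_PAIRED:
        return False
    read_mapped = not flag & FLAG_UNMAPPED
    mate_mapped = not flag & FLAG_MATE_UNMAPPED
    if require_both:
        return read_mapped and mate_mapped
    return read_mapped or mate_mapped


if __name__ == "__main__":
    for flag in (99, 73, 77):
        print(flag, pair_is_aligned(flag), pair_is_aligned(flag, False))
```

Keeping pairs in which only one mate maps can help retain reads that span regions diverged from the reference, at the cost of including more off-target reads.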

The quality of the mapped reads was assessed using fastqc v0.11.943, then quality filtered using fastp v0.20.144 with default settings. Due to the unbalanced base composition after approximately 100 base pairs (bp), the reads were trimmed to 100 bp, and sequences shorter than 95 bp were discarded using cutadapt v2.1045. Read error correction was performed using Bloocoo v1.0.646.
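The trimming and length-filtering step can be summarized in a few lines. The sketch below is a simplified stand-in for the behaviour described above, not a reproduction of cutadapt: it treats each read independently (how mates of discarded reads were handled is not specified here), and the function name and the tuple layout of a read are hypothetical.

```python
def trim_and_filter(reads, max_len=100, min_len=95):
    """Trim reads to max_len bases and drop those shorter than min_len.

    reads: iterable of (name, sequence, quality) tuples.
    Returns a list of trimmed (name, sequence, quality) tuples.
    """
    kept = []
    for name, seq, qual in reads:
        seq, qual = seq[:max_len], qual[:max_len]  # keep the first max_len bases
        if len(seq) >= min_len:
            kept.append((name, seq, qual))
    return kept


if __name__ == "__main__":
    toy_reads = [
        ("read1", "A" * 150, "I" * 150),  # trimmed to 100 bp
        ("read2", "C" * 97, "I" * 97),    # kept unchanged
        ("read3", "G" * 90, "I" * 90),    # discarded (< 95 bp)
    ]
    for name, seq, _ in trim_and_filter(toy_reads):
        print(name, len(seq))
```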

Assembly and analysis of the mitochondrial genome

The aligned reads were assembled using GetOrganelle v1.7.3.547, with the animal mitochondrial genome specified as the target (-F animal_mt). The assembled mitogenomes were annotated using the MITOS2 server (http://mitos2.bioinf.uni-leipzig.de/index.py)48, and the integrated MiTFi49 performed the tRNA gene annotation and secondary structure prediction. Additionally, the mitochondrial genome assemblies were annotated by annotation transfer using Liftoff v1.6.350 with the Eogystia hippophaecolus mitogenome annotation as a reference. For protein-coding genes (PCGs), if any differences were observed between the two annotations, the given sequences were carefully checked and aligned against the corresponding gene of E. hippophaecolus to determine the most accurate annotation. The annotated mitogenome maps were visualized using the Proksee server (https://proksee.ca/)51.

Base composition, including percentage values for the individual nucleotides and AT% and GC%, was calculated according to the base counts of the genes and the control region. These values were further used to compute AT- and GC-skewness to describe the base composition of the assemblies. The skewness calculations followed the formulae of Perna & Kocher (1995)52: AT-skew = (A − T)/(A + T) and GC-skew = (G − C)/(G + C), where A, T, G, and C represent the counts of their respective nucleotides. Relative synonymous codon usage (RSCU) values for the mitochondrial protein-coding genes (PCGs) were calculated using Ezcodon53 implemented on the EZmito server (http://ezmito.unisi.it/ezcodon)54, by applying the invertebrate mitochondrial genetic code.
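The skewness formulae translate directly into code. The following minimal sketch computes base composition, AT-skew, and GC-skew from a nucleotide string; the function names are hypothetical, and bases other than A, C, G, and T are simply ignored, which is an assumption of this illustration.

```python
from collections import Counter


def base_counts(seq):
    """Count A, C, G, and T in a sequence, ignoring any other symbols."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def skews(seq):
    """Return (AT-skew, GC-skew) following Perna & Kocher (1995).

    AT-skew = (A - T) / (A + T)
    GC-skew = (G - C) / (G + C)
    """
    n = base_counts(seq)
    at_total = n["A"] + n["T"]
    gc_total = n["G"] + n["C"]
    if at_total == 0 or gc_total == 0:
        raise ValueError("skew is undefined when A + T or G + C is zero")
    at_skew = (n["A"] - n["T"]) / at_total
    gc_skew = (n["G"] - n["C"]) / gc_total
    return at_skew, gc_skew


if __name__ == "__main__":
    # Toy sequence whose composition mirrors the rounded percentages
    # reported for CAT07 (A 39.9%, C 14.5%, G 7.8%, T 37.8%).
    toy = "A" * 399 + "C" * 145 + "G" * 78 + "T" * 378
    at_skew, gc_skew = skews(toy)
    print(round(at_skew, 2), round(gc_skew, 2))
```

A positive AT-skew indicates more A than T, and a negative GC-skew indicates more C than G on the analysed strand.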

Phylogenetic reconstruction

The phylogenetic relationships within the Cossoidea superfamily were inferred using the nucleotide sequences of the 13 mitochondrial PCGs (atp6, atp8, cytb, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4l, nad5, nad6). Due to the limited number of sequenced mitochondrial genomes for the Cossoidea species, incomplete and unannotated sequences were included in the analysis. Two species from the Yponomeutoidea superfamily were included as outgroups (Table S1). For the annotated mitogenomes, nucleotide sequences of the PCGs were downloaded from GenBank. When annotations were unavailable, the mitogenome was annotated by Liftoff v1.6.347 with the Eogystia hippophaecolus mitogenome annotation serving as a reference. The PCGs were then extracted, separated by gene, and aligned with MACSE v2.0655, applying the invertebrate mitochondrial code (-gc_def 5). Then the aligned sequences were concatenated with AMAS56, resulting in an 11,250 bp alignment. Maximum Likelihood (ML) phylogenetic tree reconstruction was performed with IQ-TREE v2.0.357. An edge-unlinked partition model that relied on ModelFinder Plus58 and the subsequent merging of genes (-m MFP+MERGE) was employed to identify the best-fitting evolutionary model for the dataset. Four runs were performed using different partitioning schemes: (I) merged sequences with all genes treated as distinct partitions (Fig. 2); (II) the first two codon positions and the third codon position defined as separate partitions (Fig. S4-A); (III) only the first two codon positions considered (Fig. S4-B); (IV) merged nucleotide sequences of the PCGs, using codon substitution models applying the invertebrate mitochondrial genetic code (Fig. S4-C). Statistical branch support was assessed using the SH-like approximate likelihood ratio test (SH-aLRT)59 and ultrafast bootstrap (UFBoot)60 with 10,000 replicates. A branch was considered statistically supported only if SH-aLRT ≥ 80 and UFBoot ≥ 95.
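Concatenating per-gene alignments while recording the coordinates of each gene is what makes gene-wise partitioning possible. The sketch below illustrates the idea in plain Python. It is not the AMAS implementation: the function name and data layout are hypothetical, and padding taxa that lack a gene with gap characters is an assumption made here to show how incomplete mitogenomes can still be included.

```python
def concatenate_alignments(gene_alignments, gene_order, missing_char="-"):
    """Concatenate per-gene alignments and return partition coordinates.

    gene_alignments: {gene: {taxon: aligned_sequence}}
    gene_order: list of gene names defining the concatenation order
    Returns (concatenated, partitions), where concatenated maps each
    taxon to its full sequence and partitions is a list of
    (gene, start, end) tuples with 1-based inclusive coordinates.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in gene_order:
        alignment = gene_alignments[gene]
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences of {gene} differ in aligned length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with missing_char.
            pieces[taxon].append(alignment.get(taxon, missing_char * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    concatenated = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return concatenated, partitions


if __name__ == "__main__":
    toy = {
        "atp8": {"taxon_A": "ATGATT", "taxon_B": "ATGATC"},
        "cox1": {"taxon_A": "CGAAAA"},  # taxon_B lacks this gene
    }
    seqs, parts = concatenate_alignments(toy, ["atp8", "cox1"])
    print(seqs)
    print(parts)
```

The partition coordinates returned by such a routine correspond to the gene boundaries that a partitioned ML analysis takes as input.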