Introduction

The natural world is subject to a continuous day-night cycle, with drastic changes in both intensity and spectral composition of the light environment1,2. For animals, this cycle represents a critical temporal framework within which they conduct their activities. A clear dichotomy often exists between diurnal (day-active) and nocturnal (night-active) animals, with many insects3,4, birds5, and mammals6 restricted to one of these activity periods, often called the diel niche. While binning into nocturnal and diurnal categories is a simplification, given that animals occupy varied activity periods, it allows the examination of diel patterns and how lifestyles shift over an evolutionary time scale7,8.

Shifting from bright to dim environments presents unique challenges for an animal’s sensory systems, especially vision. Mammals, for instance, have evolved a wide range of eye shapes, such as large corneas and pupils that maximize light capture during nocturnal forays9,10. In birds, eye size is linked to light habitat and foraging behavior11. Smaller animals, such as insects, are often limited by absolute eye size. Insects have evolved intricate visual systems, featuring compound eyes with distinct arrangements of ommatidia. While mammals use muscles to contract their pupils, insect eyes have a system of migrating pigment that can bend and manipulate light, pooling it as required to increase sensitivity or resolution12. These eyes are often categorized into two broad classes: superposition and apposition eyes, each tailored to suit the animal’s preferred light environment13,14.

Adapting to dim environments entails adjustments of the visual and nervous systems in different ways. For example, many animals, including humans, rely less on color vision and more on sensitive monochromatic vision at night15,16. Similarly, nervous systems slow down and often sacrifice spatial and temporal resolution to increase sensitivity in the dark17. Studying visual systems often requires time-intensive techniques like behavioral observations or electrophysiology. However, advances in accessible genome and transcriptome sequencing methods have opened new avenues for investigating the genetic underpinnings of these adaptations, although inferring absolute sensitivity and temporal resolution remains difficult. In birds, it was found that nocturnal owls have a reduced set of color vision genes compared to diurnal species18. The same variation has been reported in fishes, where species adapted to bright environments have a greater set of color vision genes than species that live in the deep sea19. In insects, multiple cases of gene duplications and losses have been reported, owing to the strong selective pressure imposed by light availability20,21. Butterflies and moths (Lepidoptera) are a prime example, where duplications and color vision gene diversification are much more prevalent in diurnal species than in species that are active at night22,23,24. This diversification of color vision genes aligns with the over 100 diel transitions recorded in Lepidoptera, featuring multiple evolutionary switches between diurnality and nocturnality25.

Butterflies have emerged as a model system to explore the evolution of color vision in insects26. Hedylidae, commonly known as American moth-butterflies, comprise a single genus, Macrosoma, with 36 described species27 and stand out among their butterfly counterparts in that nearly all species are nocturnal. Marked by moth-like attributes such as filiform antennae and nocturnal flight, hedylids were long classified within the moth superfamily Geometroidea28. Hedylids also possess tympanal ears on their wings, which they use to detect echolocation and defend themselves from nocturnal bat predation29,30,31. Recent phylogenetic analyses reveal that Hedylidae likely diverged from their sister group, skippers (Hesperiidae), approximately 95 Mya32,33, while their own diversification occurred around 30 Mya33. An ancient split from diurnal butterfly lineages suggests the potential relaxation of selective pressures reinforcing diurnality in ancestral hedylids, possibly leading to their transition to nocturnality. Because they are nested in a predominantly diurnal clade, hedylids are an ideal model to study how shifts to a new light environment influence gene evolution.

We present the first annotated genome assembly of the hedylid, Macrosoma leucophasiata, generated with PacBio HiFi sequencing. We examined phototransduction genes in our new genome and compared them to those in existing genome assemblies of a set of available diurnal and nocturnal Lepidoptera species. Since the core set of phototransduction genes exhibits a degree of constancy across species with different diel niches, and many single-copy D. melanogaster genes are conserved in Lepidoptera34, we expected to see a similar phototransduction gene repertoire in M. leucophasiata. We focused our tests on opsins because much is known about their function, and because of their strong genotype-phenotype links and key role in optimizing vision. We first built opsin gene trees to test whether opsin genes from M. leucophasiata would cluster with those of nocturnal moth species. This was our prediction because M. leucophasiata is nocturnal, and its opsins might therefore form a clade that excludes those of diurnal species. Conversely, forming a clade with other butterflies might imply shallower changes or different constraints during an evolutionary shift to nocturnality. Furthermore, we conducted a test to infer the amount of selection pressure on these genes. We mapped sites under selection to predicted 3D structural protein models, comparing their proximity to functional domains like the retinal binding pocket. By conducting these analyses, we aimed to understand how opsin evolution correlates with visual adaptation to nocturnality. These analyses provide new insights into molecular evolutionary adaptations associated with species’ changes to new light environments.

Results

Contig-scale functionally annotated reference genome assembly

We sequenced the genome of the hedylid, Macrosoma leucophasiata, using PacBio high-fidelity (HiFi) sequencing (Supplementary Fig. 1). A total of 2.85 million HiFi reads were obtained, resulting in 32x read coverage. A preliminary survey using Genomescope 2.0 estimated the genome size to be 452 Mbp by k-mer analysis, with a heterozygosity rate of 1.61% (Supplementary Fig. 2). The curated Hifiasm assembly resulted in a genome size of 616 Mbp comprising 66 contigs and N50 value of 22.3 Mbp (Supplementary Data 1). The draft assembly was generated after filtering haplotigs via purging and non-target sequence removal using BlobTools35. We identified and removed 152 contigs containing 2.82 Mbp (0.11%) of contamination linked to Euglenozoa from the draft assembly. Similarly, haplotig purging led to the removal of duplicated and mismatched contigs (Supplementary Data 2). We assessed BUSCO completeness at each step to ensure consistency and prevent the loss of meaningful information. Assembly statistics and BUSCO scores are shown in Table 1.
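As a reminder of how the contiguity statistic reported above is defined, the following minimal Python sketch computes N50 from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from this study, whose assembly statistics were produced with QUAST.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with hypothetical lengths (not the M. leucophasiata contigs).
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

Because the statistic depends only on the sorted contig lengths, it can be recomputed after each curation step (purging, decontamination) to track how contiguity changes.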

Table 1 Summary statistics of the Macrosoma leucophasiata genome assembly

The mitochondrial genome of M. leucophasiata was also assembled into a single contig with a length of 15,209 bp. The mitogenome was annotated with 13 protein-coding genes, 22 tRNA genes, and two ribosomal RNA genes. The mitogenome is similar in size to the mitogenome of M. conifera (15,344 bp)36.

To assess the extent of repetitive sequences in the final assembly of M. leucophasiata, we modeled and soft-masked repeat regions using RepeatModeler237 and RepeatMasker38. A library of 1082 transposable elements (TE) repeat families was generated corresponding to the Dfam database39. Repeat sequences accounted for 56.79% (349 Mbp) of the assembly. For gene annotation, we used the resulting soft-masked genome to run the BRAKER3 pipeline with protein evidence from the OrthoDB v1140 catalog for Arthropoda to obtain the gene model. A total of 19,292 transcripts from 18,155 protein-coding genes were predicted in the resulting gene model, with BUSCO scores of 92.9% completeness including 83.3% that was single-copy and 9.6% that was duplicated (Supplementary Table 5 and Supplementary Data 3). We annotated predicted genes with gene function, primarily using eggNOG-mapper41 and by performing sequence-similarity searches against the Swiss-Prot Arthropoda database in DIAMOND v2.0.942. A total of 13,169 genes (68.26% of predicted) were functionally annotated in eggNOG (Table 1). Gene ontology (GO) terms for 6598 (~50%) of the annotated genes were obtained (Supplementary Data 4).

Gene family evolution analysis suggests conservation in vision-related genes

Our ML tree based on 3376 BUSCO single-copy orthologs was well-supported (100% support for SH-aLRT and ultrafast bootstrap, Fig. 1). Relationships of butterfly families were consistent with recent phylogenetic studies; Hedylidae and Hesperiidae formed a sister group and Papilionidae (swallowtails) were recovered as the first branching family within the Papilionoidea32,33. We identified 26,990 hierarchical orthogroups (HOGs) with OrthoFinder (Supplementary Data 5). These orthogroups were generated from the protein sequence of the primary transcripts of Augustus gene models (Supplementary Table 1). We evaluated gene repertoire size evolution on the branch leading to our focal species, M. leucophasiata, using CAFE43. We found that 64 and 8 HOGs are under rapid gene expansion (repertoire size increase) and contraction (repertoire size decrease), respectively. These HOGs include many retrotransposon-related genes, odorant receptors, and cytochrome P450 genes. However, none of the candidate vision-related HOGs had a significant repertoire size change (Supplementary Data 6).

Fig. 1: Analysis of visual genes shows high conservation within day and night flying Lepidoptera with a DAGL gene duplication in Macrosoma leucophasiata.

Maximum likelihood tree of 20 selected lepidopteran species with high quality genomes, representing seven butterfly and eight moth families (left). The tree was built from 3376 BUSCO single-copy orthologs, and results show 100% support with SH-aLRT and ultrafast bootstrap (black dot at nodes). The diel niche of each species is indicated by orange suns (diurnal) and purple moons (nocturnal). A gene count matrix of phototransduction-related gene families, showing mean gene copy number (right).

We searched for vision genes using their predicted protein sequences and compared them to a database of 32 phototransduction-related gene families (from Macias-Muñoz et al.34). We used BLASTp and identified 142, 149, and 180 putative phototransduction-related genes, clustered in 179 HOGs, from Danaus plexippus, Heliconius melpomene, and Manduca sexta, respectively. A total of 3503 genes from the 20 species in these 179 HOGs were considered putative phototransduction-related gene orthologs. Keyword searches in EggNOG annotations of the remaining 17 species resulted in the extraction of 23 additional HOGs (218 additional genes). The unfiltered vision model from the 20 species thus consisted of 3721 genes in 202 HOGs (Supplementary Data 7). A total of 345 genes were removed, and three HOGs lost all their genes after filtering out putative vision genes without corresponding function (defined from the EggNOG annotation; Supplementary Data 8). Among the filtered genes, 90% (2995 genes) were single copy, and 10% were duplicated. Among duplicated orthologs, 45 genes (1.35%) from 26 HOGs met the first criterion where the sum of sequence lengths was in the range of mean ± standard deviation of other single-copy sequences, 40 genes (1.2%) from 22 HOGs passed the second criterion, and 34 genes (1%) from 21 HOGs passed all criteria and were assembled into a single sequence. The false duplication correction resulted in a vision gene count matrix consisting of 3316 genes in 199 HOGs for these 20 species (Supplementary Data 9). According to this gene count matrix of 20 species, repertoire sizes of vision genes are generally consistently single-copy, with greater copy number variations found in the trp and ninaC families (Supplementary Data 9). Notably, four of six orthologs of the innexin gene family are missing in M. leucophasiata (Fig. 1). A synteny comparison of M. leucophasiata and two skipper butterflies (Pyrgus malvae and Thymelicus sylvestris) shows that the four innexin genes and 84 other genes are missing from the M. leucophasiata genome due to the absence of a ~1.7 Mbp DNA segment at the tail of contig ptg000026l (Supplementary Fig. 3). This contig segment maps to a region along a different chromosome of T. sylvestris, suggesting a chromosome rearrangement or assembly misjoin. Among the 92 genes in the putative orthologous region at chromosome 4 of T. sylvestris, only four were found in two contig assemblies (Supplementary Fig. 3). Furthermore, all 33 putative missing BUSCO genes were recovered from fragmented unitig sequences (sequences in p_utg.gfa file), suggesting the presence of these genes in M. leucophasiata (Supplementary Table 2).
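The first false-duplication criterion described above (summed fragment length within mean ± standard deviation of the single-copy ortholog lengths) can be sketched as follows. This is only an illustration under stated assumptions: the function name `passes_length_criterion` is hypothetical, the sample standard deviation is assumed because the text does not say which estimator was used, and the second and later criteria are not modeled.

```python
from statistics import mean, stdev


def passes_length_criterion(fragment_lengths, single_copy_lengths):
    """Return True if the summed length of putative duplicate fragments
    falls within mean +/- SD of single-copy ortholog lengths.

    Assumption: sample standard deviation (statistics.stdev); the source
    does not specify the estimator.
    """
    if len(single_copy_lengths) < 2:
        raise ValueError("need at least two single-copy lengths")
    mu = mean(single_copy_lengths)
    sd = stdev(single_copy_lengths)
    total = sum(fragment_lengths)
    return mu - sd <= total <= mu + sd


# Hypothetical protein lengths (amino acids) for one orthogroup.
single_copy = [400, 410, 390, 405, 395]
print(passes_length_criterion([180, 215], single_copy))  # fragments of one gene?
print(passes_length_criterion([400, 395], single_copy))  # two full-length copies?
```

The intuition is that two fragments of a single split gene model should sum to roughly one typical ortholog length, whereas two genuine full-length copies sum to about twice that length and fail the check.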

The 32 phototransduction-related ML gene family trees differed, but genes from butterfly species generally clustered together (Supplementary Data 10). Notably, three of the seven opsin genes (BRh, LWRh, and UVRh) in M. leucophasiata were grouped with nocturnal moths rather than butterflies, with strong branch support for this grouping in BRh and moderate support in LWRh and UVRh based on SH-aLRT, ultrafast bootstrap, and transfer bootstrap expectation (TBE) branch support metrics (Fig. 2 and Supplementary Data 11). Surprisingly, this pattern was only found in a few other genes (Supplementary Data 10). Approximately Unbiased (AU) tests, conducted on BRh and UVRh genes, rejected a monophyletic butterfly grouping (p = 0.006 and 0.002, respectively). This result implies that BRh and UVRh genes found in M. leucophasiata likely did not share a single evolutionary origin with diurnal butterflies.

Fig. 2: Opsin gene family tree showing convergence of visual opsins of the nocturnal butterfly, Macrosoma leucophasiata, with distantly related nocturnal moths.

Tree shows branches under positive selection. Branch color indicates diel activity (blue = nocturnal; light orange = diurnal) and their predicted ancestral diel state. Stars indicate branches or nodes detected to be under positive selection by the aBSREL test for episodic diversification. Circles at nodes indicate ML branch support, with the top half of each circle representing SH-aLRT values, and the bottom half representing ultrafast bootstrap values. Node fill indicates the amount of support; black (>70), grey (50–70), white (<50). Only key nodes of relevance are accompanied with circles. For a tree with support values for all nodes, see Supplementary Files.

Positive selection on opsin genes

Our selection analyses on opsin gene trees uncovered evidence of positive selection acting on multiple branches. Specifically, we tested the hypothesis that opsin genes in nocturnal and diurnal taxa vary in their rates of selection (dN/dS ratio) owing to adaptive (positive) selection. We used branch-site (aBSREL44) and site-substitution (MEME45) models for individual opsin gene trees to detect branches and sites under selection. Gene tree reconciliation was performed using GeneRax46, a maximum likelihood species tree-aware gene tree inference software (see Methods). BRh showed significant evidence of positive selection on the branch leading to A. ipsilon (LRT 11.01, p = 0.018).

MEME and aBSREL from the HyPhy suite categorize codons into ω rate classes (ω1, ω2, ω3) based on varying evolutionary pressure. ω1 signifies neutral evolution, ω2 indicates purifying selection, and ω3 suggests positive selection. In UVRh1, the clade encompassing all nocturnal taxa, including M. leucophasiata, showed evidence of diversifying selection (LRT 9.89, p = 0.037). The inferred ω rate classes among the branches were evenly split (ω1 = 49%, ω2 = 51%) in the UVRh1 phylogeny with no ω3 classes identified. Similarly, LWRh opsins showed a signal of positive selection with inferred fluctuating selection pressure on nodes representing four nocturnal taxa with LW duplications (LRT 13.65, p = 0.007) (Fig. 2). In pteropsin, we identified sites on two branches of the nocturnal clade classified as being under positive selection (LRT 9.41, 40.96; p = 0.047, <0.001) including a disproportionately higher distribution of ω2 rate class (62% of branches, 92% of tree length). We recovered the M. sexta duplication of pteropsin, previously noted by Macias-Muñoz et al. 34, and detected a high proportion of sites with dN/dS >1 on that branch. No significant signal of selection was observed for the RGR-like and Unclassified (UnRh) opsins. Branches under positive selection are annotated on the opsin gene family tree (Fig. 2), and selection statistics are provided (Supplementary Table 3).

Furthermore, we conducted a comprehensive examination of branches identified as being under positive selection. We examined specific sites within these genes and compared them, following the methods of Smith et al.45. Based on the likelihood ratio test, we found that episodic diversifying selection has acted on multiple sites in opsin sequences (Supplementary Data 12). Sites detected to be under diversifying selection were examined using interactive plots generated by ObservableHQ. We used the Empirical Bayes Factor evidence ratio as an exploratory tool to assess the support for positive selection at reported sites. Notably, multiple sites in the BRh, LWRh, and UVRh1 sequences of M. leucophasiata were observed with strong support for positive selection (Supplementary Fig. 4).

Mapping sites under selection to protein models

UV (UVRh) and blue (BRh) opsins recovered seven predicted transmembrane helices and the two LWRh opsins recovered 5 and 6, respectively (Supplementary Fig. 9). Since opsins are known to have 7 helices, we modeled the 3D structure of only the UV and blue opsins, using the jumping spider opsin structure (6i9k) as a template. We mapped 16 sites (from MEME) in the blue opsin and 5 sites in the UV opsin. After aligning each model with the spider opsin template to infer the position of retinal, we compared the proximity of these sites to the retinal binding region (within 4 Å of retinal). Four amino acids of the blue opsin were relatively close (~3–6 Å) to the retinal binding pocket, and may influence wavelength sensitivity (Fig. 3). For UV, we did not find any overlap between residues under selection and retinal binding residues, with only 2 of 5 amino acids (S88 and G149) appearing on the 3D predicted structure. G149 was moderately close to the retinal binding region (~7–8 Å), but not as close as the blue opsin amino acids (Fig. 3).
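The proximity comparison can be illustrated with a minimal Python sketch that measures the closest approach between the atoms of a residue and the atoms of retinal. The function names, the use of the closest atom pair (rather than, for example, Cα positions), and the toy coordinates are assumptions made for illustration; the text does not specify how the distances were measured.

```python
import math


def min_distance(residue_atoms, ligand_atoms):
    """Smallest Euclidean distance (in Å) between any residue atom and
    any ligand atom; each atom is an (x, y, z) tuple."""
    return min(math.dist(a, b) for a in residue_atoms for b in ligand_atoms)


def within_binding_region(residue_atoms, ligand_atoms, cutoff=4.0):
    """True if the residue lies within `cutoff` Å of the ligand
    (4 Å mirrors the retinal binding region definition in the text)."""
    return min_distance(residue_atoms, ligand_atoms) <= cutoff


# Toy coordinates in Å (hypothetical, not taken from 6i9k or our models).
residue = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
retinal = [(4.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(min_distance(residue, retinal))
print(within_binding_region(residue, retinal))
```

In practice the coordinates would be read from the aligned model and template files; that parsing step is omitted here to keep the sketch self-contained.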

Fig. 3: Opsin protein modeling reveals the proximity of functional domains and sites under positive selection.

A Blue (BRh) opsin Protter model of M. leucophasiata showing predicted transmembrane helices, sites under selection (brown circles), sites surrounding the retinal binding region (RBR) (blue circles), and sites under selection and close to RBR (red circles). B A 3D protein model created using the jumping spider opsin (6i9k) with the position of the retinal inferred by aligning it to the spider template. Sites inferred to be important to spectral tuning are close to the retinal binding region, see amino acids and approximate distance: Met79 (3 Å), Cys297 (5.5 Å), Cys195 (3.5 Å), and Cys202 (4.0 Å).

Discussion

We generated the first, high-quality genome of Hedylidae to understand the genetic components underlying transitions to nocturnality. While visual opsin genes were a major focus, we also examined a range of phototransduction genes that displayed a conserved pattern of gene copy number. We employed a comparative genomics approach, analyzing multiple high-quality genomes of butterflies and moths to gain evolutionary insights into their visual systems. We also examined discordant gene tree topologies and signatures of diversifying selection, potentially tied to the emergence of nocturnality in this unusual butterfly lineage.

New genome assemblies are now being generated at a rapid pace, and they have created the opportunity to study molecular signatures of adaptation across diverse lineages. Genomic studies of non-model species have uncovered a wealth of new knowledge on diverse topics, such as polyphagy-linked gene family expansions47, long-distance moth migration48, genome size evolution49, and opsin evolution across Lepidoptera23. However, this influx of genomic data varies in taxonomic coverage and quality, often obscuring the genetic underpinnings of non-model species50,51. Within Lepidoptera, Hedylidae and their sister group Hesperiidae (skippers) demonstrate this disparity in genomic resources. In Hesperiidae, ~12% (n = 420) of the total species have publicly available reference genomes50,52 that have been widely utilized in evolutionary studies53,54, while Hedylidae have none. Previous studies have highlighted the need to address the paucity of hedylid genomes32,55,56.

Gains and losses of phototransduction genes are generally thought to be infrequent in insects despite a wide range of photic niches covered by different species34,57. Exceptions include opsin genes that have undergone numerous gene duplication events in dragonflies and damselflies58, and parallel losses of phototransduction genes in subterranean water beetles (Coleoptera: Dytiscidae)59. This trend was also supported by our orthology analysis (Fig. 1), with some exceptions in specific taxa and gene families, including duplications of opsin genes in many butterfly species. Notable duplications are also found in individual species, such as the Diacylglycerol lipase (DAGLβ) gene in M. leucophasiata and H. pyritoides, the Ddc gene in M. sexta and B. betularia, the Pis gene in M. sexta and the Pid gene in D. plexippus.

Among phototransduction gene families, opsins are the most extensively studied for gene duplications and losses, with LWRh exhibiting the greatest variability in insects60. Indeed, our opsin gene family tree shows more gene duplications in the LWRh clade than in any other (Fig. 2). The accumulation of LWRh in butterflies has been found to modify or expand spectral sensitivity60,61. Our analysis did not indicate an LWRh gene duplication along the branch leading to M. leucophasiata, suggesting that the adaptation to night vision was not due to an LWRh gene duplication.

Four of the six innexin genes were missing in our M. leucophasiata assembly. The synteny analysis showed that the four missing genes are located along a DNA segment that is purportedly the orthologous region of chromosome 4 of T. sylvestris (Supplementary Fig. 3). Hifiasm contig and unitig assemblies revealed that the 33 BUSCO genes located in this region are also missing in the contig assembly, but can be recovered from fragmented unitig sequences (Supplementary Table 2). Thus, the absence of the four innexin genes is likely an artifact of sequencing and/or assembly. Since unitig sequences are very fragmented and BUSCO genes from this region are mostly duplicated (identified from several overlapped small unitigs), we excluded unitigs from the final assembly. Although the genome was assembled with high-quality HiFi reads with deep sequence coverage (32x), we cannot rule out the possibility that rearrangement events such as duplication occurred in this particular region and interfered with the assembly process. In Drosophila, innexin proteins were found to form gap junction channels and to play important roles in nervous system development during embryogenesis62, as developmental defects were observed when depleting or down-regulating innexin genes63,64. In addition, inx2 and inx3 play significant roles in vision, including eye disk development and possibly phototransduction processes in different insect species65,66,67,68. Macrosoma possesses adaptive characteristics for nocturnal vision, including larger relative eye size and an abundance of corneal nipples on the facet of the compound eye30. Experiments that test the relationship between adaptive eye morphology and the gain and loss of innexin genes may improve our understanding of the evolutionary basis of nocturnal butterfly vision. Finally, an unexpected DAGLβ gene duplication in M. leucophasiata was detected. DAGLβ was single-copy in most species studied and is thought to be involved in the phototransduction cascade and neural development, although its function has not been experimentally tested34. The two DAGLβ copies of M. leucophasiata are in a sister clade in our gene tree, suggesting that this duplication likely occurred after the origin of Macrosoma. The rare duplication of the DAGLβ gene in M. leucophasiata implies a potential role in the species’ adaptation to night vision.

Visual opsins play a key role in initiating the phototransduction cascade and are known to be considerably variable in their repertoire size across the animal kingdom. They have been well studied, and the link between sequence and wavelength sensitivity is becoming increasingly better understood. While non-adaptive mechanisms can affect the evolution of opsin sequences, adaptive forces such as changing light environments are more likely to cause consistent diversification22. Intense selective pressure imposed by changing light environments on insect visual systems has led to changes in the diversity of the opsin gene repertoire to adapt to altered sensory demands. Insects, due to their varying lifestyles, often exhibit distinct patterns of opsin gene expression. Notable differences in opsin expression between diurnal and nocturnal Lepidoptera were reported by Macias-Muñoz et al. 34, contrasting with Akiyama et al. 69 who found no significant differences. This suggests that while light environments can influence opsin expression and loss, other factors may also play a role in shaping these patterns.

Molecular adaptations of visual opsins are hypothesized to accompany diel niche transitions in Lepidoptera, conferring enhanced sensitivity under dim conditions70. We observed the clustering of hedylid opsins with moths in our dataset, which runs counter to species relationships. While other factors may be involved, the grouping of M. leucophasiata with nocturnal moths hints at possible convergent sequence evolution stemming from shared selective pressures. Previous studies used selection analyses to examine such discordance and characterize genes that evolve more rapidly than expected under neutral evolution71,72,73. We applied rigorous branch-site (aBSREL) and site-specific (MEME) models in HyPhy to identify specific lineages experiencing diversifying selection and found signatures of episodic diversifying selection across multiple opsin sequences (Supplementary Data 12). Positively selected sites and evidence of diversifying selection on the branches of blue (BRh) and UV (UVRh1) opsin sequences leading to M. leucophasiata hint at an adaptation to a crepuscular niche. Testing these amino acids and their proximity to the retinal binding region revealed that there is likely a shift in the wavelength response of the BRh, with four amino acids surrounding the region, although further functional testing of these amino acids is necessary. We find less evidence for functional pockets in the UV opsin. However, two amino acids, 88S and 149G, which do not directly appear to interact with retinal, were also identified as sites under positive selection in a previous study on Lepidoptera opsin evolution. These two amino acids are about 60 amino acids apart and appear on the same region of the 2D structure. In contrast, the blue opsin had minimal overlap in positively selected sites between the present study and that of Sondhi et al.22, with the hedylid blue opsin showing a greater number of positively selected sites, indicating that a larger amount of diversification may have taken place in this gene in this species. One of the copies of the LW opsin had only five predicted transmembrane domains, possibly signifying that it is non-functional in the same capacity as duplications in other clades that have retained the function.

Sensitivity to UV is important for night-active insects as it can aid integral behaviors such as detecting visual cues from night-blooming flowers74. Alternatively, long-wavelength (LWRh) opsin-expressing photoreceptors may play an even more critical role in maintaining sensitivity in dim-light environments. We found that LWRh opsin sequences cluster together, but there was no evidence for positive selection along the hedylid branch (Fig. 2). It is possible that M. leucophasiata has retained extended spectral sensitivity to longer wavelengths, which is prevalent and thought to have recurrently evolved in butterflies61,75. Although multiple LWRh sites were identified as being under diversifying selection, our analyses were limited by the lack of a proper characterization of spectral tuning sites, which would help determine whether these sites influence the visual range of LW opsins. While our selection analyses on opsins yielded valuable insights, we were unable to obtain RNA sequence data, which may have helped refine our gene models and further validate exon-intron boundaries. Studies on various organisms have also uncovered multiple opsin functions, such as in controlling circadian rhythms, neuronal signaling, and developmental patterning76,77. Future investigations should integrate opsin expression patterns and functional assays to help tease apart the relative contributions of visual adaptation versus other opsin-mediated processes in driving the molecular evolution of visual genes in nocturnal insects.

Conclusions

Our genome assembly of Macrosoma leucophasiata provides the first high quality reference genome for the nocturnal butterfly family Hedylidae (Table 1). Addressing this knowledge gap collectively reinforces the significance of genomics in broader ecological and evolutionary contexts. Beyond vision genes, our assembly can elucidate genetic underpinnings of other aspects of hedylid biology, such as the evolution of tympanal ears, circadian rhythms, and hostplant associations. Findings from selection analyses reported here can likely be examined further with transcriptomic profiling across diel categories. Future efforts to improve the assembly could also focus on generating chromosomal-scale data with synteny mapping. Our results showcase the power of leveraging genomic resources across lineages occupying diverse ecological niches.

Materials and methods

Sample preparation and sequencing

Two adult individuals of Macrosoma leucophasiata were collected in August 2016 at the Wildsumaco Biological Station, Napo, Ecuador (0°40'17.2"S, 77°35'55.1"W, 1400 m a.s.l., see permit information below). Butterfly wing vouchering and tissue storage methods followed Cho et al. (2016).

We extracted high molecular weight DNA from muscle tissue obtained from the abdomen and thorax using a Qiagen Genomic-tip DNA extraction kit following a previous study78. Following DNA extraction, DNA was visualized using pulsed-field gel electrophoresis. DNA was sheared to ~20 kbp using a Diagenode Megaruptor 3. We size-selected the DNA for fragments greater than 10 kbp using a Sage Science BluePippin (Beverly, MA, USA) for library preparation. We prepared a PacBio HiFi library with the SMRTbell Express Template Prep Kit 2.0 (Menlo Park, CA, USA). The library was sequenced on a PacBio Sequel II on an 8 M SMRT cell in CCS mode with a 30-h movie time. Library preparation and DNA sequencing were conducted at the DNA Sequencing Center at Brigham Young University (Provo, Utah, USA).

Genome profiling and coverage estimation

We performed quality control checks on the raw high-fidelity (HiFi) sequence reads using FastQC v 0.11.779. K-mer density distribution was assessed from HiFi reads using K-Mer Counter (KMC) v.3.2.180, with the k-mer length set to 21 nucleotides. We predicted genome size and heterozygosity, and assessed genome quality, from the k-mer distribution analyses in GenomeScope v2.081. A GenomeScope 2.0 profile (Supplementary Fig. 2) was generated using default settings for a diploid species. We used the resulting profile and other metrics to inform the choice of assembly parameters and allow stringency in error correction.
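To make the k-mer profiling step concrete, the sketch below builds a k-mer multiplicity histogram of the kind that is passed to a genome profiling tool. It is a pure-Python illustration rather than the KMC implementation: the function names are hypothetical, and collapsing each k-mer with its reverse complement (canonical counting) is an assumption stated here, not a detail given in the text.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands are counted as one k-mer (assumption)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=21):
    """Map multiplicity -> number of distinct canonical k-mers."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Toy read with k = 3 for readability (the study used k = 21).
print(kmer_spectrum(["ACGTA"], k=3))
```

For real read sets a dedicated counter is needed because the dictionary of k-mers quickly exceeds memory; the sketch only shows what the resulting histogram represents.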

Genome assembly and decontamination

To assemble HiFi reads into contigs, we used Hifiasm v 0.16.182 using default parameters with aggressive purging level 3 (-l 3) to filter erroneous and low-quality reads. We used QUAST v5.2.083 to assess assembly statistics, contiguity, and GC content (Supplementary Data 1). We used purge_haplotigs v1.1.284 to sequentially identify and remove haplotigs and re-assign allelic contig pairings to generate a deduplicated assembly. The pipeline uses a haplotype-resolved assembly to generate mapped read coverages and aligns contigs using Minimap2 v. 2.2185 to flag suspect and junk contigs for removal.
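For readers unfamiliar with the contiguity statistics reported by tools such as QUAST, the sketch below shows how N50 can be computed from a list of contig lengths. It is a minimal illustration with a hypothetical function name, not QUAST's code.

```python
def n50(contig_lengths):
    """Return N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")
```

Comparing N50 before and after haplotig purging gives a quick check that deduplication did not fragment the assembly.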

We screened the purged assembly for contamination from non-target sequences by examining GC-coverage plots in BlobTools v1.035. To assess read coverage, we aligned HiFi reads to the assembly using Minimap285 and sorted the aligned BAM file using samtools sort86,87. We used BLASTn88 with an E-value of 1e-25 against the NCBI nucleotide database to facilitate the taxonomic assignment of contigs. The resulting mapped files, coverage results, and BLAST output were used to generate BlobPlots35 (Supplementary Fig. 5). We identified and removed non-target sequences from the assembly after inspecting sequence coverage, proportion, and variation in GC content. Final assembly completeness was determined by repeating the BUSCO v5.3 calculation89,90 using the lepidoptera_odb10 database, and QUAST reports were compared to evaluate changes in assembly statistics after contamination removal (Supplementary Data 2).
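The GC axis of a GC-coverage plot can be reproduced with a few lines of Python. The sketch below computes per-contig length and GC fraction from FASTA-formatted lines; it is an illustrative stand-in with a hypothetical function name, not part of BlobTools, and it leaves coverage and taxonomic assignment aside.

```python
def gc_by_contig(fasta_lines):
    """Yield (contig_id, length, gc_fraction) for each record in an
    iterable of FASTA lines. Ambiguous bases are excluded from the
    GC denominator; soft-masked (lowercase) bases are counted."""
    def summarize(name, chunks):
        seq = "".join(chunks).upper()
        gc = seq.count("G") + seq.count("C")
        acgt = gc + seq.count("A") + seq.count("T")
        return name, len(seq), (gc / acgt if acgt else 0.0)

    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield summarize(name, chunks)
            # Keep only the identifier before the first whitespace.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield summarize(name, chunks)
```

Contigs whose GC fraction and coverage both depart from the main cluster are the ones that would merit inspection as potential contaminants.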

Mitogenome assembly

We assembled the mitochondrial genome for Macrosoma leucophasiata using MitoHiFi v3.291. Raw PacBio HiFi reads and contigs from the unpurged Hifiasm assembly were used in two separate runs. We used the mitogenome of Macrosoma conifera with accession number NC_05085636 as a reference mitogenome. MitoFinder v1.4.092 was used to annotate the final mitogenome (Supplementary Fig. 6).

Repeat region annotation

We generated a de novo library of repeat sequences using RepeatModeler237. We then soft-masked repeat regions using repetitive elements drawn from several lines of evidence: simple and short repeats, the elements initially identified by RepeatModeler2, and repeats sourced from the lepidopteran entries in Repbase93,94,95. We used the RMBLAST search within RepeatMasker v4.1.1 to complete soft-masking steps38.

Gene prediction and functional annotation

We used the automated BRAKER3 pipeline for gene structure prediction of protein-coding genes42,96,97,98,99,100,101,102,103,104,105,106,107,108. We ran BRAKER on the soft-masked assembly using arthropod protein sequences from the OrthoDB v11 protein database40. We followed the GeneMark-EP+ pipeline for a final gene model prediction by running both AUGUSTUS and GeneMark-EP100,101. By relying solely on orthologous protein evidence, we aimed to overcome artifacts encountered during the transcript-mapping step. We assessed completeness of predicted gene models by comparing BUSCO scores from the lepidoptera_odb10 database for each BRAKER3 run. The program gFACs v1.1.2109 was used to generate summary statistics and create gene model profiles (Supplementary Table 4).

Functional annotation of predicted genes from transcript sequences was performed using the eggNOG-Mapper v2.1.9 web server41. We used stringent parameters for the annotation by restricting the taxonomic scope to ‘Arthropoda’ and including annotations with ‘experimental evidence only’. A separate annotation file was generated using recommended default settings. Gene ontology (GO) terms were filtered and extracted from eggNOG database annotations (Supplementary Data 4).

Species tree and orthology inference

We constructed a butterfly-focused species tree using 3376 BUSCO single-copy orthologs from 20 representative Lepidoptera species with high-quality genomes. The 20 species included M. leucophasiata, 11 species from the six other butterfly families, and eight moth species, including one diurnal species each of Sesiidae and Zygaenidae. Specifically, we downloaded genome assemblies from GenBank, RefSeq, and Darwin Tree of Life to run BUSCO using the same settings as those that were applied for the M. leucophasiata genome (Supplementary Fig. 7). Each single-copy ortholog (amino acid sequence) was aligned using the “mafft-linsi” command in MAFFT v7.490110 and all genes were concatenated into a single alignment. The maximum likelihood (ML) tree was constructed using IQ-TREE v.2.0.3111 with the “Q.insect+FO + G4” substitution model112. Branch supports were assessed using 1000 replicates of SH-aLRT support and ultrafast bootstrap with the “bnni” option to avoid overestimation113,114. The ML tree was converted to an ultrametric tree using treePL115, and a divergence time estimation analysis was conducted with a single calibration point for the common ancestor of Papilionoidea (95% CI; 110.3 to 86.9 Ma) following the conclusions of a previous study on butterfly evolution27. The best parameters were identified using a prime function with a smoothing value of 100, and the resulting tree served as the basis for subsequent gene family evolution analyses115. We re-annotated the 19 soft-masked publicly available genome assemblies using BRAKER3, and the Augustus models generated were selected for downstream analyses. For all 20 gene models, we used the longest transcript of each gene to perform orthology analysis in OrthoFinder v2.5.2, and phylogenetic hierarchical orthogroups (HOGs) were used to represent orthologs116. The gene count matrix and ultrametric tree were used together to detect rapid repertoire size changes for each HOG in CAFE v5.0.043,117. The lambda parameter (gene family evolutionary rate) was estimated and two gamma rate categories (k = 2) were assigned. We report rapidly evolving genes for the M. leucophasiata branch with a stringent significance level (p = 0.01) and assigned their function using consensus functions in eggNOG-Mapper.
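The concatenation of per-gene alignments into a single supermatrix can be sketched as follows. This is an illustrative Python example under stated assumptions: alignments are held as dictionaries of taxon to aligned sequence, the function name is hypothetical, and gap-padding of absent taxa is included for generality even though single-copy orthologs would normally be present in every species.

```python
def concatenate_alignments(gene_alignments, taxa):
    """Join per-gene alignments (gene -> {taxon: aligned sequence}) into
    one supermatrix. Taxa missing from a gene are padded with gaps so
    that every row keeps the same length. Also returns 1-based partition
    coordinates (gene, start, end) for each gene block."""
    supermatrix = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, alignment in gene_alignments.items():
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = widths.pop()
        for taxon in taxa:
            supermatrix[taxon].append(alignment.get(taxon, "-" * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    joined = {taxon: "".join(parts) for taxon, parts in supermatrix.items()}
    return joined, partitions
```

The partition coordinates are kept because they allow per-gene models to be specified later if a partitioned analysis is preferred over a single substitution model.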

Vision gene evolution

We investigated the evolution of vision genes by comparing these genes in our nocturnal butterfly genome with those of other lepidopteran species (Supplementary Fig. 8). We used three lepidopteran species (Danaus plexippus, Heliconius melpomene, and Manduca sexta) that have been closely studied for their phototransduction-related genes34, and used them for comparison and interpretation. We downloaded phototransduction amino acid sequences for these species and used BLASTp to identify these genes from the BRAKER gene models. We applied a strict E-value of 1e-50 because high sequence similarity was expected for intraspecific orthologous gene blast118. For the other 17 species, genes from the same HOGs as the BLAST-identified phototransduction genes were considered orthologs. We also manually extracted putative phototransduction-related genes from all species by searching keywords from the eggNOG annotations (Supplementary Data 7). Keywords were selected based on eggNOG annotations of the BLASTp-determined phototransduction-related genes of the three benchmark species. To ensure orthology, we further blasted these sequences against the NCBI non-redundant (nr) protein database using DIAMOND v2.1.8 with an E-value of 1E-5042,119. Genes without phototransduction-related results among the top 50 hits were removed from the vision gene model (Supplementary Data 8).
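The E-value filtering described above can be illustrated with a short Python sketch that reads BLAST tabular output (the standard 12-column format, with E-value in column 11 and bit score in column 12) and keeps the top-scoring hit per query that passes the cutoff. The function name and the choice to retain one hit per query are assumptions made for illustration, not a description of the exact procedure used here.

```python
def best_hits(blast_tab_lines, max_evalue=1e-50):
    """Return {query: (subject, evalue, bitscore)} holding the
    highest-bit-score hit per query among hits whose E-value is at or
    below the cutoff. Expects tab-separated 12-column BLAST output."""
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```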

The downloaded genome assemblies used in this study were not all chromosomal or near-chromosomal level, and not all gene predictions had accompanying transcriptomic data (Supplementary Table 4). Thus, we corrected for the possibility of false gene duplication by applying three criteria. First, the sum of duplicated gene sequence lengths had to be within the range of mean ± standard deviation of other single-copy orthologs. Second, sequence similarities of duplicated genes had to be less than the mean pairwise sequence similarity of the ortholog. Third, genes had to be annotated in close tandem, within a span of five predicted genes. Sequence similarities (percent identities) were calculated using Clustal Omega v1.2.0120, and the falsely duplicated genes were assembled using EMBOSS v6.6.0121 with other orthologous sequences as reference. Falsely duplicated genes were removed from the gene count matrix which was visualized using the R package “phyloheatmap”122 (Fig. 1 and Supplementary Data 9). Finally, we clustered the genes of vision-related HOGs and aligned sequences of each family using the mafft-linsi command in MAFFT v7.505110 and constructed gene family trees using IQ-TREE v.2.0.3111. We applied the best substitution model selected by ModelFinder, and all other settings were kept the same as those applied to reconstruct the species tree46.
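The three criteria for flagging a false duplication can be expressed as a single predicate. The Python sketch below is one possible reading of those criteria and rests on several assumptions: the function and parameter names are hypothetical, the standard deviation is the sample standard deviation, and "a span of five predicted genes" is interpreted as all copies falling within a window of five consecutive gene positions.

```python
from statistics import mean, stdev

def looks_falsely_duplicated(copy_lengths, copy_gene_indices, copy_identity,
                             ortholog_lengths, ortholog_identities, max_span=5):
    """Return True when putative duplicates look like one gene split in two.

    copy_lengths        : sequence lengths of the putative duplicates
    copy_gene_indices   : ordinal positions of the copies along the contig
    copy_identity       : percent identity between the putative duplicates
    ortholog_lengths    : lengths of single-copy orthologs in other species
    ortholog_identities : pairwise percent identities among those orthologs
    """
    center, spread = mean(ortholog_lengths), stdev(ortholog_lengths)
    # Criterion 1: summed copy length matches a typical single-copy ortholog.
    length_ok = center - spread <= sum(copy_lengths) <= center + spread
    # Criterion 2: copies are less similar to each other than orthologs are.
    identity_ok = copy_identity < mean(ortholog_identities)
    # Criterion 3: copies sit in close tandem (assumed window of max_span genes).
    tandem_ok = max(copy_gene_indices) - min(copy_gene_indices) < max_span
    return length_ok and identity_ok and tandem_ok
```

The logic behind the first two criteria is that fragments of a single gene should add up to roughly one gene length and should share little sequence with each other, because they cover different parts of the protein.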

To explore the cause of missing innexin genes (see Results), we first identified the position of these innexin genes in two skipper species, and found them on the same chromosome (chr4 in T. sylvestris and chr6 in P. malvae). We used genes in these reference chromosomes to identify orthologous contigs from our M. leucophasiata assembly and drew a linear synteny plot using syntenyPlotteR in R with the genes shared among HOGs123. Missing innexin genes and 84 additional missing genes are located along a single DNA segment that is absent in our M. leucophasiata assembly. Additionally, 33 BUSCO genes were found in the corresponding region in T. sylvestris. To investigate whether this segment is absent due to segment loss or artifacts from sequencing or assembly, we performed BUSCO analyses on the Hifiasm primary contig (p_ctg.gfa) and unitig assemblies (p_utg.gfa).

Selection analyses

We tested for positive selection on the three opsin genes (i.e., BRh, LWRh, and UVRh) because some phototransduction-related genes, especially those in the opsin gene family, showed discordant topologies with the species tree (see Results). Testing for positive selection can provide additional clues into whether discordant gene tree topologies were driven by selection or were due to methodological limitations. We first built individual gene trees in IQ-TREE using the same approach as we used to construct gene family trees. We used the transfer bootstrap expectation (TBE) metric to assess branch support and resulting trees were viewed in FigTree v1.43 (http://tree.bio.ed.ac.uk/software/figtree/) (Supplementary Data 11). For BRh and UVRh, we used the approximately unbiased (AU) test and constrained butterfly genes to form a monophyletic group (and to be the sister-group of genes from nocturnal moths). The AU test was conducted to determine if the constrained tree would be rejected by the given gene alignment at a significance level of 0.05. The LWRh gene was excluded from these analyses due to its potential out-paralogs that were shared among tested species. Gene trees were reconciled with the rooted species tree (Fig. 1) using GeneRax124. We employed the undated DTL reconciliation model and SPR correction strategy in GeneRax to identify duplications, losses, and transfers. Reconciled gene trees were examined in the ThirdKind reconciliation viewer125 (Supplementary Data 13). Codon alignments (DNA sequences) were generated from aligned reference protein sequences using PAL2NAL v14126. These codon alignments, along with the reconciled opsin gene trees, were used to detect positive selection.
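The idea behind converting a protein alignment into a codon alignment can be shown in a few lines. The Python sketch below threads an unaligned coding sequence onto its aligned protein; it is a simplified illustration with a hypothetical function name, not PAL2NAL itself, and it assumes the coding sequence starts in frame and matches the protein residue for residue.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto its aligned protein: each
    residue becomes its codon and each protein gap becomes a three-base
    gap. Trailing bases such as a stop codon are ignored."""
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("CDS is shorter than the protein it should encode")
    codons, position = [], 0
    for symbol in aligned_protein:
        if symbol == "-":
            codons.append("---")
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)
```

Keeping gaps in multiples of three preserves the reading frame, which codon-based selection models require.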

We employed the adaptive branch-site random effects model (aBSREL44) in HyPhy127 and ran this analysis via the Datamonkey 2.0 webserver128. Foreground branches were defined by branches of nocturnal species (Fig. 1). The mixed-effects model of evolution (MEME45) was employed to detect episodic positive selection on individual sites, allowing for variation in selection pressures across different branches and sites in gene trees. Additionally, aBSREL was used to identify positive selection on specific branches, employing a random effects framework to model variable selection pressures. Branches detected to be under positive selection were identified using the likelihood ratio test (LRT) statistic and Bonferroni-Holm corrected p-values. We marked branches and nodes with signatures of positive selection on the gene family tree (Fig. 2 and Supplementary Table 2). We visualized and annotated the opsin gene family tree using the Interactive Tree of Life (iTOL) v. 6.5.7 tool129.
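The Bonferroni-Holm correction applied to branch-level p-values can be sketched as below. This is a generic Python implementation of the step-down procedure with a hypothetical function name; it is shown only to make the correction explicit and is not taken from HyPhy.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; adjusted values are capped at 1 and forced to be non-decreasing
    along the sorted order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        value = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, value)
        adjusted[index] = running_max
    return adjusted
```

A branch would then be reported as under positive selection when its adjusted p-value falls below the chosen significance threshold.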

Protein modeling and mapping positively selected sites

We obtained transmembrane helix predictions for all three opsins using Phobius130 implemented through Protter131 using the Hedylidae blue and UV opsin sequences. LW was modeled, but since there was a duplication and we were unsure which sequence was functional, we did not model its 3D structure. UV opsin and blue opsin sequences recovered all seven transmembrane domains, similar to the X-ray crystal structure of known invertebrate opsins and GPCRs, and we used these for the 3D protein structure modeling in Swiss-model132. We chose the jumping spider opsin X-ray structure as a template (6i9k), because it had the highest identity score and coverage (blue: GMQE = 0.66, identity = 36.26; UV: GMQE = 0.65, identity = 39.94). We included retinal in the model from the spider opsin structure and identified putative retinal binding sites, i.e., amino acids less than 4.0 Å away (Fig. 3A, B). 4.0 Å is the length of a weak hydrogen bond (the longest bond that opsin usually makes with retinal).
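The 4.0 Å criterion for putative retinal binding sites amounts to a distance search between protein atoms and ligand atoms. The Python sketch below illustrates that search on plain coordinate tuples; the function name, the input layout, and the residue numbers in the usage example are hypothetical, and parsing coordinates from a structure file is left out.

```python
from math import dist

def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return sorted (residue_number, residue_name) pairs that have at
    least one atom closer than `cutoff` angstroms to any ligand atom.

    protein_atoms : iterable of (residue_number, residue_name, (x, y, z))
    ligand_atoms  : iterable of (x, y, z)
    """
    ligand_atoms = list(ligand_atoms)
    hits = set()
    for number, name, xyz in protein_atoms:
        if any(dist(xyz, ligand_xyz) < cutoff for ligand_xyz in ligand_atoms):
            hits.add((number, name))
    return sorted(hits)

# Toy coordinates for illustration only; not taken from any real structure.
toy_protein = [
    (10, "GLU", (0.0, 0.0, 0.0)),
    (10, "GLU", (10.0, 0.0, 0.0)),
    (20, "TRP", (9.0, 9.0, 9.0)),
]
toy_retinal = [(3.0, 0.0, 0.0)]
nearby = residues_near_ligand(toy_protein, toy_retinal)
```

Positively selected sites can then be compared against this residue list to see whether any fall within the putative binding pocket.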

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.