Main

Parallel evolution is the repeated evolution of the same phenotype or genotype in different populations1,2,3. Although it is clear that organisms experiencing similar selective pressures often display parallel phenotypic changes4,5,6, less is known about the extent of parallel evolution at the genomic level. Theoretical studies have shown that the probability of parallel genetic evolution increases with the strength of selection, decreases with the number of possible alleles and is particularly likely to occur for genes with large phenotypic effects7,8,9. Therefore, architectures that limit genetic variability, or the independence of this variability, will reduce the number of possible adaptive solutions, thereby increasing the repeatability of evolution.

Chromosomal inversions have received much recent attention as important genomic architectures facilitating adaptation and speciation10,11,12. By capturing favourable alleles and suppressing recombination between them, inversions are thought to permit local adaptation in the presence of gene flow. However, because locally adapted alleles are tightly linked to the inversions, the range of adaptive responses is limited when populations encounter a new environment13. Moreover, the combined phenotypic effect of alleles within inversions is likely to be greater than that of individual alleles10,14. Therefore, inversions should increase the likelihood of parallel genomic evolution in different populations. Although inversions have been associated with traits and loci involved in parallel evolution in various taxa15,16,17, whether inversions have contributed disproportionately to parallel genetic evolution has not been formally tested.

We explored the role of inversions in parallel genetic evolution by studying two dune ecotypes of prairie sunflower, Helianthus petiolaris ssp. fallax, that inhabit active sand dunes in the Great Sand Dunes National Park and Preserve (GSD), Colorado, and around Monahans Sandhills State Park (MON), Texas18,19,20. The dunes at GSD are characterized by lower nitrogen levels and lower vegetation cover than the surrounding sand sheet18. The dunes at MON probably share these features but have not been thoroughly surveyed. The dune ecotype in GSD has been shown to have seeds that are three times heavier than the non-dune ecotype at the same location, a trait that contributes importantly to dune adaptation21. The plants from the dunes in MON also tend to have large seeds22, but the adaptive relevance of this trait has not been systematically investigated at this location. Demographic analyses have suggested recent origins of the dune ecotypes and substantial gene flow between dune and non-dune ecotypes in GSD and MON18,20. However, there is no evidence of direct gene flow between the two dune ecotypes. Previous studies have identified large inversions in H. petiolaris ssp. fallax segregating in one or both of the dune ecotypes (Supplementary Table 1) and found that several of these inversions are associated with seed size, flowering time and soil fertility22,23, suggestive of a role of inversions in controlling locally adapted traits. Here, we used diverse comprehensive datasets across the habitats to examine the extent of parallelism and how inversions respond to similar environments and contribute to phenotypic shifts in the two dune ecotypes (Extended Data Fig. 1). We paired newly collected habitat information with newly generated genotyping-by-sequencing (GBS) data across populations in MON and performed genotype–environment association (GEA) analyses to test whether inversions were associated with shared selective pressures in MON and GSD. 
We then systematically investigated seed size of dune and non-dune populations in MON and conducted quantitative trait locus (QTL) mapping to identify the genetic regions underlying seed size differences in both GSD and MON. Last, we made use of the available whole-genome sequencing (WGS) data for wild sunflowers and applied population genomics approaches to identify selective sweeps (loci under positive selection) shared by the dune ecotypes to test the hypothesis that inversions contribute disproportionately to parallel genetic evolution. The phylogenetic relationships of the shared regions were also examined to understand the source(s) of genetic variation that contribute to parallel genetic evolution.

Results and discussion

Independent origins of the dune ecotypes

To assess patterns of genomic divergence of the dune ecotypes, we used a WGS dataset of 259 H. petiolaris from previous studies22,24. In a principal component analysis (PCA) based on 387,619 unlinked single-nucleotide polymorphisms (SNPs) in this dataset, dune and non-dune samples from the same location clustered together, implying independent origins of the dune plants (Fig. 1a). This pattern was also supported by a neighbour-joining tree of the samples (Supplementary Fig. 1).

Fig. 1: PCA of genomic and habitat data for dune and non-dune ecotypes of H. petiolaris ssp. fallax.

a, PCA of 259 samples of H. petiolaris ssp. fallax using WGS data. The shared non-dune samples are samples from non-dune populations located more than 100 km away from either location. b, PCA of the habitat characteristics for each site. The arrows represent the relative contributions of percent ground cover (Cover), percent ground cover made up of grass (Grass), nitrogen (N), calcium (Ca), phosphorus (P), magnesium (Mg) and potassium (K) for each site measured.

Parallel habitat shifts in dunes

We quantified environmental parallelism between the dune and non-dune habitats at each location. In contrast to the genetic data, in which samples clustered by location, PCA on all seven soil and vegetation-related variables across the two locations revealed that the first axis of environmental variance (PC1) separated dune and non-dune habitats (Fig. 1b). PC2 mostly separated the dune sites from the two locations, but the non-dune sites clustered together along this axis. At both locations, non-dune sites had significantly higher levels of plant cover, total nitrogen, phosphorus and calcium. In addition, the non-dune sites at MON, but not GSD, had significantly higher levels of grass cover, potassium and magnesium (Fig. 1b and Supplementary Table 2).

Parallelism in GEA

To compare GEA patterns in the dunes at MON with previous results for the samples at GSD23, we generated new GBS data for a wide collection of samples (Supplementary Table 3) and performed SNP calling and GEA analyses for eight variables: percentage cover, percentage cover that is grass, percentage cover of sunflowers, nitrogen, phosphorus, potassium, magnesium and calcium23 (Methods). In the GEA of MON samples, we found multiple large genomic regions with consistently high association, most of which overlapped with the inversions (Supplementary Figs. 2–9).

We compared the GEA results from MON with those in GSD for the same variables. To assess genomic parallelism, we summarized the results of SNPs in non-overlapping windows of 0.025 cM and compared the observed parallel windows against randomly permuted null distributions25 (Methods). A total of 123 ecotype-associated windows (Nexp = 11.8297; P < 1 × 10−5) were shared between the two locations (Extended Data Fig. 2). All of the parallel windows were found within inversions pet09.01 and pet11.01. For individual environmental variables, we observed significantly more parallel windows than expected for five environmental variables (plant cover, grass cover, total nitrogen, phosphorus and potassium; Extended Data Fig. 2).

To determine the possible role of inversions themselves in the evolution of the dune ecotypes, we also genotyped all 11 inversions found in H. petiolaris ssp. fallax using a supervised machine learning algorithm and performed GEA analyses in which putative inversions were treated as single biallelic loci. We used inversion genotypes from 427 WGS samples of H. petiolaris22 as training sets and extracted all shared SNPs between the GBS samples of GSD and MON and the WGS samples within each inversion as predictor variables. This circumvented the issue of low SNP density in the window-based analysis. In total, we obtained 48–377 overlapping SNPs between the GBS and the WGS dataset for each inversion in GSD and 51–631 SNPs in MON. After training, the out-of-bag error rates of the models were generally lower than 5%, except for pet07.01 and pet17.03 in GSD, for which the numbers of overlapping SNPs were small (Supplementary Table 4).

Results from the GEA of inversions were generally correlated well with those for the SNPs inside them (Supplementary Figs. 2–9). When comparing the results for MON and GSD, we found that three inversions (pet05.01, pet09.01 and pet11.01) were associated with dune ecotypes in both GSD and MON (Fig. 2 and Supplementary Fig. 2). However, five inversions were associated with dune ecotypes in only one of the locations: pet07.01, pet17.01 and pet17.03 with dunes at GSD; and pet10.01 and pet14.01 with dunes at MON (Fig. 2 and Supplementary Fig. 2). For individual environmental variables, the three inversions that were associated with dune ecotypes in both locations were also associated with cover, nitrogen and phosphorus in both locations, whereas these inversions were associated with potassium and magnesium only in MON (Fig. 2 and Supplementary Figs. 2–9). These results were consistent with the PCA of environmental variables, in which cover, nitrogen and phosphorus contributed to the parallel habitat shift of the dune and non-dune habitats, whereas potassium and magnesium only varied across habitat types at MON, suggesting parallel responses of chromosomal inversions to shared selective pressures. Calcium, another variable that showed parallel shifts between the dune and non-dune habitats, was associated with pet05.01 at both locations but with pet09.01 only at GSD and pet11.01 only at MON, probably owing to generally weaker associations with this variable, possibly coupled with genotyping error of the inversions. Besides these parallel associated inversions, five other inversions that were associated with dunes in only one location were also associated with several or all environmental variables at the corresponding location. For instance, in GSD, pet07.01 was associated with cover, nitrogen, phosphorus, potassium and magnesium; and pet17.01 and pet17.03 were consistently correlated with cover variables in GSD. 
For MON, pet10.01 probably contributes to adaptation to low nitrogen, potassium and magnesium content in the dunes, whereas pet14.01 was strongly associated with all variables tested.

Fig. 2: GEA of the inversions in H. petiolaris ssp. fallax.

For each variable, coloured circles indicate significant associations of the inversions in GSD (red), MON (blue) or both (purple). Names of inversions are consistent with those used in ref. 22.

Parallel genetic changes underlying seed size differences

We examined whether similar shifts in seed weight had taken place in dune ecotypes in MON by investigating plants in natural populations (Supplementary Table 3) and in a common garden. Dune seeds were on average 3.9 mg heavier than non-dune seeds (likelihood ratio (LR) = 30, d.f. = 1, P < 0.001) across both experiments (ecotype × experiment interaction was not significant; LR = 2.4, d.f. = 1, P = 0.12), and seeds collected from natural populations were 2.4 mg heavier than those collected under common greenhouse conditions (Fig. 3; LR = 12, d.f. = 1, P < 0.001). The substantial difference in seed size in dune versus non-dune populations of MON was consistent with the seed size difference in GSD21, suggesting that natural selection is responsible for larger dune seeds21,25.

Fig. 3: The weight of seeds produced by MON plants in natural populations and grown in a common garden.

Each point represents the average seed weight of an individual plant based on at least five seeds. Lines with error bars represent the mean and 95% confidence intervals of the seed weight produced by plants from each habitat type grown under different conditions.

We made two mapping populations for each location by reciprocally crossing a large-seeded dune individual with a small-seeded non-dune individual (Supplementary Tables 3 and 5). In all cases, F1 plants produced intermediate seed sizes, and each F2 mapping population produced a roughly normal distribution spanning most of the variation between the grandparent populations (Extended Data Fig. 3). Cytoplasm type did not have a detectable effect on seed size except for the GSD1 F2s, for which there was an extremely minor effect (Extended Data Fig. 3 and Supplementary Table 6).

We analysed F2 samples using a selective genotyping approach26 and found many genomic regions that were overrepresented in large-seeded plants (Extended Data Fig. 4), consistent with seed size being highly polygenic27,28,29,30,31,32. These regions were more likely to be from the dune parent than the non-dune parent in each mapping population (P < 0.001; Extended Data Fig. 4). The strongest signals were generally found within inversions, such as pet11.01 in cross GSD2 and pet14.01 in cross MON2 (Extended Data Fig. 4); these inversions are likely to be large-effect QTL that maintain trait divergence in the face of substantial gene flow33. We also inferred the inversion genotypes of the samples using the same algorithm as in GEA analyses. Despite fewer overlapping SNPs between the GBS and the WGS data and higher out-of-bag error rates than those in GEA analyses (Supplementary Table 7), we found that pet09.01 and pet11.01 showed significant frequency differences between large-seeded samples and small-seeded samples in MON1, whereas pet07.01, pet10.01 and pet14.01 were detected in MON2. In GSD mapping populations, pet11.01 was the only inversion that showed a significant association (Extended Data Fig. 4).

By summarizing genotypes in windows of 1,000 variants and conducting randomization tests, we found that QTL with significant seed size associations fell into the inversions more than expected by chance in all four mapping populations (Extended Data Figs. 4 and 5), and that more significant QTL windows were shared than expected by chance in pairs within and between locations (Table 1 and Extended Data Figs. 4 and 5). The extent of parallelism within versus outside the inversions varied for each pair of mapping populations from different locations. However, in three of the four pairs, there was significantly more parallelism within versus outside the inversions (Table 1 and Extended Data Fig. 6), suggesting that the inversions (in particular, pet05.01, pet09.01, pet11.01 and pet14.01) disproportionally contribute to parallel shifts in seed size.

Table 1 The extent of shared windows between seed size QTL in pairs of mapping populations

Although inversions displayed the strongest signals in our analysis of seed size, the levels of associations for most inversions varied among the crosses. There are multiple explanations for this. First, although several inversions probably contained genes for seed size, selective genotyping may have biased QTL detection towards the largest effect allele that was segregating in each population. A second issue is that the F2 mapping populations were monomorphic for certain inversions (Extended Data Fig. 4). This was because the crosses were made before we were aware of the existence of segregating inversions. However, even for monomorphic inversions, we sometimes found an association with seed size. For example, in MON1, SNPs within the region of pet05.01 showed significant frequency differences between large-seeded and small-seeded samples, even though the inversion was not segregating in the mapping population. Thus, some of the loci for seed size could still make substantial contributions without the presence of inversions, as has been demonstrated previously in Helianthus34. Regardless of the cause, we still found more shared QTL windows within the inversions.

Disproportionate contribution of inversions to parallel adaptive divergence

We identified selective sweeps in each of the dune ecotypes using three approaches that take advantage of different characteristics of genetic polymorphisms. For the composite likelihood ratio (CLR) statistic computed using SweepFinder2 (ref. 35), we obtained a total of 3,909 outlier regions in GSD and 8,960 in MON. Using the other two approaches that compared dune and non-dune populations, we found 7,159 and 7,742 outlier regions of reduction of diversity (ROD) and 5,907 and 4,569 outlier regions of FST, in GSD and MON, respectively. We found 652 and 851 selective sweep regions supported by all three statistics in GSD and MON, averaging 145,339 base pairs (bp) and 46,515 bp, respectively.

After searching for regions that were shared between the two locations, we identified 56 parallel regions across these independently derived ecotypes. Among the parallel regions, 44 regions covering 38.7 Mbp were found to reside within known inversions (Fig. 4), with several large regions covering substantial proportions of inversions pet11.01, pet17.01 and pet17.03, and a number of scattered regions found in inversion pet09.01. This is consistent with previous findings of high sequence divergence associated with inversions in the dune ecotypes22,23,36. By comparing the number and length of parallel regions within inversions with the null distribution from permutations, we also found significant enrichment of parallel regions in three inversions that showed parallel divergence between GSD and MON (pet05.01, pet09.01 and pet11.01; P < 1 × 10−4 for number of regions, and P = 0.024 for total length). The enrichment remained significant for number of regions when testing on all inversions (P < 1 × 10−4), but not for total length (P = 0.17). After linkage disequilibrium (LD) pruning with a 95% threshold of the genomic null distribution, we kept 37 regions and still found significant enrichment of parallel regions in pet05.01, pet09.01 and pet11.01 (P < 1 × 10−4 for number of regions, and P = 0.003 for total length) and across all inversions (P < 1 × 10−4 for number of regions, and P = 0.018 for total length). This indicates a disproportionate contribution of these three inversions to parallel genomic divergence of GSD and MON.
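The search for regions shared between locations can be illustrated with a simple interval intersection. This is a minimal sketch, not the pipeline used in the study: the function name `intersect_regions`, the tuple layout `(chromosome, start, end)` and the use of closed intervals are all assumptions made for illustration, and the coordinates in the usage note are invented.

```python
def intersect_regions(regions_a, regions_b):
    """Return the overlapping parts of two sets of genomic regions.

    Each region is a (chromosome, start, end) tuple with start <= end.
    Intervals are treated as closed (an assumption for this sketch).
    """
    # Index the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom = {}
    for chrom, start, end in regions_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, start, end in regions_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            lo, hi = max(start, b_start), min(end, b_end)
            if lo <= hi:  # the two intervals overlap
                shared.append((chrom, lo, hi))
    return sorted(shared)
```

For example, with hypothetical sweep regions `[("chr11", 100, 500), ("chr09", 10, 20)]` at one location and `[("chr11", 400, 900), ("chr09", 30, 40)]` at the other, only the chromosome 11 segment from 400 to 500 is returned as a parallel region.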

Fig. 4: Parallel regions in GSD and MON.

Selective sweep regions and overlapping regions (parallel regions) are indicated by bars of different colours: red, GSD; blue, MON; purple, parallel regions. Parallel regions of less than 1 Mbp are represented by thicker bars for better visualization. Black bars show the locations of large non-recombining haplotype blocks (mainly due to inversions) identified in ref. 22 that are polymorphic within H. petiolaris ssp. fallax.

A general caveat for interpreting our results is that because of the block inheritance of inversions, parallel adaptation on any gene within an inversion could leave an inversion-wide signal. This could increase the likelihood of detecting the signature of parallelism, as well as the number and length of parallel regions37. Although this may have contributed to the large inversion signal reported here, there were also several large clusters of swept regions outside the inversions seen in each location, such as those found near the middle of chromosomes 2 and 10 in GSD, and in the middle of chromosome 12 and at the distal end of chromosome 16 at MON (Extended Data Fig. 7). This clustering could be viewed as evidence of ‘divergence hitchhiking’, in which new locally adapted mutations establish preferentially in physical linkage to already diverged genes, thereby generating genomic islands of differentiation38,39,40. However, there may be more prosaic reasons for these clusters. The mid-chromosome swept regions in GSD were in areas of low recombination, which probably explains their large size. The clusters of swept regions on chromosomes 12 and 16 were coincident with a translocation between the two chromosomes, which appears to be polymorphic at MON41. Significantly, unlike inversions, the low recombination regions on chromosomes 2 and 12 (and elsewhere in the genome) did not show a strong signature of parallelism. Likewise, despite the large number of chromosome translocations differentiating sunflower taxa41, translocations do not appear to have contributed to parallel genetic evolution in the dune systems analysed here. Thus, inversions stand out as one of the most influential genetic architectures facilitating parallel genetic evolution.

Despite the disproportionate contribution of inversions to parallel genetic evolution overall, not all showed signals of parallel adaptive divergence. Although inversion pet05.01 was found in both dunes and displayed associations with all variables in MON in GEA analysis, its frequency was relatively lower in the MON dunes (Extended Data Fig. 8); thus, it did not show a detectable signal in the WGS dataset, which had fewer samples from this region. Likewise, pet14.01 increased significantly in frequency in the dunes in one location but only marginally in the other (Extended Data Fig. 8). The inversion clearly facilitates ecological divergence in MON according to GEA, QTL and selective sweep analyses but was at very low frequency in GSD22,23. The lack of parallelism for these inversions might be due to environmental differences between GSD and MON such as relative magnesium and potassium content (Fig. 1b), dune size, dune height, temperature and biotic communities. These differences may generate differences in selective pressures on these inversions between the two sites. Another possibility is that the inversion haplotypes found in GSD and MON no longer share the same adaptive variants, leading to different responses to selection.

Sources of parallel regions

Knowing the source(s) of genetic variation giving rise to parallel sweeps is important to interpret patterns of genomic parallelism. We conducted phylogenetic analyses for 12 parallel regions that contain sufficient numbers of SNPs for reliable phylogenetic reconstruction (6 within and 6 outside of inversions). For most regions outside of inversions, the dune samples from each site were closest to those from local non-dune populations, and the alleles found in the dunes likely originated from intraspecific genetic variation (Fig. 5). This suggests that the alleles found in the dunes were independently drawn from the non-dune populations in each of the dune ecotypes. In some regions outside of inversions, such as a region found on chromosome 7, the dune samples at MON showed deep divergence with other samples from the same species, suggesting the dune allele may come from introgression from another species, while the one at GSD arose from within H. petiolaris (Fig. 5).

Fig. 5: Topological weighting of parallel regions.

a–l, Ha412HOChr09:107701884–116410572 (a), Ha412HOChr11:5890343–5916951 (b), Ha412HOChr11:59695130–62373169 (c), Ha412HOChr16:140217167–142747183 (d), Ha412HOChr17:19185748–19215393 (e), Ha412HOChr17:190415052–191878407 (f), Ha412HOChr02:82397894–82927898 (g), Ha412HOChr04:91990564–139990182 (h), Ha412HOChr06:101167874–106607943 (i), Ha412HOChr07:68115435–68247437 (j), Ha412HOChr12:96918451–97184453 (k), Ha412HOChr13:105437209–105575210 (l). Regions are grouped by their positions relative to inversions. The tested scenarios regarding the source(s) of dune-adapted variation are displayed on the right: habitat, indicative of parallel genotypic evolution (blue); geography, indicative of non-parallel genotypic evolution (green); other (purple); within-species diversity (red); introgression/ancient variation (orange); subspecies introgression (yellow). The genomic positions of the parallel regions (red triangles) and the inversions (black bars) are shown at the bottom.

For all regions within inversions (except pet16.02), we consistently found strong support for parallel evolution of the same dune haplotypes in GSD and MON (Fig. 5). In the parallel regions, the dune haplotypes of the inversions were often found to group with H. petiolaris ssp. petiolaris, suggesting the existence of this adapted dune haplotype in the other subspecies. The results of the region in inversion pet17.03 indicated a more ancient origin of the dune haplotypes (Fig. 5). This was consistent with previous discoveries that the dune haplotypes of these inversions probably originated via introgression with an unknown and possibly extinct donor species22. Regardless of the donor, the same haplotypes of these inversions were targeted by selection at the two sites, indicating parallel evolution of the inversions from standing genetic variation. The dune-adapted inversions were probably introduced to the species before the establishment of the GSD and MON populations, segregating at low frequency in the non-dune habitat22 and raised to high frequency in parallel by natural selection in the dunes. Selected alleles at the same loci in different populations could have multiple independent mutational origins42,43,44 or could arise by selection on the standing variation present in the ancestral populations45,46. In contrast to independent de novo mutation, adaptation from standing genetic variation is likely to be rapid47. The retention of inversion polymorphisms within H. petiolaris might have provided a directly available pool of adapted alleles for the dune environment, which promoted the reuse of these inversions in each site. As these inversions may represent ancient introgressions22, these alleles might have been pretested by selection in past environments, further increasing the probability of parallel evolution.

Conclusions and future directions

Frequent reuse of inversions during the repeated colonization of dune habitats suggests strong biases and constraints in adaptive evolution. Through comprehensive sampling across habitats and genetic mapping, our GEA and QTL analyses demonstrated that these inversions contain locally adapted alleles of multiple ecologically relevant traits. By maintaining genotypic combinations at loci affecting adaptation, inversions limit the genetic options for local adaptation, thereby increasing the potential for parallel genetic evolution13. Furthermore, the combined phenotypic effect of alleles in an inversion is likely to be larger than that of individual alleles, increasing the likelihood of inversions being established by natural selection and maintained under gene flow–selection balance9,14,48. In our GEA and QTL analyses, the strongest associations were usually found at SNPs within inversions or with the inversions themselves, indicating that these inversions are major contributors to adaptation to low vegetation cover and lack of nutrients in the dunes, consistent with theoretical predictions of parallel genetic evolution. Overall, we show that inversions are more strongly involved than the rest of the genome in GEAs, the genetic basis of a locally adapted phenotype and selective sweeps in the dune ecotypes. Together, these results show that inversions are disproportionally responsible for the parallel adaptation of sunflowers to two dune habitats.

Notably, our phylogenetic analyses showed that although shared sweeps within inversions were mostly parallel in the strictest sense (that is, swept haplotypes within the inversions derived from the same ancestral source), this was generally not the case for shared sweeps outside of the inversions, which mostly derived from different variants found in the local non-dune population (Fig. 5). This suggests that we might be underestimating the extent of parallelism within inversions compared with other regions of the genome.

Work in other systems has also found evidence that inversions play a role in parallel adaptation (for example, refs. 49,50,51). However, this is not a universal pattern52,53. For systems where parallel reuse of inversions in repeated adaptation has been observed, analyses similar to those that we conducted here would offer stronger support for the importance of inversions in adaptive evolution. When compared across systems, inversions appear to be particularly abundant and relevant to local adaptation in systems such as sunflowers, Littorina snails54, mice55 and monkeyflowers56, in which there are sharp ecological transitions and ongoing gene flow, consistent with theoretical predictions10.

Our study represents a first step towards understanding the evolutionary and ecological factors that contribute to parallel genetic adaptation. We recommend that future studies estimate the strength of selection on parallel swept regions in the two dune ecotypes using a combination of modelling9,57 and empirical approaches (for example, ref. 25). Studies that dissect the genetic architectures of other ecologically relevant traits58,59, including identification of the causal genes and mutations, would illuminate both the cause(s) of parallel evolution of adaptive traits and the particular evolutionary and demographic histories of the genetic variation underlying parallel adaptation45. Lastly, dune adaptation has occurred many times in Helianthus, involving half a dozen different species or subspecies60. Extending the approach taken in the present study to these other systems would be likely to lead to further insights into the factors that facilitate or constrain parallel genetic evolution.

Methods

Habitat analysis

Habitat characteristics for GSD populations were from ref. 18. In 2015, we visited ten non-dune and six dune populations within Monahans Sandhills State Park, Texas, and two dune populations outside the park (Supplementary Table 3). We established a transect through each population and picked five sites for habitat analysis at even intervals along those transects. At each site, we took a photograph of a 0.65 m2 quadrat and used ImageJ61 to determine the proportions of vegetative cover and grass cover. Soil samples were taken from 25 cm deep, dried at 60 °C for 24 h and pooled from the five sites at each population. Available phosphorus and exchangeable potassium, magnesium and calcium were measured at A&L Eastern Laboratories and total nitrogen content was determined by Micro-Dumas Combustion (NA1500, Carlo Erba Strumentazione) at the University of Georgia Analytical Chemistry Laboratory. For each habitat trait at each location, we fit linear models in R v.4.0.3 (ref. 62) to determine whether ecotype was a significant explanatory variable. To see the effects of all variables at once, we imputed missing data (a maximum of three individuals had missing data for any variable) using R package missMDA63 and conducted PCA on scaled variables using the prcomp function.

GEA analyses

We collected mature seeds from 20 individuals at each MON population mentioned above and germinated seeds from 10–12 individuals per population, making a total of 195 samples for GEA analysis (Supplementary Table 3). We extracted DNA from these samples using DNeasy Plant kits (Qiagen) and prepared a GBS library following a modified version of the ref. 64 protocol19. The library was paired-end sequenced on an Illumina HiSeq. We called variants against the HA412-HOv2 Helianthus annuus reference genome (ref. 65) and filtered the SNPs for downstream analyses using the parameters reported in ref. 23, except that individual genotypes with depth less than 5 were set as missing owing to the lower depth of this library. Samples with excessive missing data (>50%) were removed from the dataset, leaving 174 samples for downstream analyses.
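The two filtering thresholds described above (genotypes with depth below 5 set as missing; samples with more than 50% missing data removed) can be sketched as follows. This is an illustrative sketch only: the function name `mask_and_filter` and the input layout (a dictionary of per-SNP read depths for each sample) are assumptions, and the sketch considers only missingness caused by low depth, which may differ from how missingness was tallied in the actual pipeline.

```python
def mask_and_filter(depths, min_depth=5, max_missing=0.5):
    """Mask low-depth genotypes and drop samples with too much missing data.

    depths: {sample_name: [read depth at each SNP]} (hypothetical layout).
    Returns {sample_name: [True if the genotype is kept, False if missing]}
    for samples whose missing fraction does not exceed max_missing.
    """
    kept = {}
    for sample, per_snp in depths.items():
        # A genotype is retained only if its depth reaches the minimum.
        called = [depth >= min_depth for depth in per_snp]
        missing_fraction = 1 - sum(called) / len(called)
        # Samples are removed only when missingness is strictly above the cutoff.
        if missing_fraction <= max_missing:
            kept[sample] = called
    return kept
```

With the defaults, a sample with depths `[10, 6, 2, 8]` is kept with one masked genotype, whereas a sample with depths `[1, 2, 3, 9]` is removed because 75% of its genotypes fall below the depth threshold.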

The GEA analysis was conducted using BayPass v.2.1 (ref. 66). Population structure was estimated by choosing 1,000 random SNPs in low LD (R2 < 0.1) and running BayPass in core model mode. The covariance matrix from this analysis was used as a control for population structure. We then ran BayPass under the standard covariate model using the default importance sampling estimator approach. Each environmental variable was scaled using the -scalecov option. Owing to missing data for total nitrogen in population MON001, the analysis was run separately for total nitrogen and other variables. We also simulated pseudo-observed data (POD) with 1,000 SNPs using the ‘simulate.baypass’ function, and analysed the newly created POD for each environmental variable as described above. SNPs that had a BF above the top 1% quantile of those of the POD were treated as associated SNPs. To identify genomic parallel divergence between dune and non-dune ecotypes in GSD and MON, we also coded the populations according to their ecotypes and ran BayPass by treating the ecotypes as covariates. No scaling was performed for the binary variable.
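The POD-based cutoff described above can be sketched in a few lines: the threshold is the 99th percentile of the Bayes factors obtained from the simulated data, and observed SNPs above it are treated as associated. This is a minimal sketch, assuming the Bayes factors have already been extracted from the BayPass output. The function names, the dictionary layout and the linear-interpolation quantile are assumptions for illustration, not part of BayPass.

```python
def empirical_quantile(values, q):
    """Empirical q-quantile (0 <= q <= 1) with linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def associated_snps(observed_bf, pod_bf, top_fraction=0.01):
    """Return the POD threshold and the SNPs whose BF exceeds it.

    observed_bf: {snp_id: Bayes factor} for the real data (hypothetical layout).
    pod_bf: list of Bayes factors from the pseudo-observed data.
    """
    threshold = empirical_quantile(pod_bf, 1.0 - top_fraction)
    hits = [snp for snp, bf in observed_bf.items() if bf > threshold]
    return threshold, hits
```

The same function applies unchanged to window-averaged Bayes factors, because only the values compared with the POD quantile differ.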

We then compared the GEA results of MON and GSD for shared variables. To test for parallelism between locations, we divided the chromosomes into non-overlapping windows of 0.025 cM and calculated the average BF value for each window by summarizing the SNPs within a range of 0.5 cM. Windows with a BF above the top 1% quantile of the corresponding POD were treated as associated windows. As a null distribution, we randomly drew N 0.025 cM windows from each location, where N is the count of associated windows for each location, and then assessed the overlap of randomly selected associated windows between the locations over 10,000 iterations26.
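A minimal sketch of the window-overlap randomization follows. It assumes that both locations share one grid of windows identified by integer indices, and it computes a one-tailed empirical P value with a +1 correction; both choices, along with all names, are assumptions made for illustration.

```python
import random


def overlap_permutation_test(n_windows, assoc_a, assoc_b, iterations=10_000, seed=0):
    """Test whether two sets of associated windows overlap more than expected.

    n_windows: total number of windows on the shared grid (assumed identical
               for both locations).
    assoc_a, assoc_b: sets of associated window indices for each location.
    Returns (observed_overlap, empirical_p_value).
    """
    rng = random.Random(seed)
    observed = len(set(assoc_a) & set(assoc_b))
    grid = range(n_windows)
    at_least_as_large = 0
    for _ in range(iterations):
        # Draw the same number of windows as were associated at each location.
        draw_a = set(rng.sample(grid, len(assoc_a)))
        draw_b = set(rng.sample(grid, len(assoc_b)))
        if len(draw_a & draw_b) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (iterations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy example: 20 associated windows per location, 10 of them shared.
    loc_a = set(range(0, 20))
    loc_b = set(range(10, 30))
    print(overlap_permutation_test(1000, loc_a, loc_b, iterations=2000))
```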

Seed size analysis

To investigate whether similar shifts in seed size had taken place in the dune ecotype in MON, we weighed groups of five seeds collected from 16–21 individuals from the same populations used in the GEA analyses above (Supplementary Table 3). To explore phenotypic plasticity, we grew 1–4 plants from each of two dune and three non-dune populations that had been collected previously (Supplementary Table 3) under standard greenhouse conditions and weighed 6–20 individual seeds produced by these plants. We fit a linear mixed-effects model using R package nlme67 to describe mean individual seed weight across both datasets. The model included ecotype, experiment (wild collected versus common garden) and their interaction as fixed effects and population as a random effect. We identified significant model terms by comparing nested models using likelihood ratio tests.

QTL mapping

We made two mapping populations for each location by reciprocally crossing a large-seeded dune individual with a small-seeded non-dune individual (Supplementary Table 3). As sunflowers are self-incompatible, we grew up several F1 seeds produced by each of these crosses and crossed pairs of reciprocal F1s. This resulted in F2 seed lots with dune and non-dune cytoplasms for each of the four cross types (Supplementary Table 5). We grew up 186–325 F2 plants of each cross-by-cytoplasm type under our standard greenhouse conditions (Supplementary Table 5). In addition, we grew a few plants from wild collected seeds (half siblings of the grandparent plants) and F1 seeds (full siblings of the parent plants) used to make the F2 seed lots. F2s were mainly open pollinated, and groups of five seeds per plant were weighed.

Using a selective genotyping approach26, we genotyped the 12 plants that produced the heaviest seeds and the 12 plants that produced the lightest seeds from each of the eight cross-by-cytoplasm types, for a total of 192 plants. Genotyping was done using the methods described above, and biallelic variants were filtered for heterozygosity < 95%, minor allele frequency > 15% and coverage > 75%. We also used the parents of each mapping population to determine which allele came from the large-seeded dune parent. For each mapping population, we performed Fisher’s exact tests at each locus to determine whether there was a higher proportion of dune alleles in large-seeded samples than in small-seeded samples. We used Bonferroni corrections based on the total number of variants and the number of independent regions in the dataset as estimated by PCA68 to define significance thresholds.
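The per-locus test can be sketched as a one-sided Fisher's exact test on a 2×2 table of allele counts, followed by a Bonferroni threshold. Whether alleles or individuals were counted is not stated above, so the table layout here is an assumption, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: large-seeded, small-seeded samples.
    Columns: dune allele, non-dune allele (counting alleles is an assumption).
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / total


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Toy counts: 12 plants per tail give 24 alleles per row.
    p = fisher_exact_greater(20, 4, 6, 18)
    print(p, p < bonferroni_threshold(0.05, 1000))
```

The same function applies whether the number of tests is taken as the total number of variants or as the number of independent regions; only the `n_tests` argument changes.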

We tested for parallelism in seed size QTL by determining the most frequent genotype (homozygous dune, homozygous non-dune or heterozygous) in windows of 1,000 variants for each mapping population and performing Fisher’s exact tests on these windows as above. This yielded 1,548 genotyped windows in all four mapping populations. We randomized the results of these tests 10,000 times and counted the number of windows significantly associated with larger seeds (P < 0.05) in pairs of mapping populations and in inversions. We did a second set of randomizations across the windows inside and outside inversions separately and counted the numbers of significant windows shared across pairs of mapping populations. This allowed us to determine the extent of parallelism inside and outside the inversions and whether there was a significant difference between those rates.
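As an illustration of the windowing step, the sketch below reduces one plant's ordered genotype calls to the most frequent genotype per window of 1,000 variants. It assumes that genotypes are coded as strings, that missing calls are `None`, and that windows do not need to respect chromosome boundaries; none of these details is specified above.

```python
from collections import Counter


def window_modal_genotypes(genotypes, window_size=1000):
    """Return the most frequent genotype in each consecutive window.

    genotypes: genotype codes for one plant, ordered along the genome,
               e.g. "DD" (homozygous dune), "NN" (homozygous non-dune),
               "DN" (heterozygous); None marks a missing call.
    Windows with no called genotypes yield None. Ties are resolved in
    favour of the genotype encountered first in the window.
    """
    modes = []
    for start in range(0, len(genotypes), window_size):
        called = [g for g in genotypes[start:start + window_size] if g is not None]
        modes.append(Counter(called).most_common(1)[0][0] if called else None)
    return modes


if __name__ == "__main__":
    calls = ["DD", "DD", "DN", None, "NN", "NN", "NN"]
    print(window_modal_genotypes(calls, window_size=4))
```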

Genotyping and parallelism analyses of inversions

We also performed GEA and QTL analyses in which inversions were treated as single biallelic loci. For each of the GBS samples in the aforementioned GEA and QTL analyses, we determined their genotypes at all 11 inversions found in H. petiolaris ssp. fallax (Supplementary Table 1) using a supervised machine learning algorithm, with SNPs and inversion genotypes from 427 WGS samples of H. petiolaris22 as training sets. We extracted shared SNPs between the GBS and WGS samples within each inversion as predictor variables. The inversion genotypes of the WGS samples were marked as categorical variables and used as the response variable. A random forest model was constructed for each inversion using the function ‘randomForest’ in R package randomForest69. The number of decision trees was set to 151, and the number of randomly sampled variables for each tree was determined by traversing different numbers of SNPs and choosing the value with the lowest prediction error rate in the training set. The trained model was subsequently used to infer inversion genotypes in the GBS samples.

We performed GEA analyses for the inversions using the same procedure for SNPs as described above. We also reran inversion GEA analyses for the GSD samples using the new selection of inversions and inversion genotypes determined by the random forest method. Inversions undergoing parallel divergence between dune and non-dune ecotypes in GSD and MON were identified using the procedures mentioned above. Inversions that had a Bayes factor (BF) above the top 1% quantile of those of the POD were treated as significantly associated. To determine associations between the inversions with seed size, we treated them as single biallelic loci and conducted Fisher’s exact test analysis using the procedure described above for SNPs.

WGS and SNP calling

To assess genome-wide patterns of adaptive divergence, we used WGS samples and an SNP dataset for H. petiolaris from previous studies22,24. The previous intraspecific SNP set contained 7,107,746 biallelic SNPs across 259 WGS samples of H. petiolaris ssp. fallax, including 11 dune ecotype samples from GSD, 27 dune ecotype samples from MON, and 221 non-dune samples from the sand sheet below the dunes and from other populations of H. petiolaris ssp. fallax (Supplementary Table 8).

For analyses that required polarization of the SNPs or outgroups, we generated another variant dataset for the species using all available samples for H. petiolaris, as well as samples from four outgroup perennial sunflowers (664647_GIG from Helianthus giganteus, DEC_1895 from Helianthus decapetalus, DIV_1956 from Helianthus divaricatus, and GRO_2043 from Helianthus grosseserratus; ref. 22). Sequences of these samples were aligned to a reference genome of H. annuus (Ha412-HOv2; ref. 65) using NextGenMap v.0.5.3 (ref. 70). For each sample, a GVCF file was produced with ‘HaplotypeCaller’ in GATK v.4.1.4.1 (ref. 71) with parameter --heterozygosity 0.01. After individual variant calling, all samples of H. petiolaris and the four perennial samples were jointly genotyped using GATK’s ‘GenomicsDBImport’ and ‘GenotypeGVCFs’ on 1 Mbp genomic regions without transposable elements, and the raw VCF chunks were then gathered by chromosome using ‘GatherVcfs’. Variant quality score recalibration was performed using the top 20 samples with the highest sequencing depth in H. petiolaris as a ‘gold set’. The raw variants were filtered to remove sites with extremely high heterozygosity (ExcessHet z-score < −4.5), and the gold set was then applied against this filtered set of variants to produce recalibration models for SNPs and indels using ‘VariantRecalibrator’. The 90% tranche for each cohort was selected based on these recalibration models using ‘ApplyVQSR’. After variant quality filtration, the variants were further filtered to retain only biallelic SNPs with minor allele frequency >0.01 and genotyping rate >50% across all samples. This generated a total of 21,836,165 biallelic SNPs. By comparison with the outgroup species, we obtained information on ancestral states for about 58% of all SNPs. Unpolarized sites that were monomorphic within the target group were discarded, leaving a total of 5,996,827 sites in GSD and 9,405,793 in MON for input into SweepFinder2.

Population structure and phylogenetic tree

To investigate sample relationships, we conducted PCA and phylogenetic analysis using the SNPs from WGS of all dune and non-dune plants. The SNPs were pruned with an LD threshold of 0.2, and PCA was then conducted using the R package SNPRelate72. A neighbour-joining tree was constructed from the same SNP dataset using MEGA 10.1.6 (ref. 73) with the maximum composite likelihood method. The neighbour-joining tree was visualized with the R package ggtree74.

Identification of selective sweeps

Selective sweeps in each of the dune ecotypes were identified using three approaches that take advantage of different characteristics of genetic polymorphisms following selective sweeps.

First, we employed the composite likelihood ratio (CLR) statistic, which contrasts the likelihood of the null hypothesis based on the genome-wide site frequency spectrum with the likelihood of a model where the site frequency has been altered by a recent selective sweep. The CLR statistic was computed using SweepFinder2 (ref. 35), which includes invariant (fixed) sites and accounts for the effects of negative selection on diversity75.

In the analyses of each dune ecotype, we excluded samples from within 100 km of the other dune ecotype to remove their impact on sweep detection. SNPs of the dune samples (target group) were extracted and polarized by comparison with the perennial samples. For each SNP, the outgroup allele was inferred as ancestral state if all four perennials were called and homozygous and the allele matched one of the alleles of the target group; otherwise the site was defined as unpolarized. Unpolarized sites that were monomorphic within the target group were discarded. All unpolarized sites that were polymorphic within the target group were kept and assigned folded = 1 in the input files for SweepFinder2. We generated the allele frequency input files for SweepFinder2 using a custom script, and the recombination rate of each position was calculated based on an integrated genetic map for cultivated sunflower22. Empirically derived allele frequency spectra across all chromosomes were calculated using the -f option of SweepFinder2. CLRs were then calculated using the recombination map and precomputed empirical spectrum with a grid size of 2 kb. No correction for background selection was conducted because genome-wide estimation of background selection has not yet been conducted in sunflowers.
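The polarization rules above can be expressed as a small decision function. This is a sketch only: it assumes that the outgroup samples must agree on a single allele, which the text implies but does not state, and the data layout and function name are hypothetical rather than taken from the custom script.

```python
def polarize_site(outgroup_genotypes, target_alleles):
    """Classify one SNP for SweepFinder2-style input.

    outgroup_genotypes: one (allele, allele) tuple per outgroup sample,
                        or None where the sample was not called.
    target_alleles: set of alleles observed in the target (dune) group.
    Returns None if the site is discarded, otherwise a tuple
    (ancestral_allele_or_None, folded_flag).
    """
    ancestral = None
    all_called_homozygous = all(
        gt is not None and gt[0] == gt[1] for gt in outgroup_genotypes
    )
    if all_called_homozygous:
        outgroup_alleles = {gt[0] for gt in outgroup_genotypes}
        # Assumption: all outgroup samples must carry the same allele.
        if len(outgroup_alleles) == 1:
            candidate = next(iter(outgroup_alleles))
            if candidate in target_alleles:
                ancestral = candidate
    if ancestral is not None:
        return ancestral, 0  # polarized site, unfolded
    if len(target_alleles) < 2:
        return None  # unpolarized and monomorphic: discarded
    return None, 1  # unpolarized but polymorphic: kept as folded


if __name__ == "__main__":
    hom_a = ("A", "A")
    print(polarize_site([hom_a] * 4, {"A", "G"}))
    print(polarize_site([hom_a, None, hom_a, hom_a], {"A", "G"}))
```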

As SweepFinder2 only examines sequence polymorphism within populations, some outlier regions that it identifies might come from broader selective sweeps (that is, regional or species-wide sweeps). Therefore, the other two approaches we employed to detect sweeps permitted comparisons between dune and non-dune populations to filter for those restricted to the dune populations.

A well-known characteristic of hard sweeps is a reduction in sequence diversity in the selected region and at linked neutral sites76,77. Thus, for each site, sequence diversity (π) was calculated for the dune and non-dune groups separately using VCFtools v.0.1.14 (ref. 78) with 20 kb sliding windows and a step size of 2 kb. The reduction of diversity (ROD)79,80 in the dunes was then estimated by summarizing the ratio of π for the dune group to that for the non-dune group (π_dune/π_non-dune). Another common feature of selective sweeps is an increase in genetic differentiation in the selected region and closely linked SNPs81,82. Therefore, we used VCFtools to calculate Weir and Cockerham’s FST (ref. 83) using 20 kb sliding windows and a step size of 2 kb to estimate genetic divergence between the dune and non-dune groups.

For each of the three statistics mentioned above, the top 5% of genomic windows were classified as outliers, and groups of adjacent outlier windows were merged into regions. Finally, outlier regions that were identified by SweepFinder2 and overlapped with regions identified by both ROD and FST methods were considered selected regions.
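The outlier and intersection logic can be sketched as follows. For simplicity, the sketch assumes that all three statistics are reported on one shared grid of integer window indices and that each statistic has been oriented so that larger values indicate a stronger sweep signal; the actual analyses used a 2 kb grid for CLR and 20 kb sliding windows for ROD and FST, so coordinates would need harmonizing first. All names are illustrative.

```python
def top_fraction_outliers(scores, fraction=0.05):
    """Return the window indices with the highest scores.

    scores: dict mapping window index -> statistic (larger = stronger signal,
            an orientation the caller is assumed to have applied).
    """
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def merge_adjacent(indices):
    """Merge consecutive window indices into inclusive (first, last) regions."""
    regions = []
    for i in sorted(indices):
        if regions and i == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], i)
        else:
            regions.append((i, i))
    return regions


def overlaps(a, b):
    """True if two inclusive (first, last) regions share at least one window."""
    return a[0] <= b[1] and b[0] <= a[1]


def selected_regions(clr_regions, rod_regions, fst_regions):
    """Keep CLR regions that overlap both a ROD region and an FST region."""
    return [
        r for r in clr_regions
        if any(overlaps(r, x) for x in rod_regions)
        and any(overlaps(r, x) for x in fst_regions)
    ]


if __name__ == "__main__":
    clr = merge_adjacent({1, 2, 3, 7, 20, 21, 22, 23, 24, 25})
    rod = [(3, 5), (22, 23)]
    fst = [(2, 2), (7, 7), (24, 30)]
    print(selected_regions(clr, rod, fst))
```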

Regions of parallel evolution and enrichment analyses

Genomic regions with outlier values in population genomics calculations may result from the same selective sweep owing to genetic linkage. To account for clustering of regions with signatures of selective sweeps, we extracted the SNPs in each selected region and calculated the pairwise LD between SNPs from different windows using PLINK v.1.9 (refs. 84,85). The LD between regions was summarized as the average R² between SNPs from each window. We also generated genome-wide null distributions of LD for the same physical distance. Specifically, we randomly selected 10,000 pairs of 5 kb windows separated by the same physical distance, and the LD between these windows was summarized using the same approach. Adjacent outlier regions were merged if the LD between windows exceeded the 95th percentile of the null distribution (P < 0.05) for the same physical distance.

After merging based on LD, overlaps of the selective sweep regions in the two dune ecotypes were identified using BEDTools86, and these overlapping regions were treated as regions involved in parallel evolution (hereafter ‘parallel regions’). After identification of the parallel regions, we merged the regions again based on the LD between them, using the method mentioned above, and the genetic map distance between them. Clusters of regions were merged into a single larger region if the LD between the closest regions was greater than the 95th percentile of the null distribution and the map distance was less than 5 cM. We also pruned for LD based on the same criteria and kept only the largest parallel region for each LD cluster.

To test whether inversions showed a stronger pattern of parallel evolution than non-inverted regions, we used a randomization approach to find the numbers and sizes of regions overlapping between parallel regions and chromosomal inversions and to estimate the significance of this overlap. Specifically, we held the position of the parallel regions constant and randomized the position of the inversions across the genome 10,000 times. For each permutation, we summarized the numbers and lengths of the regions that overlapped with inversions and built null distributions. Significance of enrichment was estimated using a one-tailed test by comparing the observed values with these null distributions. Using this statistical test, we asked whether the observed overlap between parallel sweeps and inversions was greater than would be expected by chance. The test was conducted on the parallel regions before and after merging based on LD and genetic map distance.
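A minimal sketch of this randomization is given below. It treats the genome as a single linear coordinate system with half-open intervals and places each inversion uniformly at random while preserving its length. How chromosome boundaries and overlaps among shuffled inversions were handled is not described above, so those simplifications, the +1 correction in the P value and all names are assumptions.

```python
import random


def overlap_stats(regions, inversions):
    """Count regions overlapping any inversion and sum the overlapping length.

    Intervals are half-open (start, end). If shuffled inversions overlap each
    other, shared bases are counted once per inversion in this sketch.
    """
    count = 0
    length = 0
    for r_start, r_end in regions:
        covered = 0
        for i_start, i_end in inversions:
            covered += max(0, min(r_end, i_end) - max(r_start, i_start))
        if covered > 0:
            count += 1
        length += covered
    return count, length


def enrichment_test(regions, inversions, genome_length, iterations=10_000, seed=0):
    """One-tailed randomization test holding regions fixed and moving inversions."""
    rng = random.Random(seed)
    obs_count, obs_length = overlap_stats(regions, inversions)
    ge_count = 0
    ge_length = 0
    for _ in range(iterations):
        shuffled = []
        for start, end in inversions:
            size = end - start
            new_start = rng.randint(0, genome_length - size)
            shuffled.append((new_start, new_start + size))
        count, length = overlap_stats(regions, shuffled)
        ge_count += count >= obs_count
        ge_length += length >= obs_length
    return {
        "observed_count": obs_count,
        "observed_length": obs_length,
        "p_count": (ge_count + 1) / (iterations + 1),
        "p_length": (ge_length + 1) / (iterations + 1),
    }


if __name__ == "__main__":
    parallel = [(100_000, 101_000), (110_000, 111_000), (120_000, 121_000), (900_000, 901_000)]
    inv = [(100_000, 150_000)]
    print(enrichment_test(parallel, inv, 1_000_000, iterations=2000))
```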

Phylogeny of parallel regions

To estimate the source(s) of genetic variation giving rise to parallel sweeps, we conducted phylogenetic analyses on 12 merged parallel regions containing more than 100 SNPs. For each region, we chose five samples from each dune population, five samples from non-dune populations in proximity to each dune site, five samples from H. petiolaris ssp. petiolaris and two perennial samples as the outgroup. We imputed and phased the SNPs using Beagle87 and divided the genome into 20 kb windows. For each genomic window, we randomly chose one haplotype for each individual and calculated a maximum likelihood phylogeny using IQ-TREE88. We then used Twisst89, which takes into account phylogenetic position variation among individuals within a species and calculates topological weighting for each possible gene tree, to quantitatively determine the source of dune-adapted variation. We tested different scenarios in which dune haplotypes are shared or derived from adjacent non-dune populations (Fig. 5) and also explored whether dune haplotypes in GSD and MON came from within-species diversity, H. petiolaris ssp. petiolaris or a more distant source (Fig. 5). We extracted all 20 kb windows that had >50% overlap with each parallel region, normalized the weights such that each parallel region had a total weight of 1 and counted the total weight for each topology.
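The final filtering and normalization step can be sketched as below. The sketch assumes that topology weights per 20 kb window have already been obtained from Twisst and loaded into plain dictionaries, and that overlap is measured as the fraction of each window covered by the region. The input layout and function names are hypothetical.

```python
from collections import defaultdict


def overlap_fraction(window, region):
    """Fraction of a half-open (start, end) window covered by a region."""
    shared = max(0, min(window[1], region[1]) - max(window[0], region[0]))
    return shared / (window[1] - window[0])


def summarize_topology_weights(windows, min_overlap=0.5):
    """Sum topology weights so that each parallel region contributes 1 in total.

    windows: list of dicts with keys
        "region_id": identifier of the parallel region,
        "window":    (start, end) of the 20 kb window,
        "region":    (start, end) of the parallel region,
        "weights":   dict mapping topology name -> weight for this window.
    """
    kept = [w for w in windows if overlap_fraction(w["window"], w["region"]) > min_overlap]
    region_totals = defaultdict(float)
    for w in kept:
        region_totals[w["region_id"]] += sum(w["weights"].values())
    totals = defaultdict(float)
    for w in kept:
        denom = region_totals[w["region_id"]]
        if denom == 0:
            continue  # region with no informative windows
        for topology, weight in w["weights"].items():
            totals[topology] += weight / denom
    return dict(totals)


if __name__ == "__main__":
    toy = [
        {"region_id": "A", "window": (0, 20_000), "region": (0, 40_000),
         "weights": {"t1": 3.0, "t2": 1.0}},
        {"region_id": "A", "window": (20_000, 40_000), "region": (0, 40_000),
         "weights": {"t1": 1.0, "t2": 3.0}},
        {"region_id": "B", "window": (100_000, 120_000), "region": (105_000, 130_000),
         "weights": {"t1": 2.0}},
    ]
    print(summarize_topology_weights(toy))
```

Normalizing within regions prevents long parallel regions, which span many windows, from dominating the genome-wide tally of topologies.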

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.