Introduction

The genus Anastrepha encompasses over 300 identified species, several of which are pests of great economic importance in the Neotropical region1. As with several other Tephritidae, species identification in this genus is hampered by a general reliance on female morphology2. Other strategies have been tried with mixed results, particularly in more closely related species3,4,5, but more emphasis has been given to the use of genetic data to investigate their phylogenetic relationships, using different genetic regions with variable success6,7,8,9,10,11,12. Recent phylogenetic and phylogenomic studies using genomic and transcriptomic data have clarified phylogenetic relationships across different taxonomic levels, such as among species groups, and even among more closely related species in the fraterculus group, which harbors most of the pestiferous species in the genus13,14. These data have been analyzed employing a combination of methods, from concatenated supergenes to coalescence and congruence analysis of individual gene trees, to explore genetic variability across multiple loci13. The use of multiple methods has revealed that a combination of evolutionary factors, such as introgression, recent and rapid divergence, and high levels of ancestral polymorphism, limits our ability to infer phylogenetic relationships among Anastrepha13,15,16.

The differentiation between ancestral polymorphism and introgression is not trivial17, but recent phylogenomic studies have indicated that recent divergence and introgression may complicate species identification and recognition in Anastrepha13,14. As more genomic data have emerged, it has become clearer that introgression has a much larger role in differentiation across several different taxa18,19,20,21. This may have happened because of historical and/or recurring events, and may even be involved with species adaptation to different selective pressures22, a process that can increase pest adaptability to new environments but can also affect phylogenetic inferences and the understanding of their diversity23. This may be one of the reasons why different genetic markers have produced discordant phylogenetic inferences in Anastrepha, particularly among more closely related species in the fraterculus group, and has limited our ability to tackle the differentiation of several lineages in A. fraterculus s.l., which is considered to be a species complex24,25,26,27. The ability to investigate a greater number of specimens across their distribution may be essential to help investigate species boundaries and the evolutionary forces involved in their differentiation, but the cost of whole-genome analyses for a geographically comprehensive sampling may still be prohibitive.

The use of a reduced subset of markers for phylogenetic analyses, or phylogenomic subsampling, has emerged as a valuable approach28,29,30,31,32 to offset the costs of producing and analyzing large datasets32, particularly for population studies. The focus on a carefully selected subset of genes may enable targeting phylogenetically informative loci that help resolve contentious or unstable nodes while mitigating confounding factors such as saturation, alignment gaps33,34, or even introgression and ancestral polymorphism35,36. Different criteria have been used in subsampling protocols aimed at identifying reliable loci, including information quantity (e.g., alignment length or data completeness) and quality (e.g., phylogenetic signal or resistance to systematic biases). These approaches vary widely, with some focusing on evolutionary rates, favoring loci with fast, slow, or intermediate rates, or using rate-homogeneous partitions to reduce noise37,38. Others have employed the parameter referred to as phylogenetic informativeness (PI) to predict loci likely to resolve relationships accurately39,40, or Internode Certainty (IC), which indicates loci that properly infer correct phylogenetic relationships41. Different strategies42 have used these criteria individually, or combined, to achieve adequate subsampling, but the results have been conflicting32,33.

Finding a reduced set of markers capable of discriminating Anastrepha species, particularly among closely related taxa in the fraterculus group, may provide an invaluable tool for their understanding, monitoring, and control. The proper identification of these important pests at the species level is crucial for taking appropriate pest control measures. However, for many species, identification based on morphological characteristics requires highly trained taxonomists, and in some cases, the presence of intermediate or conflicting morphological characteristics may complicate this task43. Using a reliable and cost-effective molecular strategy can help taxonomists make confident decisions for individuals where morphological identification is ambiguous. Additionally, in cases where traditional methods are insufficient, molecular techniques may be the sole means of species discrimination, such as for males or individuals at early developmental stages42.

Previously, we have used over 3,000 loci to investigate phylogenetic relationships among species and lineages of Anastrepha and suggested that subsets of ~ 30 informative loci could be used for species identification in Anastrepha13. Here we test three independent protocols to purposefully subsample genomic/transcriptomic data to identify a potentially minimal set of loci that are both phylogenetically informative and robust against the effects of introgression and ancestral lineage sorting. The identification of a small subset of informative loci is critical for a broader analysis across large populations, but also to help design cost-effective strategies for species identification in Anastrepha, paramount for the development of tools to monitor and control several important pest species that belong to this genus.

Results

The Anastrepha dataset investigated here is derived from the reanalysis of whole genome sequencing, complete genome assemblies, or transcriptomes from 36 specimens of 15 Anastrepha species, which included seven species groups and 17 samples from the A. fraterculus complex across South America and Mexico, as well as five species of other tephritid genera used as outgroups (Supplementary Table S1). This dataset consisted of 3,170 clusters of orthologs that had an average of 1,432 bases per cluster, providing a total of 4,551,036 bases and about 20.5% missing data for the ingroup. The fraterculus dataset was composed of a subset of the Anastrepha dataset, including 13 species of the fraterculus group, A. psidivora, as well as A. striata and A. bistrigata as outgroups. This set was composed of 3,168 clusters of orthologs with an average of 1,545 bases per cluster, totaling 4,895,310 bases and about 21% missing data. We explored the lack of phylogenetic signal at different parts of the topology to investigate which evolutionary forces might be influencing this pattern, following the flowchart delineated in Fig. 1.

Fig. 1

Workflow of procedures used to evaluate three reduced sets of loci for their phylogenetic informativeness in studying Anastrepha fruit flies. In this framework, we reanalyzed a dataset previously produced on Anastrepha13 to produce a set of orthologs and three subsets of 20/37 loci using different methods. The Info subset chose the 20/37 genes with the largest number of phylogenetically informative sites. The sgC60 subset picked the 20/37 highest-scoring genes among those with sCF (site concordance factor) values above 60% and average gCF (gene concordance factor) values above 50%. The third subset was produced from Sortadate44, choosing the 20/37 genes with the highest values for a combination of the largest bipartition support and minimization of tip-to-root variation. The full dataset was analyzed using ASTRAL from individual ML gene trees and ML analysis from a concatenated dataset to produce a reference species tree, which was used as a yardstick against which species trees produced from these different subsets were contrasted using several different parameters. The investigation of phylogenetic signals, concordance levels, selection patterns, saturation, evolutionary rates, and levels of introgression is used to evaluate how effective each subset is at producing phylogenetically informative species trees for the study of Anastrepha at different evolutionary levels.

Species tree inference and concordance analyses

The topologies of the species trees produced using different methodological approaches (such as super-matrix via concatenation and multispecies coalescence) derived from the Anastrepha dataset (encompassing all Anastrepha samples and outgroups) as well as the fraterculus dataset (restricted to samples from the fraterculus species group, A. psidivora, and two outgroups) were highly congruent with each other and with previous analyses13. Figure 2 shows the ASTRAL inference derived from individual gene topologies, and the maximum likelihood tree of the full concatenated dataset is shown in Supplementary Fig. S1. The respective results for the more reduced fraterculus set are shown in Supplementary Figs. S2 and S3. Most clades associated with potentially valid taxa had high support (100% bootstrap for the concatenated approach and 1.0 PP for the multispecies coalescent approach), despite also showing high levels of phylogenetic conflict across different genes. This pattern is reflected in congruence values (measured by gCF, sCF, or quartet support) smaller than 50%, which tend to decrease when contrasting more closely related taxa; deeper nodes therefore generally had greater support and more concordance across different markers. This pattern is found in both datasets, but since their results are very similar, we show here the PhyParts tree based on the whole Anastrepha dataset (Fig. 3), and the fraterculus PhyParts tree as Supplementary Fig. S4.

Fig. 2

Multispecies coalescent species tree of Anastrepha species and five other Tephritidae species as outgroups based on 3,170 genes inferred in ASTRAL. Phylogenetic supports (bootstrap / gene concordance factor / quartet support) are shown on the nodes.

Fig. 3

Summary of gene concordance and conflict based on PhyParts analysis against the species tree inferred on ASTRAL based on the Anastrepha dataset. Pie charts show gene concordance (blue), conflict (green = one dominant alternative; red = other conflicting trees), and non-informative (gray) at each node.

Though in general there is an inverse relationship between congruence and evolutionary distance, there were some shallower nodes with high support as well, even among some intraspecific nodes. Because the topologies produced from both datasets do not vary from what previous analyses have shown, where a more detailed analysis of these relationships was presented13, we will only emphasize here aspects that are enriched by the congruence analyses. The main point of discordance between different datasets and methodologies entails the separation of clades III, IV, and V of A. fraterculus in western South America and A. turpiniae. Both datasets produced basically the same topology among these taxa when comparing concatenated ML and ASTRAL inferences, although ASTRAL separated A. fraterculus Clades III and IV. Furthermore, the fraterculus dataset ML tree separates these different west South American A. fraterculus clades, but places A. turpiniae as a sister taxon.

The lack of resolution among lineages in western South America is reflected in the congruence analyses, which show 60–70% discordance across different genes for most of these relationships and quartet supports between 2% and 20% (Fig. 2). That is similar to what is observed across the fraterculus species group, with gCF support values for several different species failing to exceed 35%. In fact, the A. fraterculus complex exhibited considerable variation in gCF support, ranging from 11 to 70% for different lineages, some of which may represent different species6,13. The support for the fraterculus species group, on the other hand, is very robust (75% of the genes supporting this lineage), with A. psidivora as sister taxon to the fraterculus group. As we compare more distantly related taxa, the level of congruence across different genes tends to increase, being around 48% for the comparisons across species groups and reaching over 90% for the separation of genera, placing Anastrepha as more closely related to Rhagoletis than to Ceratitis, Bactrocera, and Zeugodacus; this result is corroborated by other phylogenetic analyses45 as well as by the current taxonomy, which places the former two genera in Trypetinae, whereas the latter are in Dacinae.

Despite high levels of incongruence across different loci on the genome, indicated by sCF and gCF values below 50% in several nodes, the consistency across datasets and methodologies provides support for the lineages we observed, which also correlates well with genera, species groups and species (even in the fraterculus complex). We explored these results by investigating whether we could find a smaller assortment of genes that might be more phylogenetically informative and congruent than the average loci as measured by multiple parameters.

Phylogenetic inferences and the search for informative genes

We employed three different methods to create subsets of informative markers: highest number of parsimony-informative sites, over 60% of sites with high sCF and gCF values, and SortaDate, which considers bipartition support and minimization of tip-to-root variation; we refer to these as “Info”, “sgC60”, and “Sortadate”, respectively (see Methods for details). For each method, we also compared two different numbers of markers, 37 and 20 (see Methods for rationale), which is reflected in dataset names (e.g., “Info_37”). The three strategies chosen to find a reduced set of informative genes (see list of genes and some parameters in Supplementary Table S2) produced similar, but not fully congruent, topologies in general. Observed discrepancies involved closely related samples from the same species, especially with sets of 20 genes rather than 37, which can be visualized in the cophylo analyses (Fig. 4). Even though all major phylogenetic relationships were recovered in each of the subsets when compared to the full dataset (as can be seen in Fig. 4), tests of tree compatibility show that there is a significant difference in their likelihood of explaining the full dataset using the Shimodaira-Hasegawa (SH), weighted SH, Expected Likelihood Weight, or approximately unbiased tests (Supplementary Table S3), with the exception of sgC60_37. As before, these results were very similar when comparing the two main datasets, so we chose to present information only for the Anastrepha dataset, considering that the markers chosen should provide information across different divergence spectra in the genus, not only within the fraterculus species group.
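As a rough illustration of the “Info” criterion described above, the sketch below counts parsimony-informative sites per locus and keeps the top-ranked loci. It is a minimal sketch under stated assumptions, not the pipeline used in this study: the function names (count_informative_sites, top_loci), the toy alignments, and the choice to ignore gaps and ambiguity codes are all hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")  # assumption: gaps and ambiguity codes are ignored


def count_informative_sites(sequences):
    """Count alignment columns with >= 2 states that each occur in >= 2 sequences."""
    informative = 0
    for column in zip(*sequences):
        counts = Counter(base.upper() for base in column if base.upper() in VALID_BASES)
        shared_states = [base for base, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            informative += 1
    return informative


def top_loci(alignments, n):
    """Return the n locus names with the most informative sites (ties broken by name)."""
    scores = {name: count_informative_sites(seqs) for name, seqs in alignments.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]


# Hypothetical toy alignments (one list of aligned sequences per locus).
toy_alignments = {
    "locusA": ["ACGT", "ACGA", "TCGA", "TCGT"],
    "locusB": ["AAAA", "AAAT", "AAAA", "AAAA"],
    "locusC": ["AC-T", "AC-T", "GCNT", "GCAT"],
}
selected = top_loci(toy_alignments, 2)
```

In practice the same ranking would be applied with n set to 20 or 37; the tie-breaking rule used here is arbitrary.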

The “Info” subsets, which chose the genes with the highest number of informative characters, selected loci that were missing in some specimens, leading the Info_20 and Info_37 datasets to have a smaller number of taxa when compared to the full dataset (Figs. 4C and F, 5D and 6D, contrasted against 5A and 6A). Overall, however, the topologies of Info_20 and Info_37 were compatible with that of the full dataset, and the missing taxa (A. hadracantha, A. leptozona, A. curitis, A. striata, A. psidivora, and a few A. fraterculus) were associated with lower congruence values in the full dataset results. The other two strategies, Sortadate and sgC60, produced topologies that represented all samples and had less phylogenetic discordance across loci (the remaining contrasts in Figs. 4, 5, and 6, namely Figs. 4A, 4D, 5B, and 6B for sgC60 and 4B, 4E, 5C, and 6C for Sortadate).

Fig. 4

Visualization of phylogenetic congruence between the topology produced by ASTRAL inferences from the full dataset (on the left of each cophylo tree) and each of the three gene subsets of 20 and 37 loci. (A–C) show the cophylo trees using the sets of 20 loci with the highest sC > 60%, inferred from Sortadate, and number of informative characters, respectively. (D–F) show the cophylo trees using the sets of 37 loci with the highest sC > 60%, inferred from Sortadate, and number of informative characters, respectively. Blue lines indicate matching taxa on contrasting trees.

The percentage of genes supporting a specific topology or specific nodes varied, but was generally higher in the subsets of informative genes compared to the whole datasets (Table 1; Figs. 5 and 6). Around 33% of the loci were informative and compatible with the topology produced by the full dataset, which had 38 total nodes. The subsets are represented by different sets of loci, with only three loci in common between sgC60 and Sortadate. Loci were also distributed across all chromosomes (Supplementary Fig. S5), although six of the seven markers on the X chromosome (scaffold NC_071503.1) came from the Info subsets. All subsets had a higher number of genes supporting each node, at the expense of the number of nodes resolved, and this was more pronounced in the smaller subsets of 20 genes (Table 1). The Sortadate strategy produced, on average, the best results, with the highest number of nodes recovered (16 of the 17 nodes that separate species and 36 of 37 nodes in general) by the largest number of genes (over 52%, compared to 32% for the full dataset), though sgC60 also recovered a high number of nodes (15 of 17 species nodes and 33 overall). Both strategies only failed to resolve the phylogenetic relationships among the west South American A. fraterculus clades and A. turpiniae. While the Info subsets had the highest number of loci supporting their inference (> 50%), they missed several nodes and recovered only nine of the 17 species nodes. As support levels varied across different lineages and different subsets, we explored other parameters to understand specific patterns and processes associated with the sets of selected loci at different levels in the hierarchy (Supplementary Fig. S6 shows nodes that were used to represent separation of species, species groups, and genera).
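The per-node percentages discussed above can be pictured with a simplified calculation: for a given branch of the species tree, count the fraction of gene trees that contain the same bipartition. The sketch below is an illustration only and is not how PhyParts or gCF were computed here; it assumes every gene tree has the full taxon set (so every gene tree is decisive), and the names canonical_split and percent_supporting, as well as the toy gene trees, are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that lacks the reference (first-sorted) taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def percent_supporting(species_side, gene_trees, taxa):
    """Percent of gene trees containing the species-tree bipartition.

    gene_trees is a list of gene trees, each given as a list of bipartition sides.
    Assumption: all gene trees share the same taxon set.
    """
    target = canonical_split(species_side, taxa)
    hits = 0
    for sides in gene_trees:
        splits = {canonical_split(side, taxa) for side in sides}
        if target in splits:
            hits += 1
    return 100.0 * hits / len(gene_trees)


# Hypothetical example: five taxa, four gene trees, each listed by its internal splits.
taxa = {"A", "B", "C", "D", "E"}
gene_trees = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "C"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"B", "C"}, {"A", "D"}],
]
support_ab = percent_supporting({"A", "B"}, gene_trees, taxa)
```

Real concordance factors also account for gene trees with missing taxa, which this sketch deliberately leaves out.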

Fig. 5

Summary of concordance and conflict inferred on ASTRAL and plotted with PhyParts and PhyPartsPieCharts. (A) all 3,170 genes; (B) 20 highest sgC60; (C) 20 highest genes inferred from Sortadate; (D) 20 genes with highest number of informative characters (ASTRAL). Pie charts indicate number of genes which support the node (blue), support the most frequent alternative (green), support any of the other alternatives (red), and failed to be informative for that node (gray).

Fig. 6

Summary of concordance and conflict inferred on ASTRAL and plotted with PhyParts and PhyPartsPieCharts. (A) all 3,170 genes; (B) 37 highest sgC60; (C) 37 highest genes inferred from Sortadate; (D) 37 genes with highest number of informative characters (ASTRAL). Pie charts indicate number of genes which support the node (blue), support the most frequent alternative (green), support any of the other alternatives (red), and failed to be informative for that node (gray).

Association of phylogenetic parameters to reduced subsets

We investigated several phylogenetic and evolutionary parameters across genes included in our subsets, summarized in Table 1 and visually represented in Fig. 7. In general, these parameters indicate that the genomic regions surrounding informative loci are evolving under purifying selection, indicated by average low values of omega (ω) for both the Anastrepha and fraterculus datasets (Table 1; Fig. 7A and E, and Supplementary Fig. S7). The fraterculus dataset, which compared more closely related taxa, had slightly higher average ω than the Anastrepha dataset (0.10 and 0.06, respectively), which could indicate saturation. However, a separate estimation of saturation between the datasets suggests that saturation more strongly affected the more divergent taxa and did not impact the phylogenetic signal (Table 1; Fig. 7C and G, and Supplementary Fig. S8). Additionally, average evolutionary rates were higher in Tephritidae than when comparing more closely related taxa in Anastrepha (Table 1; Fig. 7B and F, and Supplementary Fig. S9), indicating that the higher saturation values are not enough to quash the association between evolutionary distance and change. It is relevant, though, that all subsets had higher average coalescent time values than the Anastrepha dataset, indicating that these subsets are more effective at separating more closely related lineages. This is also reflected on average values of treeness (Table 1; Fig. 7D and H), which measures the degree to which the total length of the phylogenetic tree is retained in internal branches. Despite the presence of evolutionary signal, there is a great deal of incongruence reflected on the average values of Internode Certainty (Table 1) and their distribution across the genome (Supplementary Fig. S10). 
IC values were measured by contrasting the support of the main topology (sC) against the alternatives (sD1 and sD2) for each node, averaged over different taxonomic distances, from more closely related species in the fraterculus group to species groups. This analysis showed that in general there are no major discrepancies across their genomes, though there is more heterogeneity at lower taxonomic contrasts (Supplementary Fig. S10). A comparison of these estimates, with average f4-ratio per gene as a conservative estimate of introgression, suggested that introgression decreases as you investigate more divergent taxa (Table 1). Like IC estimates, f4-ratio values vary along the genome over different taxonomic distances, showing higher heterogeneity among more closely related taxa (Supplementary Fig. S11).
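To make the contrast between sC and its alternatives concrete, the sketch below computes an entropy-based internode certainty from three support values. It assumes the commonly used form in which the focal topology is compared with the better supported of the two alternatives, IC = 1 + p1·log2(p1) + p2·log2(p2); the exact formula applied in this study is given in its Methods and may differ. The function name and the example values are hypothetical.

```python
import math


def internode_certainty(s_c, s_d1, s_d2):
    """Entropy-based IC between the focal support and the stronger alternative.

    Assumption: only the better supported of the two alternatives is contrasted
    with the focal topology. Returns 1.0 when there is no conflict and 0.0 when
    focal and alternative are equally supported.
    """
    alternative = max(s_d1, s_d2)
    total = s_c + alternative
    if total <= 0:
        raise ValueError("at least one topology needs positive support")
    ic = 1.0
    for support in (s_c, alternative):
        p = support / total
        if p > 0:  # p * log2(p) tends to 0 as p tends to 0
            ic += p * math.log2(p)
    return ic


# Hypothetical support values (percent of sites) for one node.
ic_no_conflict = internode_certainty(100.0, 0.0, 0.0)
ic_full_conflict = internode_certainty(40.0, 40.0, 20.0)
ic_partial = internode_certainty(75.0, 25.0, 0.0)
```

Averaging such per-node values over nodes at a given taxonomic level would give summaries comparable in spirit to those reported in Table 1.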

Table 1 Average values for different evolutionary and population parameters across the whole dataset and in each selected subset.

Treeness and average IC were significantly higher in the sgC60 and Sortadate subsets, when compared to random gene subsets, and lower in the info subsets (Table 1; Fig. 7 and Supplementary Fig. S12). Interestingly, both info subsets had lower average IC values than random genes on the genome (Table 1, which also shows average values of random subsets of 20 and 37 loci for all attributes here investigated). The introgression estimate f4-ratio was significantly lower in info subsets than on the complete dataset, or in other subsets as well (Table 1 and Supplementary Fig. S13), but this could be a consequence of the absence of several closely related samples, so much so that there is no intrapopulational f4-ratio estimate from info subsets.
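Treeness, as defined above, is the share of total tree length carried by internal branches. The following sketch computes it for a small tree; the nested (branch_length, children) representation, the function name, and the example trees are assumptions chosen for illustration and are not taken from the software used in the study.

```python
def treeness(root):
    """Sum of internal branch lengths divided by total tree length.

    Each node is a tuple (branch_length, children); leaves have an empty
    children list. The root's own branch length is ignored.
    """
    internal_length = 0.0
    total_length = 0.0
    stack = list(root[1])
    while stack:
        length, children = stack.pop()
        total_length += length
        if children:  # a branch leading to an internal node
            internal_length += length
            stack.extend(children)
    if total_length == 0:
        raise ValueError("tree has zero total length")
    return internal_length / total_length


# Hypothetical tree ((A:1,B:1):2,(C:1,D:1):2): internal length 4, total length 8.
balanced = (0.0, [
    (2.0, [(1.0, []), (1.0, [])]),
    (2.0, [(1.0, []), (1.0, [])]),
])
# Hypothetical star tree (A:1,B:1,C:1): no internal branches.
star = (0.0, [(1.0, []), (1.0, []), (1.0, [])])

balanced_treeness = treeness(balanced)
star_treeness = treeness(star)
```

Higher values indicate that more of the signal sits on branches shared among taxa rather than on terminal branches.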

Fig. 7

Violin plots of values of (A) Log(dN/dS), (B) Evolutionary rate, (C) Saturation and (D) Treeness for the fraterculus dataset, and (E) Log(dN/dS), (F) Evolutionary rate, (G) Saturation and (H) Treeness for the Anastrepha dataset. Sets are as follows: all (corresponding Anastrepha or fraterculus dataset), subsets sgC60_20 and sgC60_37, Sortadate_20, Sortadate_37, Info_20 and Info_37, as well as random sets of 20 and 37 loci from the whole dataset (all_20 and all_37 are resamplings of the current datasets, whereas set2023_20 and set2023_37 are resamplings from the informative loci from Congrains et al., 2023).

Discussion

Previously, we created a workflow to identify and align orthologs from different taxa that was very useful to infer phylogenetic relationships among different species of Anastrepha13. This is relevant when you consider that this genus is composed of numerous closely related taxa, especially in the fraterculus group where divergence may have been affected by gene flow6,11,14,24 which complicates inference of their phylogenetic relationships. This complexity limits the ability of a reduced set of markers to effectively identify different taxa and affects practical and streamlined applications that would profit from the use of a smaller, more cost- and time-effective, subset. In that paper, we used a large dataset of 3,170 genes, as well as a more reduced set of over 100 phylogenetically informative loci to identify lineages in the fraterculus group which in general agrees with the current taxonomy13. Here we explore three strategies to identify smaller subsets of markers that would remain useful not only for species identification, but also for phylogenetic inference among more closely related taxa and even lineages in the A. fraterculus complex. The search for minimal gene sets is central to comparative genomics46 although due to current database sizes, it is dependent on phylogenetic scoring across extensive gene sets which may introduce biases, as genes vary across time and space in evolutionary and populational parameters47. We tested three approaches to investigate phylogenetic relationships by leveraging different phylogenetic and population parameters that allowed us to identify subsets that were informative for the purpose of phylogenetic inference and lineage identification using groups of either 20 or 37 loci.

A central question of this study was to determine the effectiveness of smaller subsets in inferring accurate phylogenetic relationships across different levels of the phylogenetic hierarchy, in particular between species and species groups. In general, our results show the effectiveness of these reduced datasets to infer relationships among more distantly related, and even among more closely related species, based on the comparison to a genome-scale dataset. It is noteworthy that all three chosen strategies to find phylogenetically informative genomic regions for the Anastrepha phylogeny, be it considering 20 or 37 genomic regions, produced topologies that were very similar to the one produced with the full dataset of 3,170 genomic regions. Not only did these estimates produce high support values, but the subsets had higher congruence among genes than the full datasets. Furthermore, average gene support and Internode Certainty increased in our subsets compared either to the whole dataset, or with more encompassing groups, particularly at the species level, which indicates that discordance is being mitigated at the intended level.

All gene subsets show overall congruence with regard to the position of Anastrepha relative to the outgroups in all of the analyses we performed, confirming its purported monophyly12. Furthermore, our results corroborate the use of the other Tephritidae genera here as adequate outgroups for this analysis, since it has been suggested that concordance, rather than bootstrap support, should drive outgroup delimitation16. Even with the commonalities, there are a few differences among inferences from the different subsets that are mostly related to shallow nodes and closely related taxa. One caveat is that subsets Info_20 and Info_37 failed to include all samples (missing the species A. hadracantha, A. leptozona, A. curitis, A. striata, A. psidivora, and some specimens in the fraterculus species group) and thus resolved only about half of the species nodes. This is most likely due to the inclusion of genomic regions that were highly variable, large, and had only portions of the gene recaptured for different samples. Despite this sampling limitation for Info_20 and Info_37, these subsets produced the same topology as the full dataset for samples that were included. However, the more limited sampling affected introgression estimates, since there were several samples missing in the Info datasets, which limited our ability to estimate introgression among closely related lineages of A. fraterculus. Additionally, given that the hierarchical levels analyzed here (genera, species groups, species) differ greatly in terms of genetic divergence, it is expected that the number of variable sites per category will also vary. However, despite this inherent bias, the gene sets remain reasonably consistent when considering the variability of categories that include more closely related taxa. For example, the majority of genes in the Info_20 and Info_37 datasets remain the most informative loci when comparing subsets that exclude outgroups, or even only among species in the fraterculus species group.

High mean gene congruence was observed across all subsets, indicating that the sampling was effective at procuring subsets of genes that better reflect phylogenetic relationships. Despite this congruence, there was a significant difference in likelihood support when comparing subset trees to trees from the full Anastrepha dataset using the Shimodaira-Hasegawa (SH), weighted SH, Expected Likelihood Weight, or approximately unbiased tests, with the exception of sgC60_37 (Supplementary Table S3). Most of the few differences between the topologies of subsets and the full dataset affected only a few closely related taxa or taxa with established uncertainty. Examples include the position of A. turpiniae along A. fraterculus Clade III, Clade IV, and Clade V, as well as the position of A. obliqua from Colombia, which is basal in the full Anastrepha dataset but in the middle of other Brazilian populations in the fraterculus group dataset. These incongruences can be clearly observed in the PhyParts analyses (Fig. 4), since the placement of A. fraterculus Clade V along A. turpiniae is supported by only about 5% of the genomic regions, whereas about 50% support any of the alternative relationships.

The different subsets effectively recovered phylogenetic relationships among taxa inferred from the full dataset, as previously obtained13, despite the fact that different parameters were chosen to create each subset. All subsets had higher average coalescent estimates than the full dataset and than random sets of genes. Overall, Sortadate and sgC60 shared more attributes with each other, such as higher average IC and treeness, whereas Info subsets had lower average IC across several parts of the hierarchy, lower introgression estimates, and higher levels of saturation. The Sortadate strategy chooses regions that show the lowest evolutionary rate variation across different branches combined with the highest bipartition support. The latter was estimated considering the topology inferred by the whole Anastrepha dataset from ASTRAL or ML. Sortadate subsets, regardless of using 20 or 37 loci, showed average values for most evaluated parameters. However, they exhibited the highest average treeness and the longest average coalescent times per node overall and per species node, suggesting that these Sortadate sets produce longer branches descending from internal nodes. The sgC60 subsets had the highest average IC values when compared to other regions of the genome, for several levels of the hierarchy. This result is not surprising, considering this parameter is based on sC values across different nodes. sgC60 subsets also showed higher dN/dS rates estimated from more distantly related species (though not significant) and among more closely related taxa in the fraterculus dataset. Despite higher values of omega, in general the sgC60 subset had significantly lower evolutionary rates than other random genes across the entire genome.
Finally, neither sgC60 nor Sortadate showed significantly different f4-ratio estimates from random genes, or the full dataset (though Sortadate_37 was marginally higher), indicating that in general these genes did not show different rates of introgression compared to other regions of the genome. Based on these comparisons, Info subsets exhibited lower phylogenetic consistency than the subsets produced using the other two strategies. In particular, the Sortadate subsets demonstrated a good balance of conservativeness, characterized by the lowest missing data, highest coalescent averages, high Treeness, and average Internode Certainty (IC) values that effectively reproduced the original topology from the full dataset. This finding positions the Sortadate set as an excellent candidate for further investigation, including samples across the genus and particularly within the intricate fraterculus group.
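The Sortadate criterion discussed above combines bipartition support with low root-to-tip variation. As a loose illustration of that idea, the sketch below ranks loci by bipartition support first and by root-to-tip variance second. This is not SortaDate's implementation: the tree representation, the function names, the ordering of the two criteria, and the toy values are all assumptions made for this example.

```python
from statistics import pvariance


def root_to_tip_distances(node, depth=0.0):
    """Return the path length from the root to every tip.

    Each node is a tuple (branch_length, children); leaves have no children.
    """
    length, children = node
    here = depth + length
    if not children:
        return [here]
    distances = []
    for child in children:
        distances.extend(root_to_tip_distances(child, here))
    return distances


def rank_loci(loci):
    """Rank loci: highest bipartition support first, then lowest root-to-tip variance.

    loci maps a locus name to (gene_tree, bipartition_support), where the support
    is the fraction of species-tree bipartitions found in the gene tree.
    """
    def sort_key(name):
        tree, support = loci[name]
        return (-support, pvariance(root_to_tip_distances(tree)), name)

    return sorted(loci, key=sort_key)


# Hypothetical gene trees: one clock-like, one with unequal root-to-tip paths.
clocklike = (0.0, [
    (2.0, [(1.0, []), (1.0, [])]),
    (2.0, [(1.0, []), (1.0, [])]),
])
uneven = (0.0, [(1.0, []), (5.0, [])])

toy_loci = {
    "g1": (clocklike, 0.9),
    "g2": (uneven, 0.9),
    "g3": (clocklike, 0.5),
}
ranking = rank_loci(toy_loci)
```

Taking the first 20 or 37 names of such a ranking would mirror the subset sizes compared in this study.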

The general pattern of phylogenetic incongruence among different regions across the genome may be a consequence of differential responses to adaptation (and different evolutionary rates might be an indication of that), but most often is a consequence of other processes as well, such as ancestral lineage sorting and introgression35,48,49. It is the combination of these different processes which complicates phylogenetic inferences in Anastrepha in general, and species in the fraterculus group in particular. A few studies have suggested that individual markers, such as morphological, behavioral, or molecular markers using a few loci, in some cases even a single locus, such as COI or ITS, might be informative to study variation across Anastrepha50,51,52,53. Even though these markers in general may be effective at investigating aspects of variation across the genus, and can differentiate some species, our studies have indicated that they alone fail to provide resolution more broadly due to the complex evolution that has shaped this genus13.

Considering that several of these taxa diverged recently in the presence of gene flow11,13,51, and that several samples studied here belong to the same species, there should be no expectation of consistent topological congruence across the genome. The identification of genes that account for evolutionary processes such as introgression, gene flow, and hybridization is critical for clarifying phylogenetic relationships among closely related and cryptic species. Genome-wide scans for “barcode” loci are increasingly used to find informative genomic regions. For Anastrepha, this task is further complicated by their adaptation to diverse sets of hosts and environments1,54,55, which might produce differential selective pressures across the genome (and which has proved complex to study in other tephritid species56,57), as well as by other complex population processes, such as the asymmetric gene flow observed in some species13,15. Understanding how introgression influences speciation aids in elucidating adaptive processes and, given the complex spatial and biological boundaries of species58,59, future studies will require broader sampling not only of specimens from different localities and species, but also across the genome.

The analysis of sets of orthologs, followed by their reduction to smaller subsets of genes with phylogenetic potential, goes beyond an academic exercise. The genus Anastrepha includes many pest species that cause significant damage to fruit crops of great commercial value43,60, several of which are only properly differentiated with the help of qualified taxonomists. Developing genetic markers for identifying these species offers a promising tool that should complement traditional morphometric methods. This is even more important considering the predominant use of adult female morphometric data for proper identification of different taxa in the genus, even among less closely related taxa1,2. Many Anastrepha male adults, larvae, and pupae cannot currently be identified to species level based on morphology, due to a gap in knowledge or absence of informative characters, despite recent advances3,4,42. Having a subset of informative genes could speed up and improve identification, allowing for more efficient and targeted control (e.g., species-specific management strategies, which can reduce the costs of insecticides and other expensive, environmentally harmful methods). Additionally, diverse genomic markers, such as the ones identified here, provide flexibility for implementation with various technologies (e.g., amplicon sequencing61, probe-based capture approaches) that might require specific genomic features (lack of introns, specific length, etc.). As sequencing data continues to become more accessible, genomic sequencing in deeper parts of the Anastrepha phylogeny, and particularly across the distribution of their several taxa, should be paramount to establishing the most effective ways to identify informative loci.

Materials and methods

Data source and processing

We explored data previously produced from a pipeline that relied on the selection and alignment of orthologous genes through the integration of transcriptomic and genomic data in a phylogenetic framework. The pipeline we used is a slight modification of what has been described elsewhere13,14. Unlike in our previous analysis13, we did not implement a filter based on taxon occupancy, to avoid discarding potentially informative genes whose orthology inference has failed due to high evolutionary rates. The data were derived from genomic and transcriptomic sequences from Anastrepha specimens and five specimens from other genera of Tephritidae used as outgroups (Supplementary Table S1). These data were used to produce two datasets of aligned sequences so that the analyses could be performed hierarchically: one focusing on species of the fraterculus group (fraterculus dataset) and a more encompassing one that investigated patterns of evolution across the genus Anastrepha (Anastrepha dataset). The Anastrepha dataset used 3,170 orthologs from 36 specimens of 15 species of Anastrepha (21 when considering lineages of the fraterculus complex as separate taxa), as well as the outgroups, whereas the fraterculus dataset comprised 3,168 orthologs from 27 specimens of 8 species, encompassing all samples from the fraterculus group, A. psidivora, which is considered incertae sedis, and two outgroups, A. striata and A. bistrigata; the latter dataset has a greater emphasis on species of the fraterculus group13. Because the different taxa in the A. fraterculus complex have not been formally separated into named species, we use the term A. fraterculus s.l. to refer to an assemblage of lineages that forms a polyphyletic taxon, though we recognize several lineages, supported by different attributes, as potentially valid separate taxa in the complex.
When we consider all these taxa separately, we have a total of 13 taxa in the fraterculus dataset. The Anastrepha dataset expands the analyses also to other species groups and includes Zeugodacus (Bactrocera) cucurbitae, Bactrocera dorsalis, Bactrocera oleae, Ceratitis capitata, and Rhagoletis zephyria as outgroups (Supplementary Table S1).

Species trees inference and concordance analysis

We used DNA-based alignments for the subsequent phylogenetic analyses. For each alignment of orthologous genes, and independently for each dataset (fraterculus and Anastrepha), we used ModelTest-NG62 to infer the best-fit nucleotide substitution model based on the Bayesian information criterion (BIC); this model was then used in RAxML-NG63 to infer maximum likelihood (ML) gene trees with 200 rapid bootstrap replicates. For both datasets, species trees were inferred with a concatenation method, in which gene alignments were combined into a super-matrix, and with multispecies coalescent approaches, which combine gene trees. The concatenation analysis incorporated the best-fit model for each gene and employed 200 bootstraps in IQ-TREE v. 2.1.264. Multispecies coalescent trees were estimated from the ML gene trees using default parameters in ASTRAL-III v. 5.7.765. Phylogenetic support was evaluated through the gene concordance factor, local posterior probabilities (PP), and quartet support. The gene concordance factor, which represents the proportion of gene trees whose topologies are compatible with a branch in the species tree66, was calculated using IQ-TREE v. 2.1.264. Local PP and quartet support, which are based on the quartet frequencies in the gene trees67, were estimated using ASTRAL v. 5.7.765.

We used PhyParts68, a bipartition-based method, to investigate the proportion of genes within our comprehensive dataset that exhibit higher concordance with the species tree generated by ASTRAL, running the concordance and conflict analysis with a bootstrap support threshold < 70%. Subsequently, we used the Python script PhyPartsPieCharts.py (https://github.com/mossmatters/phyloscripts) to visualize the results with pie charts that show, for each node, the number of gene trees that were concordant, conflicting, or uninformative with respect to the species tree. We used Gotree69 throughout the workflow to remove zero-length branches and to root trees.

Search for phylogenetically informative genes

Previous studies have been successful at using molecular data to bring some phylogenetic resolution among different species of Anastrepha, especially those belonging to the fraterculus group, by using a large number of loci. Here we investigated whether we could find similar resolution using a reduced set of phylogenetically informative genes, and to that end we tried three alternative methods. We chose strategies that would define the smallest number of loci that were still effective at identifying a set of taxonomically and topologically well-supported nodes. The procedures considered two sets of nodes that are informative at different hierarchical levels, one that separates more distantly related taxa, such as genera, species groups, and species in the genus Anastrepha, and another that investigates lineages of the A. fraterculus complex. Whenever a species group was represented by a single species, we allocated this node to the species group. On the other hand, lineages in the A. fraterculus species complex were considered as “species” when they were associated with other traits that would support a stronger claim of their independence. This is akin to what has been used elsewhere13, though we expanded the nodes previously considered to account for more divergent taxa.

The initial strategies used to select phylogenetically informative sets, which are exploratory in nature, relied on data derived from IQ-TREE v. 2.1.264,70. From these analyses, we tested two approaches to identify sets of informative genes for elucidating relationships among Anastrepha species, and particularly species in the fraterculus group. For the first approach, we used a strategy that simply considered the number of parsimony-informative sites per gene, as defined elsewhere71, and retained all genes with over 1,000 variable sites, as reported by IQ-TREE v. 2.1.2. This cutoff value was established because a cursory investigation of data generated by Congrains, et al.13 suggested that there might be topology resolution when investigating between 20 and 40 genes. In this case, a small gap separated the class of genes with over 1,000 sites from the others (data not shown). The 37 genes with over 1,000 variable sites are referred to as Info_37. To make the other approaches/subsets comparable, we also selected 37 genes in the following strategies. The second approach retained genes that had sCF values above 60% for phylogenetically informative nodes (nodes defining the hierarchical levels discussed above), based on the ML phylogenetic tree inferred from that gene, as well as an average gCF value above 50%. This set is referred to as sgC60_37. For the third approach, we used the SortaDate software (https://github.com/FePhyFoFum/sortadate), employing a combination of two criteria to choose a subset of genes: largest bipartition support and minimization of root-to-tip variation. SortaDate assesses similarities between gene trees and the species tree, focusing on tree length variation, and retains only genes that meet certain optimality criteria for phylogenetic information44. This set is referred to as Sortadate_37.
We also explored subsets for each of these strategies with a more reduced number of loci, and we present here results for 20 loci per subset (referred to as Info_20, sgC60_20, and Sortadate_20, respectively), which could make these analyses more amenable for further practical applications.
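
The threshold-based selection used in the first two approaches can be illustrated with a minimal Python sketch. This is not the pipeline used in this study: the gene names, the per-gene values, and the dictionary layout are hypothetical, and we assume here that the sCF cutoff must be met at every informative node, which is only one possible reading of the criterion.

```python
from statistics import mean

# Hypothetical per-gene summaries (illustrative values, not data from this study).
# "scf_nodes" and "gcf_nodes" hold percentages for the informative nodes only.
genes = {
    "geneA": {"variable_sites": 1250, "scf_nodes": [72.0, 65.5, 61.0], "gcf_nodes": [58.0, 49.0, 60.0]},
    "geneB": {"variable_sites": 800, "scf_nodes": [80.0, 75.0, 70.0], "gcf_nodes": [55.0, 52.0, 51.0]},
    "geneC": {"variable_sites": 1500, "scf_nodes": [55.0, 90.0, 88.0], "gcf_nodes": [70.0, 70.0, 70.0]},
    "geneD": {"variable_sites": 1001, "scf_nodes": [61.0, 62.0, 63.0], "gcf_nodes": [40.0, 45.0, 50.0]},
}


def select_info(gene_stats, min_sites=1000):
    """Info-style filter: keep genes with more than `min_sites` variable sites."""
    return sorted(g for g, s in gene_stats.items() if s["variable_sites"] > min_sites)


def select_sgc(gene_stats, min_scf=60.0, min_mean_gcf=50.0):
    """sgC-style filter: sCF above `min_scf` at every informative node
    (an assumption) and average gCF above `min_mean_gcf`."""
    return sorted(
        g
        for g, s in gene_stats.items()
        if all(v > min_scf for v in s["scf_nodes"]) and mean(s["gcf_nodes"]) > min_mean_gcf
    )


info_set = select_info(genes)
sgc_set = select_sgc(genes)
```

In this toy example the two filters overlap only partially, which mirrors why the strategies were evaluated separately.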

We evaluated the ability of the three strategies to produce a reduced number of loci that were still informative to identify lineages at different phylogenetic levels in the Anastrepha phylogeny by contrasting the species trees inferred in ASTRAL using only the selected genes with those derived from the full datasets. To do so, we used the Gene Concordance Factor (gCF) estimated in IQ-TREE v. 2.1.2 to test node support across the phylogeny, especially for well-defined groups, i.e., clades that are associated with taxonomically or morphologically recognized taxa (following Congrains, et al.13 and shown in the Results). This was further analyzed using PhyParts68.

Incomplete lineage sorting (ILS) and introgression

Given the history of recent divergence, incomplete lineage sorting (ILS), and introgression in Tephritidae72,73, particularly in Anastrepha13,14,15,51, we conducted specific analyses to investigate the relationship between concordance factors and the bootstrap support values generated by IQ-TREE v. 2.1.2. We used these concordance factors to test the assumptions of an ILS model.

We tested for introgression using the Dsuite computational tool (available at https://github.com/millanek/Dsuite) to calculate Patterson’s D statistic, also known as the ABBA-BABA test, along the genome74. These parameters were adopted to infer signs of introgression between lineages. This study focused on alleles of common ancestry among combinations of three species whose arrangements are phylogenetically tenable. Due to the heterogeneous distribution of the locations of the genes across the genome, we also performed the f4-ratio analysis75, which examines introgression signs delimited to specific genomic intervals. We used several sets of species (see Supplementary Table S4) to obtain a compounded general estimate of introgression per gene that would not depend on an individual species set but would instead indicate the general propensity of each marker to move across lineage (or “species”) boundaries. We used Benjamini-Hochberg (BH) corrections to adjust p-values to account for multiple tests, and produced figures using Microsoft Excel and R76.
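
As an illustration of the two quantities mentioned above, the following Python sketch computes Patterson's D from site-pattern counts and applies the Benjamini-Hochberg adjustment. It is a simplified, count-based version for exposition only: Dsuite works from allele frequencies and estimates significance by jackknifing, neither of which is reproduced here, and the function names are our own.

```python
def patterson_d(n_abba, n_baba):
    """Patterson's D from counts of ABBA and BABA site patterns.

    D = (ABBA - BABA) / (ABBA + BABA); values near 0 are expected under
    incomplete lineage sorting alone, whereas an excess of either pattern
    suggests introgression.
    """
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (n_abba - n_baba) / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


d_example = patterson_d(120, 80)
adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```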

For the purpose of investigating introgression, we treated different lineages of A. fraterculus as separate species when there were other traits to support that separation, following previous phylogenetic analyses that indicated several clades13,14,25, referred to here as A. fraterculus clades I through VII. Furthermore, some of these analyses were only performed for taxa for which we had more than one specimen sampled.

Phylogenetic and evolutionary parameters

Several parameters were estimated from aligned orthologs to indicate levels of polymorphism, phylogenetic signal, phylogenetic congruence, and levels of introgression using different procedures, which we used to contrast each reduced subset with the full dataset. This analysis aimed to investigate whether these parameters were associated with different gene subsets. We estimated transition and transversion evolutionary rates, calculated as the total length of the tree divided by the number of terminals77, and their levels of saturation, using the Saturation test78 in Phykit79. We also used Phykit to estimate evolutionary rates, defined as one minus the sum of squared frequency of different bases at a given site, and treeness80, which measures how well evolutionary distances among specimens are explained by trees rather than networks, with higher values denoting a higher signal-to-noise ratio.
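
Two of the tree-based parameters can be made concrete with a short Python sketch: the evolutionary rate as total tree length divided by the number of terminals, and treeness computed as the sum of internal branch lengths divided by the total tree length. The regular-expression Newick reader below is a deliberately minimal assumption (it does not handle quoted labels or comments) and is not how Phykit parses trees.

```python
import re

# Matches "<optional ')'><optional label>:<branch length>" in a Newick string.
# A leading ')' means the branch subtends a clade (internal branch).
_BRANCH = re.compile(r"(\)?)([^:,();]*):([0-9.eE+\-]+)")


def branch_lengths(newick):
    """Return (internal_lengths, terminal_lengths) from a simple Newick string."""
    internal, terminal = [], []
    for closes_clade, _label, length in _BRANCH.findall(newick):
        (internal if closes_clade else terminal).append(float(length))
    return internal, terminal


def treeness(newick):
    """Sum of internal branch lengths divided by total tree length."""
    internal, terminal = branch_lengths(newick)
    return sum(internal) / (sum(internal) + sum(terminal))


def evolutionary_rate(newick):
    """Total tree length divided by the number of terminals."""
    internal, terminal = branch_lengths(newick)
    return (sum(internal) + sum(terminal)) / len(terminal)


# Hypothetical four-taxon gene tree used only for illustration.
example_tree = "((A:0.1,B:0.2):0.3,(C:0.1,D:0.1):0.2);"
example_treeness = treeness(example_tree)
example_rate = evolutionary_rate(example_tree)
```

For the example tree, half of the total length lies on internal branches, so longer internal branches relative to terminal ones raise treeness.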

We also used IQ-TREE v. 2.1.264,70 to produce a consensus phylogenetic tree based on gene ML trees. The analysis employed 1,000 bootstrap replicates to assess statistical robustness of the consensus tree. We employed a heuristic search process with 100 clusters for sequence grouping (--scf 100), with Coalescence Hidden Markov Model (CHMM), and the double-forest approach (--df-tree), with the verbose option (--cf-verbose). These analyses were used to estimate gCF (gene concordance factor = gCF_N/gN%), sCF (site concordance factor with an average of 100 quartets = sCF_N/sN%), and sC (number of concordant sites with an average of more than 100 quartets), as well as average coalescent times per node. We estimated Internode Certainty per node per gene according to Kobert, et al.41 and also the average coalescence time in coalescent units per node using IQ-TREE16. Individual values were averaged over branches depending on the comparisons performed.
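
For readers unfamiliar with these measures, the sketch below shows in Python how gCF and a basic form of Internode Certainty can be computed from counts. The IC function follows the original two-bipartition definition (one plus the sum of p·log2(p) over the reference bipartition and its most frequent conflicting bipartition); it does not implement the adjustments for partial gene trees described by Kobert, et al.41, and the function and argument names are hypothetical.

```python
import math


def gene_concordance_factor(n_concordant, n_decisive):
    """gCF (%) = concordant gene trees / decisive gene trees * 100."""
    if n_decisive <= 0:
        raise ValueError("at least one decisive gene tree is required")
    return 100.0 * n_concordant / n_decisive


def internode_certainty(freq_reference, freq_conflict):
    """Basic IC for a branch from the frequency of the reference bipartition
    and of its most prevalent conflicting bipartition.

    IC = 1 + p1*log2(p1) + p2*log2(p2), where p_i = f_i / (f1 + f2).
    IC is 1 when there is no conflict and 0 when both are equally frequent.
    """
    total = freq_reference + freq_conflict
    if total <= 0:
        raise ValueError("bipartition frequencies must sum to a positive value")
    ic = 1.0
    for freq in (freq_reference, freq_conflict):
        p = freq / total
        if p > 0:  # the limit of p*log2(p) as p approaches 0 is 0
            ic += p * math.log2(p)
    return ic


gcf_example = gene_concordance_factor(30, 120)
ic_example = internode_certainty(75, 25)
```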

We investigated patterns of selection on individual gene regions by estimating the ratio of nonsynonymous to synonymous changes (ω) on clusters of orthologs across branches of phylogenetic trees. The number of synonymous (dS) and nonsynonymous (dN) changes and their ratio (dN/dS or ω) reflect patterns of selection on the region, whereby ω > 1 is considered strong evidence of positive selection for amino acid substitutions, while ω ≈ 0 indicates purifying selection81. This analysis was performed using the branch-site model BUSTED82 implemented in Datamonkey83. Additionally, we generated 200 groups of 20 and 37 loci randomly pulled from the whole dataset and the set of genes selected in a previous study13 using a custom Python script (https://github.com/popphylotools/sampling_random_trees). This analysis allowed us to compare the distribution of each parameter (ω, evolutionary rates, saturation, and treeness) across the full dataset, the selected subsets, and the randomly subsampled groups of genes.
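
The random subsampling step can be sketched in Python as follows. This is not the custom script cited above: the locus identifiers are placeholders, the seed is arbitrary, and the empirical p-value shown is only one simple way, assumed here for illustration, of contrasting a selected subset with the random groups.

```python
import random


def random_subsets(loci, size, n_groups=200, seed=0):
    """Draw `n_groups` random subsets of `size` loci, each without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [rng.sample(loci, size) for _ in range(n_groups)]


def subset_mean(subset, values):
    """Average a per-locus parameter (e.g., treeness) over one subset."""
    return sum(values[locus] for locus in subset) / len(subset)


def empirical_p(observed_mean, null_means):
    """Fraction of random groups whose mean is at least the observed mean,
    with the usual +1 correction to avoid a p-value of zero."""
    hits = sum(1 for m in null_means if m >= observed_mean)
    return (hits + 1) / (len(null_means) + 1)


# Placeholder identifiers standing in for the 3,170 orthologs.
all_loci = [f"OG{i:04d}" for i in range(3170)]
groups_20 = random_subsets(all_loci, 20)
groups_37 = random_subsets(all_loci, 37)
```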

Tests of tree compatibility

The software IQ-TREE v. 2.1.264 was used to test whether the topology produced by each subset is significantly different from the topology inferred from the Anastrepha dataset. We tested whether the maximum likelihood scores of each subset were significantly different from the score of the full dataset using the Shimodaira-Hasegawa test84, expected likelihood weights85, and the approximately unbiased test86, all performed in IQ-TREE v. 2.1.264. A visual comparison of the topology produced by the full Anastrepha dataset and each subset topology was performed in R76 using the function cophylo in phytools87 to create co-phylogenetic plots and identify incongruences between the topology produced by the full dataset and that of each subset.

Gene mapping

We used Exonerate88 to align one sequence for each ortholog cluster against the genome of A. ludens89 (GenBank: GCA_028408465.1) using default parameters and the est2genome model to generate a physical map indicating the location of genes on chromosomes. To do so, we chose sequences from A. ludens, whenever available, otherwise we used sequences of A. distincta, which is closely related to A. ludens. The interactive mapping visualization across chromosomes was performed using ChromoMap v0.4.190. These analyses were performed on the full set of markers, as well as on the phylogenetically reduced sets of 20 and 37 markers.