Identifying sets of phylogenetically informative markers for Anastrepha (Diptera: Tephritidae)

de Brito, Reinaldo A.; dos Santos, Edyane Moraes; de Freitas, Patricia Domingues; Dupuis, Julian R.; Congrains, Carlos

doi:10.1038/s41598-025-16399-2

Article
Open access
Published: 01 October 2025

Scientific Reports volume 15, Article number: 34258 (2025)

Abstract

Phylogenomic analyses have revolutionized our understanding of evolutionary relationships, yet they are complicated by incongruence across the genome. Here, we reanalyzed a genomic dataset comprising 3,170 orthologs and evaluated three methods to identify reduced sets of loci that can accurately resolve evolutionary relationships among Anastrepha fruit flies. Previous phylogenetic analyses consistently revealed well-supported topologies for deeper evolutionary relationships, while more recent divergences, particularly within the A. fraterculus complex, exhibited high levels of phylogenetic incongruence due to gene flow, incomplete lineage sorting, or other evolutionary forces. We explored three strategies for selecting reduced subsets: number of informative sites per gene, site concordance factor above 60% for clades consistent with current taxonomy, and tip-to-root variation/bipartition support. Among the strategies tested, subsets based on concordance and evolutionary rate metrics produced topologies consistent with full dataset analyses, with reduced levels of discordance. These subsets maintained robust support for deeper relationships while increasing congruence at shallower nodes. Although genes in reduced subsets exhibited lower evolutionary rates, they had higher internode certainty, treeness, and coalescent times. These findings highlight the potential for carefully selected loci to improve phylogenetic resolution and mitigate conflicting signals. Our study offers practical approaches for refining phylogenomic analyses in systems with complex evolutionary histories.

Introduction

The genus Anastrepha encompasses over 300 identified species, several of which are pests of great economic importance in the Neotropical region¹. As with several other Tephritidae, species identification in this genus is hampered by a general reliance on female morphology². Other strategies have been tried with mixed results, particularly in more closely related species^3,4,5, but more emphasis has been given to the use of genetic data to investigate their phylogenetic relationships, using different genetic regions with variable success^6,7,8,9,10,11,12. Recent phylogenetic and phylogenomic studies using genomic and transcriptomic data have clarified phylogenetic relationships across different taxonomic levels, such as among species groups, and even among more closely related species in the fraterculus group, which harbors most of the pestiferous species in the genus^13,14. These data have been analyzed employing a combination of methods, from concatenated supergenes to coalescence and congruence analysis of individual gene trees, to explore genetic variability across multiple loci¹³. The use of multiple methods has revealed that a combination of evolutionary factors, such as introgression, recent and rapid divergence, and high levels of ancestral polymorphism, limits our ability to infer phylogenetic relationships among Anastrepha^13,15,16.

The differentiation between ancestral polymorphism and introgression is not trivial¹⁷, but recent phylogenomic studies have indicated that recent divergence and introgression may both complicate species identification and recognition in Anastrepha^13,14. As more genomic data has emerged, it has become clearer that introgression has a much larger role in differentiation across several different taxa^18,19,20,21. This may have happened because of historical and/or recurring events, and may even be involved with species adaptation to different selective pressures²², a process that can increase pest adaptability to new environments but can also affect phylogenetic inference and hinder a better understanding of their diversity²³. This may be one of the reasons why different genetic markers have produced discordant phylogenetic inferences in Anastrepha, particularly among more closely related species in the fraterculus group, and has limited our ability to tackle the differentiation of several lineages in A. fraterculus s.l., which is considered to be a species complex^24,25,26,27. The ability to investigate a greater number of specimens across their distribution may be essential to help investigate species boundaries and the evolutionary forces involved in their differentiation, but the cost of whole-genome analyses for a geographically comprehensive sampling may still be prohibitive.

The use of a reduced subset of markers for phylogenetic analyses, or phylogenomic subsampling, has emerged as a valuable approach^28,29,30,31,32 to offset the costs of producing and analyzing large datasets³², particularly for population studies. The focus on a carefully selected subset of genes may enable targeting phylogenetically informative loci that help resolve contentious or unstable nodes while mitigating confounding factors such as saturation, alignment gaps^33,34, or even introgression and ancestral polymorphism^35,36. Different criteria have been used in subsampling protocols aimed at identifying reliable loci, including information quantity (e.g., alignment length or data completeness) and quality (e.g., phylogenetic signal or resistance to systematic biases). These approaches vary widely, with some focusing on evolutionary rates, favoring loci with fast, slow, or intermediate rates, or using rate-homogeneous partitions to reduce noise^37,38. Others employed the parameter referred to as phylogenetic informativeness (PI) to predict loci likely to resolve relationships accurately^39,40, or Internode Certainty (IC), which indicates loci that properly infer correct phylogenetic relationships⁴¹. Different strategies⁴² have used these criteria individually, or combined, to achieve adequate subsampling, but the results have been conflicting^32,33.

Finding a reduced set of markers capable of discriminating Anastrepha species, particularly among closely related taxa in the fraterculus group, may be an invaluable tool for their understanding, monitoring, and control. The proper identification of these important pests at the species level is crucial for taking appropriate pest control measures. However, for many species, identification based on morphological characteristics requires highly trained taxonomists, and in some cases, the presence of intermediate or conflicting morphological characteristics may complicate this task⁴³. Using a reliable and cost-effective molecular strategy can help taxonomists make confident decisions for individuals where morphological identification is ambiguous. Additionally, in cases where traditional methods are insufficient, such as for males or individuals at early developmental stages, molecular techniques may be the sole means of species discrimination⁴².

Previously, we have used over 3,000 loci to investigate phylogenetic relationships among species and lineages of Anastrepha and suggested that subsets of ~ 30 informative loci could be used for species identification in Anastrepha¹³. Here we test three independent protocols to purposefully subsample genomic/transcriptomic data to identify a potentially minimal set of loci that are both phylogenetically informative and robust against the effects of introgression and ancestral lineage sorting. The identification of a small subset of informative loci is critical for a broader analysis across large populations, but also to help design cost-effective strategies for species identification in Anastrepha, paramount for the development of tools to monitor and control several important pest species that belong to this genus.

Results

The Anastrepha dataset investigated here is derived from the reanalysis of whole genome sequencing, complete genome assemblies, or transcriptomes from 36 specimens of 15 Anastrepha species, which included seven species groups and 17 samples from the A. fraterculus complex across South America and Mexico, as well as five species of other tephritid genera used as outgroups (Supplementary Table S1). This dataset consisted of 3,170 clusters of orthologs that had an average of 1,432 bases per cluster, providing a total of 4,551,036 bases and about 20.5% missing data for the ingroup. The fraterculus dataset was composed of a subset of the Anastrepha dataset, including 13 species of the fraterculus group, A. psidivora, as well as A. striata and A. bistrigata as outgroups. This set was composed of 3,168 clusters of orthologs with an average of 1,545 bases per cluster, totaling 4,895,310 bases and about 21% missing data. We explored the lack of phylogenetic signal at different parts of the topology to investigate which evolutionary forces might be influencing this pattern, following the flowchart delineated in Fig. 1.

Species tree inference and concordance analyses

The topologies of the species trees produced using different methodological approaches (such as super-matrix via concatenation and multispecies coalescence) derived from the Anastrepha dataset (encompassing all Anastrepha samples and outgroups) as well as the fraterculus dataset (restricted to samples from the fraterculus species group, A. psidivora, and two outgroups) were highly congruent with each other and with previous analyses¹³. Figure 2 shows the ASTRAL inference derived from individual gene topologies, and the maximum likelihood tree of the full concatenated dataset is shown in Supplementary Fig. S1. The respective results for the more reduced fraterculus set are shown in Supplementary Figs. S2 and S3. Most clades associated with potentially valid taxa had high support (100% bootstrap for the concatenated approach and 1.0 PP for the multispecies coalescent approach), despite also showing high levels of phylogenetic conflict across different genes. This pattern is reflected in congruence values (measured by gCF, sCF, or quartet support) smaller than 50%, which tend to decrease when contrasting more closely related taxa; deeper nodes therefore had, in general, greater support and more concordance across different markers. This pattern is found in both datasets, but since their results are very similar, we show here the PhyParts tree based on the whole Anastrepha dataset (Fig. 3), and the fraterculus PhyParts tree as Supplementary Fig. S4.

Though in general congruence increases with evolutionary distance, there were some shallower nodes with high support as well, even among some intraspecific nodes. Because the topologies produced from both datasets do not vary from what previous analyses have shown, where a more detailed analysis of these relationships was presented¹³, we will only emphasize here aspects that are enriched by the congruence analyses. The main point of discordance between different datasets and methodologies entails the separation of clades III, IV, and V of A. fraterculus in western South America and A. turpiniae. Both datasets produced basically the same topology among these taxa when comparing concatenated ML and ASTRAL inferences, although ASTRAL separated A. fraterculus Clades III and IV. Furthermore, the fraterculus dataset ML tree separates these different western South American A. fraterculus clades, but places A. turpiniae as a sister taxon.

The lack of resolution among lineages in western South America is reflected in the congruence analyses, which show 60–70% discordance across different genes for most of these relationships and quartet supports between 2% and 20% (Fig. 2). That is similar to what is observed across the fraterculus species group, with gCF support values for several different species failing to exceed 35%. In fact, the A. fraterculus complex exhibited considerable variation in gCF support, ranging from 11 to 70% for different lineages, some of which may represent different species^6,13. The support for the fraterculus species group, on the other hand, is very robust (75% of the genes supporting this lineage) and shows A. psidivora as sister taxon to the fraterculus group. As we compare more distantly related taxa, the level of congruence across different genes tends to increase, being around 48% for the comparisons across species groups and reaching over 90% for the separation of genera, placing Anastrepha closer to Rhagoletis than to Ceratitis, Bactrocera, and Zeugodacus; this result is corroborated by other phylogenetic analyses⁴⁵ as well as by the current taxonomy, which places the former two genera in Trypetinae, whereas the latter are in Dacinae.
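As an illustration of the gene concordance values quoted above, the following is a minimal sketch of how a gCF-like percentage could be computed for one branch. It assumes each gene tree has already been reduced to the set of clades it contains; the `Clade` encoding, the function name, and the toy taxa are hypothetical and are not taken from the study. Dedicated software additionally restricts the count to gene trees that are decisive for the branch, which this sketch does not do.

```python
from typing import FrozenSet, List, Set

# Illustrative encoding: a clade is the set of tip labels descending from a branch.
Clade = FrozenSet[str]


def gene_concordance_factor(branch: Clade, gene_trees: List[Set[Clade]]) -> float:
    """Percentage of gene trees that contain the given branch (clade)."""
    if not gene_trees:
        raise ValueError("need at least one gene tree")
    supporting = sum(1 for tree in gene_trees if branch in tree)
    return 100.0 * supporting / len(gene_trees)


# Toy example with three hypothetical taxa: two of four gene trees group A with B.
example_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"B", "C"}), frozenset({"A", "B", "C"})},
]
example_gcf = gene_concordance_factor(frozenset({"A", "B"}), example_trees)
```

In this toy case the branch uniting A and B is found in half of the gene trees, which is the kind of value (below or near 50%) reported for many shallow nodes in the text.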

Despite high levels of incongruence across different loci on the genome, indicated by sCF and gCF values below 50% in several nodes, the consistency across datasets and methodologies provides support for the lineages we observed, which also correlates well with genera, species groups and species (even in the fraterculus complex). We explored these results by investigating whether we could find a smaller assortment of genes that might be more phylogenetically informative and congruent than the average loci as measured by multiple parameters.

Phylogenetic inferences and the search for informative genes

We employed three different methods to create subsets of informative markers: highest number of parsimony-informative sites, over 60% of sites with high sCF and gCF values, and SortaDate, which considers bipartition support and minimization of tip-to-root variation; we refer to these as “Info”, “sgC60”, and “Sortadate”, respectively (see methods for details). For each method, we also compared two different numbers of markers, 37 and 20 (see methods for rationale), which is reflected in dataset names (e.g., “Info_37”). The three strategies chosen to find a reduced set of informative genes (see list of genes and some parameters in Supplementary Table S2) produced similar, but not fully congruent, topologies in general. Observed discrepancies involved closely related samples from the same species, especially with sets of 20 genes rather than 37, which can be visualized in the cophylo analyses (Fig. 4). Even though all major phylogenetic relationships were recovered in each of the subsets when compared to the full dataset (as can be seen in Fig. 4), tests of tree compatibility show that there is a significant difference in their likelihood of explaining the full dataset using Shimodaira-Hasegawa (SH), weighted SH, Expected Likelihood Weight, or approximately unbiased (AU) tests (Supplementary Table S3), with the exception of sgC60_37. As before, these results were very similar when comparing the two main datasets, so we chose to present information only for the Anastrepha dataset, considering that the markers chosen should provide information across different divergence spectra in the genus, not only among the fraterculus species group.
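The "Info" criterion described above (ranking loci by their number of parsimony-informative sites) can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it assumes each locus is a list of equal-length aligned sequences, ignores gaps and ambiguity codes, and uses the standard definition of a parsimony-informative site (at least two states, each present in at least two sequences). The function names and the toy alignments are hypothetical.

```python
from collections import Counter
from typing import Dict, List

VALID_BASES = set("ACGT")  # gaps and ambiguity codes are ignored (assumption)


def count_informative_sites(alignment: List[str]) -> int:
    """Count columns with at least two bases that each occur in >= 2 sequences."""
    informative = 0
    for column in zip(*alignment):
        counts = Counter(c.upper() for c in column if c.upper() in VALID_BASES)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def top_informative_loci(alignments: Dict[str, List[str]], n_loci: int) -> List[str]:
    """Return the n_loci locus names with the most informative sites."""
    scores = {name: count_informative_sites(aln) for name, aln in alignments.items()}
    # Sort by descending score; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda name: (-scores[name], name))[:n_loci]


# Toy alignments for three hypothetical loci.
toy_alignments = {
    "locusA": ["ACGT", "ACGA", "TCGA", "TCGT"],  # columns 1 and 4 are informative
    "locusB": ["AAAA", "AAAT", "AAAA", "AAAA"],  # only a singleton change
    "locusC": ["AC-T", "ACGT", "GCGT", "GC-T"],  # column 1 is informative
}
toy_selection = top_informative_loci(toy_alignments, 2)
```

In the study the analogous cut-offs were 20 and 37 loci. A threshold-based criterion such as sgC60 could be expressed in the same style by filtering loci on a precomputed concordance score instead of sorting.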

The “Info” subsets, which comprise the genes with the highest number of informative characters, included loci that were missing in some specimens, leading the Info_20 and Info_37 datasets to have a smaller number of taxa than the full dataset (Figs. 4C and F, 5D and 6D, contrasted against 5A and 6A). Overall, however, the topologies of Info_20 and Info_37 were compatible with that of the full dataset, and the missing taxa (A. hadracantha, A. leptozona, A. curitis, A. striata, A. psidivora, and a few A. fraterculus) were associated with lower congruence values in the full dataset results. The other two strategies, Sortadate and sgC60, produced topologies that represented all samples, with less phylogenetic discordance across loci (the remaining contrasts in Figs. 4, 5, and 6, namely Figs. 4A and D, 5B and 6B for sgC60 and 4B, 4E, 5C, 6C for Sortadate).

The percentage of genes supporting a specific topology or specific nodes varied, but was generally higher in the subsets of informative genes than in the whole datasets (Table 1; Figs. 5 and 6). Around 33% of the loci were informative and compatible with the topology produced by the full dataset, which had 38 total nodes. The subsets are represented by different sets of loci, with only three loci in common between sgC60 and Sortadate. Loci were also distributed across all chromosomes (Supplementary Fig. S5), although six of the seven markers on the X chromosome (scaffold NC_071503.1) came from the Info subsets. All subsets had a higher number of genes supporting each node, at the expense of the number of nodes resolved, and this was more pronounced in the smaller subsets of 20 genes (Table 1). The Sortadate strategy produced, on average, the best results, with the highest number of nodes recovered (16 of the 17 nodes that separate species and 36 of 37 nodes in general) by the largest number of genes (over 52%, compared to 32% for the full dataset), though sgC60 also recovered a high number of nodes (15 of 17 species nodes and 33 overall). Both strategies only failed to resolve the phylogenetic relationships among the western South American A. fraterculus clades and A. turpiniae. While the Info subsets had the highest number of loci supporting their inference (> 50%), they missed several nodes and recovered only nine of the 17 species nodes. As support levels varied across different lineages and different subsets, we explored other parameters to understand specific patterns and processes associated with the sets of selected loci at different levels in the hierarchy (Supplementary Fig. S6 shows nodes that were used to represent separation of species, species groups, and genera).

Association of phylogenetic parameters to reduced subsets

We investigated several phylogenetic and evolutionary parameters across genes included in our subsets, summarized in Table 1 and visually represented in Fig. 7. In general, these parameters indicate that the genomic regions surrounding informative loci are evolving under purifying selection, indicated by average low values of omega (ω) for both the Anastrepha and fraterculus datasets (Table 1; Fig. 7A and E, and Supplementary Fig. S7). The fraterculus dataset, which compared more closely related taxa, had slightly higher average ω than the Anastrepha dataset (0.10 and 0.06, respectively), which could indicate saturation. However, a separate estimation of saturation between the datasets suggests that saturation more strongly affected the more divergent taxa and did not impact the phylogenetic signal (Table 1; Fig. 7C and G, and Supplementary Fig. S8). Additionally, average evolutionary rates were higher in Tephritidae than when comparing more closely related taxa in Anastrepha (Table 1; Fig. 7B and F, and Supplementary Fig. S9), indicating that the higher saturation values are not enough to quash the association between evolutionary distance and change. It is relevant, though, that all subsets had higher average coalescent time values than the Anastrepha dataset, indicating that these subsets are more effective at separating more closely related lineages. This is also reflected on average values of treeness (Table 1; Fig. 7D and H), which measures the degree to which the total length of the phylogenetic tree is retained in internal branches. Despite the presence of evolutionary signal, there is a great deal of incongruence reflected on the average values of Internode Certainty (Table 1) and their distribution across the genome (Supplementary Fig. S10). 
IC values were measured by contrasting the support of the main topology (sC) against the alternatives (sD1 and sD2) for each node, averaged over different taxonomic distances, from more closely related species in the fraterculus group to species groups. This analysis showed that in general there are no major discrepancies across their genomes, though there is more heterogeneity at lower taxonomic contrasts (Supplementary Fig. S10). A comparison of these estimates with the average f4-ratio per gene, used as a conservative estimate of introgression, suggested that introgression decreases as more divergent taxa are compared (Table 1). Like IC estimates, f4-ratio values vary along the genome over different taxonomic distances, showing higher heterogeneity among more closely related taxa (Supplementary Fig. S11).
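Treeness, defined above as the share of total tree length retained in internal branches, can be illustrated with a short sketch. The `Node` class, the function name, and the toy tree are hypothetical; the sketch works on a rooted tree and treats the two branches descending from the root separately, which is a simplification relative to an unrooted calculation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; `length` is the length of the branch leading to it."""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def treeness(root: Node) -> float:
    """Sum of internal branch lengths divided by total tree length."""
    internal = 0.0
    total = 0.0
    stack = list(root.children)  # the root has no subtending branch
    while stack:
        node = stack.pop()
        total += node.length
        if node.children:  # a branch leading to a non-tip node is internal
            internal += node.length
        stack.extend(node.children)
    if total == 0.0:
        raise ValueError("tree has zero total length")
    return internal / total


# Toy tree ((t1:0.125, t2:0.125):0.25, t3:0.5); branch lengths are hypothetical.
toy_tree = Node(children=[
    Node(0.25, [Node(0.125), Node(0.125)]),
    Node(0.5),
])
toy_treeness = treeness(toy_tree)
```

Here one quarter of the total length lies on the single internal branch. Higher values indicate that more of the signal sits on internal branches rather than on terminal ones, which is the sense in which the text compares subsets.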

Table 1 Average values for different evolutionary and population parameters across the whole dataset and in each selected subset.

Treeness and average IC were significantly higher in the sgC60 and Sortadate subsets, when compared to random gene subsets, and lower in the info subsets (Table 1; Fig. 7 and Supplementary Fig. S12). Interestingly, both info subsets had lower average IC values than random genes on the genome (Table 1, which also shows average values of random subsets of 20 and 37 loci for all attributes here investigated). The introgression estimate f4-ratio was significantly lower in info subsets than on the complete dataset, or in other subsets as well (Table 1 and Supplementary Fig. S13), but this could be a consequence of the absence of several closely related samples, so much so that there is no intrapopulational f4-ratio estimate from info subsets.
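One way a subset could be compared against random gene subsets of the same size, as in the contrasts above, is an empirical resampling test. The sketch below is an assumption about how such a comparison might be set up, since this section does not state the exact test used; the function name, the number of replicates, the seed, and the toy per-gene values are all hypothetical.

```python
import random
from statistics import mean
from typing import Dict, Sequence


def empirical_p_value(values: Dict[str, float], subset: Sequence[str],
                      n_reps: int = 999, seed: int = 0) -> float:
    """One-sided empirical p-value that the subset mean exceeds random subsets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(values[g] for g in subset)
    names = sorted(values)
    hits = 0
    for _ in range(n_reps):
        draw = rng.sample(names, len(subset))  # random subset of equal size
        if mean(values[g] for g in draw) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_reps + 1)


# Toy data: 100 hypothetical genes with a per-gene statistic (e.g., treeness).
toy_values = {f"gene{i:03d}": i / 100 for i in range(100)}
toy_subset = [f"gene{i:03d}" for i in range(95, 100)]  # the five highest values
toy_p = empirical_p_value(toy_values, toy_subset)
```

The same function could be applied with subsets of 20 or 37 loci and with any per-gene attribute, for example average IC or f4-ratio.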

Discussion

Previously, we created a workflow to identify and align orthologs from different taxa that was very useful to infer phylogenetic relationships among different species of Anastrepha¹³. This is relevant because this genus is composed of numerous closely related taxa, especially in the fraterculus group, where divergence may have been affected by gene flow^6,11,14,24, which complicates inference of their phylogenetic relationships. This complexity limits the ability of a reduced set of markers to effectively identify different taxa and affects practical and streamlined applications that would profit from the use of a smaller, more cost- and time-effective subset. In that paper, we used a large dataset of 3,170 genes, as well as a more reduced set of over 100 phylogenetically informative loci, to identify lineages in the fraterculus group, which in general agree with the current taxonomy¹³. Here we explore three strategies to identify smaller subsets of markers that would remain useful not only for species identification, but also for phylogenetic inference among more closely related taxa and even lineages in the A. fraterculus complex. The search for minimal gene sets is central to comparative genomics⁴⁶, although, due to current database sizes, it depends on phylogenetic scoring across extensive gene sets, which may introduce biases, as genes vary across time and space in evolutionary and populational parameters⁴⁷. We tested three approaches to investigate phylogenetic relationships by leveraging different phylogenetic and population parameters, which allowed us to identify subsets that were informative for phylogenetic inference and lineage identification using groups of either 20 or 37 loci.

A central question of this study was to determine the effectiveness of smaller subsets in inferring accurate phylogenetic relationships across different levels of the phylogenetic hierarchy, in particular between species and species groups. In general, our results show the effectiveness of these reduced datasets to infer relationships among more distantly related, and even among more closely related species, based on the comparison to a genome-scale dataset. It is noteworthy that all three chosen strategies to find phylogenetically informative genomic regions for the Anastrepha phylogeny, be it considering 20 or 37 genomic regions, produced topologies that were very similar to the one produced with the full dataset of 3,170 genomic regions. Not only did these estimates produce high support values, but the subsets had higher congruence among genes than the full datasets. Furthermore, average gene support and Internode Certainty increased in our subsets compared either to the whole dataset, or with more encompassing groups, particularly at the species level, which indicates that discordance is being mitigated at the intended level.

All gene subsets show overall congruence with regard to the position of Anastrepha relative to the outgroups in all of the analyses we performed, confirming its purported monophyly¹². Furthermore, our results corroborate the use of the other Tephritidae genera here as adequate outgroups for this analysis, since it has been suggested that concordance, rather than bootstrap support, should guide outgroup delimitation¹⁶. Even with the commonalities, there are a few differences among inferences from the different subsets that are mostly related to shallow nodes and closely related taxa. One caveat is that subsets Info_20 and Info_37 failed to include all samples (missing the species A. hadracantha, A. leptozona, A. curitis, A. striata, A. psidivora, and some specimens in the fraterculus species group) and thus resolved only about half of the species nodes. This is most likely due to the inclusion of genomic regions that were highly variable, large, and had only portions of the gene recaptured for different samples. Despite this sampling limitation for Info_20 and Info_37, these subsets produced the same topology as the full dataset for samples that were included. However, the more limited sampling affected introgression estimates, since several samples were missing in the Info datasets, which limited our ability to estimate introgression among closely related lineages of A. fraterculus. Additionally, given that the hierarchical levels analyzed here (genera, species groups, species) differ greatly in terms of genetic divergence, it is expected that the number of variable sites per category will also vary. However, despite this inherent bias, the gene sets remain reasonably consistent when considering the variability of categories that include more closely related taxa. For example, the majority of genes in the Info_20 and Info_37 datasets remain the most informative loci when comparing subsets that exclude outgroups, or even only among species in the fraterculus species group.

High mean gene congruence was observed across all subsets, indicating that the sampling was effective at procuring subsets of genes that better reflect phylogenetic relationships. Despite this congruence, there was a significant difference in likelihood support when comparing subset trees to trees from the full Anastrepha dataset using Shimodaira-Hasegawa (SH), weighted SH, Expected Likelihood Weight, or approximately unbiased tests, with the exception of sgC60_37 (Supplementary Table S3). Most of the few differences between the topologies of subsets and the full dataset affected only a few closely related taxa or taxa with established uncertainty. Examples include the position of A. turpiniae relative to A. fraterculus Clade III, Clade IV, and Clade V, as well as the position of A. obliqua from Colombia, which is basal in the full Anastrepha dataset but nested among other Brazilian populations in the fraterculus group dataset. These incongruences can be clearly observed in the PhyParts analyses (Fig. 4), since the placement of A. fraterculus Clade V with A. turpiniae is supported by only about 5% of the genomic regions, whereas about 50% support any of the alternative relationships.

The different subsets effectively recovered phylogenetic relationships among taxa inferred from the full dataset, as previously obtained¹³, despite the fact that different parameters were chosen to create each subset. All subsets had higher average coalescent estimates than the full dataset and than random sets of genes. Overall, Sortadate and sgC60 shared more attributes with each other, such as higher average IC and treeness, whereas Info subsets had lower average IC across several parts of the hierarchy, lower introgression estimates, and higher levels of saturation. The Sortadate strategy chooses regions that show the lowest evolutionary rate variation across different branches combined with the highest bipartition support. The latter was estimated considering the topology inferred by the whole Anastrepha dataset from ASTRAL or ML. Sortadate subsets, regardless of using 20 or 37 loci, showed average values for most evaluated parameters. However, they exhibited the highest average treeness and the longest average coalescent times per node overall and per species node, suggesting that these Sortadate sets produce longer branches descending from internal nodes. The sgC60 subsets had the highest average IC values when compared to other regions of the genome, for several levels of the hierarchy. This result is not surprising, considering this parameter is based on sC values across different nodes. sgC60 subsets also showed higher dN/dS rates estimated from more distantly related species (though not significant) and among more closely related taxa in the fraterculus dataset. Despite higher values of omega, in general the sgC60 subset had significantly lower evolutionary rates than other random genes across the entire genome. 
Finally, neither sgC60 nor Sortadate showed significantly different f4-ratio estimates from random genes, or the full dataset (though Sortadate_37 was marginally higher), indicating that in general these genes did not show different rates of introgression compared to other regions of the genome. Based on these comparisons, Info subsets exhibited lower phylogenetic consistency than the subsets produced using the other two strategies. In particular, the Sortadate subsets demonstrated a good balance of conservativeness, characterized by the lowest missing data, highest coalescent averages, high Treeness, and average Internode Certainty (IC) values that effectively reproduced the original topology from the full dataset. This finding positions the Sortadate set as an excellent candidate for further investigation, including samples across the genus and particularly within the intricate fraterculus group.
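The Sortadate criterion discussed above (low variation in root-to-tip distance combined with high bipartition support) can be sketched as a simple ranking. How the two criteria are weighted is not stated in this section, so the lexicographic ordering below (variance first, support as tie-breaker) is an assumption, as are the `GeneStats` fields, the function name, and the toy values.

```python
from statistics import pvariance
from typing import List, NamedTuple


class GeneStats(NamedTuple):
    """Per-gene summary; field names are illustrative, not from the study."""
    name: str
    root_to_tip: List[float]    # root-to-tip path length for each tip
    bipartition_support: float  # share of reference bipartitions in the gene tree


def rank_clocklike_genes(genes: List[GeneStats], n_loci: int) -> List[str]:
    """Rank genes by low root-to-tip variance, then by high bipartition support."""
    ranked = sorted(
        genes,
        key=lambda g: (pvariance(g.root_to_tip), -g.bipartition_support, g.name),
    )
    return [g.name for g in ranked[:n_loci]]


# Toy values for three hypothetical genes.
toy_genes = [
    GeneStats("g1", [1.0, 1.0, 1.0], 0.5),  # clock-like, moderate support
    GeneStats("g2", [1.0, 2.0, 3.0], 0.9),  # well supported but rate-variable
    GeneStats("g3", [2.0, 2.0, 2.0], 0.8),  # clock-like, high support
]
toy_ranking = rank_clocklike_genes(toy_genes, 2)
```

With this ordering the rate-variable gene is excluded even though it has the highest bipartition support; a different ordering of the criteria would change which loci are retained.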

The general pattern of phylogenetic incongruence among different regions across the genome may be a consequence of differential responses to adaptation (and different evolutionary rates might be an indication of that), but most often is a consequence of other processes as well, such as ancestral lineage sorting and introgression^35,48,49. It is the combination of these different processes which complicates phylogenetic inferences in Anastrepha in general, and species in the fraterculus group in particular. A few studies have suggested that individual markers, such as morphological, behavioral, or molecular markers using a few loci, in some cases even a single locus, such as COI or ITS, might be informative to study variation across Anastrepha^50,51,52,53. Even though these markers in general may be effective at investigating aspects of variation across the genus, and can differentiate some species, our studies have indicated that they alone fail to provide resolution more broadly due to the complex evolution that has shaped this genus¹³.

Considering that several of these taxa diverged recently in the presence of gene flow^11,13,51, and several samples studied here belong to the same species, there should be no expectation that consistent topological congruence would be found across the genome. The identification of genes that account for evolutionary processes such as introgression, gene flow, and hybridization is critical for clarifying phylogenetic relationships among closely related and cryptic species. Genome-wide scans for “barcode” loci are increasingly used to find informative genomic regions. For Anastrepha, this task is further complicated by their adaptation to diverse sets of hosts and environments^1,54,55, which might produce differential selective pressures across the genome (and in other tephritid species have proved complex to study^56,57), as well as by other complex populational processes, such as asymmetric gene flow observed in some species^13,15. Understanding how introgression influences speciation aids in elucidating adaptive processes, and given the complex spatial and biological boundaries of species^58,59, future studies will require broader sampling not only of specimens from different localities and species, but also across the genome.

The analysis of sets of orthologs, followed by their reduction to smaller subsets of genes with phylogenetic potential, goes beyond an academic exercise. The genus Anastrepha includes many pest species that cause significant damage to fruit crops of great commercial value^43,60, several of which can only be properly differentiated with the help of qualified taxonomists. Developing genetic markers for identifying these species offers a promising tool that should complement traditional morphometric methods. This is even more important considering the predominant use of adult female morphometric data for the identification of taxa in the genus, even among less closely related taxa^1,2. Many Anastrepha adult males, larvae, and pupae cannot currently be identified to species level based on morphology, owing to gaps in knowledge or the absence of informative characters, despite recent advances^3,4,42. Having a subset of informative genes could speed up and improve identification, allowing for more efficient and targeted control (e.g., species-specific management strategies that can reduce the costs of insecticides and other expensive, environmentally harmful methods). Additionally, diverse genomic markers, such as the ones identified here, provide flexibility for implementation with various technologies (e.g., amplicon sequencing⁶¹, probe-based capture approaches) that might require specific genomic features (lack of introns, specific length, etc.). As sequencing data continue to become more accessible, genomic sequencing in deeper parts of the Anastrepha phylogeny, and particularly across the distributions of its several taxa, should be paramount to establishing the most effective ways to identify informative loci.

Materials and methods

Data source and processing

We explored data previously produced by a pipeline that selects and aligns orthologous genes by integrating transcriptomic and genomic data in a phylogenetic framework. The pipeline is a slight modification of what has been described elsewhere^13,14. Unlike in our previous analysis¹³, we did not implement a filter based on taxon occupancy, to avoid discarding potentially informative genes whose orthology inference failed because of high evolutionary rates. The data were derived from genomic and transcriptomic sequences from Anastrepha specimens and five specimens from other genera of Tephritidae used as outgroups (Supplementary Table S1). These data were used to produce two datasets of aligned sequences so that the analyses could be performed hierarchically: one focused on species of the fraterculus group (fraterculus dataset), and a more encompassing one that investigated patterns of evolution across the genus Anastrepha (Anastrepha dataset). The Anastrepha dataset comprised 3,170 orthologs from 36 specimens of 15 species of Anastrepha (21 when lineages of the fraterculus complex are considered separate taxa), as well as the outgroups. The fraterculus dataset comprised 3,168 orthologs from 27 specimens of 8 species, encompassing all samples from the fraterculus group, A. psidivora, which is considered incertae sedis, and two outgroups, A. striata and A. bistrigata; this dataset places greater emphasis on species of the fraterculus group¹³. Because the different taxa in the A. fraterculus complex have not been formally described as separate species with proper names, we use the term A. fraterculus s.l. to refer to an assemblage of lineages that forms a polyphyletic taxon, though we recognize several lineages that are supported by different attributes as potentially valid separate taxa in the complex. When we consider all these taxa separately, there are 13 taxa in the fraterculus dataset. The Anastrepha dataset also extends the analyses to other species groups and includes Zeugodacus (Bactrocera) cucurbitae, Bactrocera dorsalis, Bactrocera oleae, Ceratitis capitata, and Rhagoletis zephyria as outgroups (Supplementary Table S1).

Species trees inference and concordance analysis

We used DNA-based alignments for the subsequent phylogenetic analyses. For each alignment of orthologous genes, and independently for each dataset (fraterculus and Anastrepha), we used ModelTest-NG⁶² to select the best-fit nucleotide substitution model based on the Bayesian information criterion (BIC). The selected model was then used in RAxML-NG⁶³ to infer maximum likelihood (ML) gene trees with 200 rapid bootstrap replicates. For both datasets, species trees were inferred by concatenating gene alignments into a supermatrix and also by combining gene trees through multispecies coalescent approaches. The concatenation analysis incorporated the best-fit model for each gene and employed 200 bootstrap replicates in IQ-TREE v. 2.1.2⁶⁴. Multispecies coalescent trees were estimated from the ML gene trees using default parameters in ASTRAL-III v. 5.7.7⁶⁵. Phylogenetic support was evaluated through the gene concordance factor, local posterior probabilities (PP), and quartet support. The gene concordance factor represents the proportion of gene trees whose topologies are compatible with a given branch in the species tree⁶⁶ and was calculated using IQ-TREE v. 2.1.2⁶⁴. Local PP and quartet support were computed from the quartet frequencies in the gene trees⁶⁷ using ASTRAL-III v. 5.7.7⁶⁵.
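To illustrate how a gene concordance factor can be read, the following minimal Python sketch computes, for one species-tree branch, the percentage of gene trees that contain the same bipartition. It is not the IQ-TREE implementation: it assumes that every gene tree includes all taxa (IQ-TREE counts only gene trees that are decisive for the branch), and the function names and toy taxa are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

Bipartition = FrozenSet[str]  # one side of a split, stored in canonical form


def canonical_split(side: Iterable[str], taxa: Set[str]) -> Bipartition:
    """Return the side of the split that excludes a fixed reference taxon."""
    side = frozenset(side)
    reference = min(taxa)  # any fixed taxon works as the reference
    return frozenset(taxa - side) if reference in side else side


def gene_concordance_factor(branch: Bipartition,
                            gene_tree_splits: List[Set[Bipartition]]) -> float:
    """Percentage of gene trees that contain the species-tree branch.

    Assumption: all gene trees carry the full taxon set, so every tree
    is counted in the denominator.
    """
    if not gene_tree_splits:
        raise ValueError("no gene trees supplied")
    concordant = sum(1 for splits in gene_tree_splits if branch in splits)
    return 100.0 * concordant / len(gene_tree_splits)


# Toy example with five hypothetical taxa and four gene trees.
taxa = {"A", "B", "C", "D", "E"}
branch_ab = canonical_split({"A", "B"}, taxa)
gene_trees = [
    {branch_ab, canonical_split({"D", "E"}, taxa)},
    {branch_ab},
    {canonical_split({"A", "C"}, taxa)},
    {branch_ab},
]
print(gene_concordance_factor(branch_ab, gene_trees))
```

In this toy case three of the four gene trees contain the A+B branch, so the function returns 75%.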

We used PhyParts⁶⁸, a bipartition-based method for concordance and conflict analysis, with a bootstrap support threshold < 70%, to investigate the proportion of genes in our comprehensive dataset that are concordant with the species tree generated by ASTRAL. Subsequently, we used the Python script PhyPartsPieCharts.py (https://github.com/mossmatters/phyloscripts) to visualize the results as pie charts that show, for each node, the number of gene trees that were concordant, conflicting, or uninformative with respect to the species tree. We used Gotree⁶⁹ throughout the workflow to remove zero-length branches and to root trees.
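The three categories shown in the pie charts can be illustrated with a simplified bipartition comparison. The sketch below is not PhyParts itself: it works on unrooted splits with complete taxon sampling, and it reads the support threshold as "gene-tree branches with bootstrap support below 70% are ignored", which is an assumption. All names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Split = Tuple[FrozenSet[str], FrozenSet[str]]  # the two sides of a bipartition


def compatible(s1: Split, s2: Split) -> bool:
    """Two splits of the same taxa are compatible if some pair of sides is disjoint."""
    return any(not (a & b) for a in s1 for b in s2)


def same_split(s1: Split, s2: Split) -> bool:
    """Splits are identical regardless of the order of their two sides."""
    return set(s1) == set(s2)


def classify_gene_tree(node_split: Split,
                       gene_splits: List[Tuple[Split, float]],
                       min_support: float = 70.0) -> str:
    """Label a gene tree relative to one species-tree node.

    gene_splits holds (split, bootstrap support) pairs for the gene tree.
    Branches below min_support are dropped before comparison (assumed reading).
    """
    supported = [s for s, support in gene_splits if support >= min_support]
    if any(same_split(node_split, s) for s in supported):
        return "concordant"
    if any(not compatible(node_split, s) for s in supported):
        return "conflicting"
    return "uninformative"


def make_split(side, taxa) -> Split:
    """Build a split from one side and the full taxon set."""
    side = frozenset(side)
    return (side, frozenset(taxa) - side)


taxa = {"A", "B", "C", "D", "E"}
node = make_split({"A", "B"}, taxa)
print(classify_gene_tree(node, [(make_split({"A", "C"}, taxa), 95.0)]))
```

A gene tree whose only well-supported branches are compatible with, but different from, the node (or whose conflicting branches are poorly supported) ends up in the uninformative class.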

Search for phylogenetically informative genes

Previous studies have successfully used molecular data from a large number of loci to bring some phylogenetic resolution to different species of Anastrepha, especially those belonging to the fraterculus group. Here we investigated whether we could achieve similar resolution with a reduced set of phylogenetically informative genes, and to that end we tried three alternative methods. We chose strategies that would define the smallest number of loci that were still effective at identifying a set of taxonomically and topologically well-supported nodes. The procedures considered two sets of nodes that are informative at different hierarchical levels: one that separates more distantly related taxa, such as genera, species groups, and species in the genus Anastrepha, and another that investigates lineages of the A. fraterculus complex. Whenever a species group was represented by a single species, we allocated that node to the species group. Lineages within A. fraterculus, on the other hand, were considered as “species” when they were associated with other traits that would support a stronger claim of their independence in the A. fraterculus species complex. This is akin to what has been done elsewhere¹³, though we expanded the set of nodes previously considered to account for more divergent taxa.

The initial strategies used to select phylogenetically informative sets, which are exploratory in nature, relied on data derived from IQ-TREE v. 2.1.2^64,70. From these analyses, we tested two approaches to identify sets of informative genes for elucidating relationships among Anastrepha species, and particularly among species in the fraterculus group. The first approach simply considered the number of parsimony-informative sites per gene, as defined by Nei and Kumar⁷¹ and derived from IQ-TREE v. 2.1.2, and retained all genes with over 1,000 variable sites. This cutoff was established because a cursory investigation of the data generated in Congrains et al.¹³ suggested that topological resolution might be achieved with between 20 and 40 genes. In this case, a small gap separated the class of genes with over 1,000 sites from the others (data not shown). The genes with over 1,000 variable sites (37 genes) form the set referred to as Info_37. To make the other subsets comparable, we also selected 37 genes in the following strategies. The second approach retained genes that had sCF values above 60% for phylogenetically informative nodes (the nodes defining the hierarchical levels discussed above), based on the ML tree inferred from each gene, as well as an average gCF value above 50%. This set is referred to as sgC60_37. For the third approach, we used SortaDate (https://github.com/FePhyFoFum/sortadate), combining two criteria to choose a subset of genes: highest bipartition support and lowest tip-to-root variation. SortaDate assesses similarities between gene trees and the species tree, focusing on tree length variation, and retains only genes that meet certain optimality criteria for phylogenetic information⁴⁴. This set is referred to as Sortadate_37. We also explored subsets with fewer loci for each of these strategies, and we present here results for 20 loci per subset (referred to as Info_20, sgC60_20, and Sortadate_20, respectively), which could make these analyses more amenable to practical applications.
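The first selection strategy can be illustrated with a short Python sketch that counts parsimony-informative sites (columns in which at least two bases each occur in at least two sequences) and keeps genes above a cutoff. This is an independent illustration, not the IQ-TREE routine used in the study; ignoring gaps and ambiguity codes is an assumption, and the gene names, toy alignments, and toy cutoff are hypothetical (the study used a cutoff of 1,000 sites).

```python
from collections import Counter
from typing import Dict, List

VALID_BASES = set("ACGT")  # assumption: gaps and ambiguity codes are ignored


def count_parsimony_informative(alignment: List[str]) -> int:
    """Count columns with at least two bases that each occur in two or more sequences."""
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    informative = 0
    for column in zip(*alignment):
        counts = Counter(b.upper() for b in column if b.upper() in VALID_BASES)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def select_genes(alignments: Dict[str, List[str]], min_sites: int) -> List[str]:
    """Return the names of genes with more than min_sites informative sites."""
    return sorted(gene for gene, aln in alignments.items()
                  if count_parsimony_informative(aln) > min_sites)


# Toy alignments; real ortholog alignments are thousands of sites long.
toy = {
    "gene_a": ["AAGT", "AAGA", "ACCA", "ACCT"],  # three informative columns
    "gene_b": ["ACGT", "ACGT", "ACGA", "-CGA"],  # one informative column
}
print(select_genes(toy, min_sites=2))
```

With the toy cutoff of two sites only gene_a is retained; the same filter with a cutoff of 1,000 would correspond to the rule described for Info_37.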

We evaluated the ability of the three strategies to produce a reduced number of loci that were still informative for identifying lineages at different levels of the Anastrepha phylogeny by contrasting the species trees inferred in ASTRAL from only the selected genes with those derived from the full datasets. We did this by using the gene concordance factor (gCF) calculated in IQ-TREE v. 2.1.2 to test node support across the phylogeny, especially for well-defined groups, i.e., clades that are associated with taxonomically or morphologically recognized taxa (following Congrains et al.¹³ and shown in the Results). This was further analyzed using PhyParts⁶⁸.

Incomplete lineage sorting (ILS) and introgression

Given the history of recent divergence, incomplete lineage sorting (ILS), and introgression in Tephritidae^72,73, particularly in Anastrepha^13,14,15,51, we conducted specific analyses to investigate the relationship between concordance factors and the bootstrap support values generated by IQ-TREE v. 2.1.2. We used these concordance factors to test the assumptions of an ILS model.

We tested for introgression using Dsuite (available at https://github.com/millanek/Dsuite) to calculate Patterson’s D statistic, also known as the ABBA-BABA statistic, along the genome⁷⁴. These statistics were used to infer signs of introgression between lineages. We focused on exploring alleles of common ancestry among trios of species whose arrangements are phylogenetically tenable. Because the genes are heterogeneously distributed across the genome, we also performed the f4-ratio analysis⁷⁵, which examines signs of introgression within specific genomic intervals. We used several sets of species (see Supplementary Table S4) to obtain a compound estimate of introgression per gene that would not depend on an individual species set but rather would indicate the general propensity of each marker to move across lineage (or “species”) boundaries. We used Benjamini-Hochberg (BH) corrections to adjust p-values for multiple testing and produced figures using Microsoft Excel and R⁷⁶.
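For readers unfamiliar with these quantities, the sketch below shows the standard form of Patterson's D, D = (ABBA − BABA) / (ABBA + BABA), and a Benjamini-Hochberg adjustment of a list of p-values. It is a minimal illustration in Python, not the Dsuite or R code used in the study; the site-pattern counts and p-values are made up, and the function names are hypothetical.

```python
from typing import List, Sequence


def patterson_d(abba: float, baba: float) -> float:
    """Patterson's D from counts of ABBA and BABA site patterns."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts for one trio and hypothetical p-values for four tests.
print(patterson_d(abba=60, baba=40))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A positive D indicates an excess of ABBA over BABA patterns, and a negative D the reverse; under ILS alone the two patterns are expected to be equally frequent, so D is expected to be close to zero.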

For the purpose of investigating introgression, we treated different lineages of A. fraterculus as separate species when other traits supported that separation, following previous phylogenetic analyses that indicated several clades^13,14,25, referred to here as A. fraterculus clades I through VII. Furthermore, some of these analyses were performed only for taxa for which more than one specimen had been sampled.

Phylogenetic and evolutionary parameters

Several parameters were estimated from the aligned orthologs, using different procedures, to indicate levels of polymorphism, phylogenetic signal, phylogenetic congruence, and introgression, and we used them to contrast each reduced subset with the full dataset. This analysis aimed to investigate whether these parameters were associated with different gene subsets. We estimated transition and transversion evolutionary rates, calculated as the total length of the tree divided by the number of terminals⁷⁷, and their levels of saturation using the saturation test⁷⁸ in PhyKIT⁷⁹. We also used PhyKIT to estimate per-site evolutionary rates, defined as one minus the sum of squared frequencies of the different bases at a given site, and treeness⁸⁰, which measures how well evolutionary distances among specimens are explained by trees rather than networks, with higher values denoting a higher signal-to-noise ratio.
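The definitions above reduce to simple arithmetic, sketched here in Python. This is not PhyKIT code. The first and third functions follow the definitions given in the text; the treeness function assumes the commonly used formulation (sum of internal branch lengths divided by total tree length), which is not spelled out in the text. Branch lengths and the alignment column are hypothetical.

```python
from collections import Counter
from typing import Sequence


def evolutionary_rate(branch_lengths: Sequence[float], n_terminals: int) -> float:
    """Total tree length divided by the number of terminals."""
    if n_terminals <= 0:
        raise ValueError("tree must have at least one terminal")
    return sum(branch_lengths) / n_terminals


def treeness(internal: Sequence[float], terminal: Sequence[float]) -> float:
    """Share of total tree length found on internal branches (assumed definition)."""
    total = sum(internal) + sum(terminal)
    if total == 0:
        raise ValueError("tree has zero length")
    return sum(internal) / total


def site_rate(column: str) -> float:
    """One minus the sum of squared base frequencies at one alignment site."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    n = sum(counts.values())
    if n == 0:
        return 0.0  # column with no unambiguous bases carries no signal
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical four-taxon tree: two internal and four terminal branches.
internal_branches = [0.2, 0.3]
terminal_branches = [0.1, 0.1, 0.1, 0.2]
print(evolutionary_rate(internal_branches + terminal_branches, n_terminals=4))
print(treeness(internal_branches, terminal_branches))
print(site_rate("AACC"))
```

An invariant column gives a per-site rate of zero, while a column split evenly between two bases gives 0.5.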

We also used IQ-TREE v. 2.1.2^64,70 to produce a consensus phylogenetic tree based on gene ML trees. The analysis employed 1,000 bootstrap replicates to assess statistical robustness of the consensus tree. We employed a heuristic search process with 100 clusters for sequence grouping (--scf 100), with Coalescence Hidden Markov Model (CHMM), and the double-forest approach (--df-tree), with the verbose option (--cf-verbose). These analyses were used to estimate gCF (gene concordance factor = gCF_N/gN%), sCF (site concordance factor with an average of 100 quartets = sCF_N/sN%), and sC (number of concordant sites with an average of more than 100 quartets), as well as average coalescent times per node. We estimated Internode Certainty per node per gene according to Kobert, et al.⁴¹ and also the average coalescence time in coalescent units per node using IQ-TREE¹⁶. Individual values were averaged over branches depending on the comparisons performed.
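As a reading aid for internode certainty, the sketch below implements only the basic two-bipartition form, IC = 1 + p1·log2(p1) + p2·log2(p2), where p1 and p2 are the relative frequencies of a branch and of its most prevalent conflicting bipartition among gene trees. The adjustments for partial gene trees described by Kobert et al.⁴¹ are not reproduced here, and the function name and counts are hypothetical.

```python
import math


def internode_certainty(support: int, conflict: int) -> float:
    """Basic internode certainty for one branch.

    support:  number of gene trees containing the branch
    conflict: number of gene trees containing the most prevalent
              conflicting bipartition
    Returns 1.0 when there is no conflict and 0.0 when both are equally frequent.
    """
    if support < 0 or conflict < 0 or support + conflict == 0:
        raise ValueError("counts must be non-negative and not both zero")
    total = support + conflict
    ic = 1.0
    for count in (support, conflict):
        if count > 0:  # the term p * log2(p) tends to zero as p approaches zero
            p = count / total
            ic += p * math.log2(p)
    return ic


# Hypothetical counts: 75 gene trees support the branch, 25 support the main conflict.
print(round(internode_certainty(75, 25), 4))
```

Values near one indicate that almost no gene trees favor the competing bipartition, and values near zero indicate that the branch and its main alternative are about equally frequent.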

We investigated patterns of selection on individual gene regions by estimating the ratio of nonsynonymous to synonymous changes (ω) in clusters of orthologs across branches of phylogenetic trees. The numbers of synonymous (dS) and nonsynonymous (dN) changes and their ratio (dN/dS or ω) reflect patterns of selection on the region, whereby ω > 1 is considered strong evidence of positive selection for amino acid substitutions, while ω ≈ 0 indicates purifying selection⁸¹. This analysis was performed using the branch-site model BUSTED⁸² implemented in Datamonkey⁸³. Additionally, we generated 200 groups of 20 and 37 loci randomly drawn from the whole dataset and from the set of genes selected in a previous study¹³ using a custom Python script (https://github.com/popphylotools/sampling_random_trees). This analysis allowed us to compare the distribution of each parameter (ω, evolutionary rates, saturation, and treeness) across the full dataset, the selected subsets, and the randomly subsampled groups of genes.
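The random subsampling step can be sketched as follows. This is not the custom script linked above, whose internals are not described here; it is a minimal Python illustration that draws 200 sets of loci without replacement within each set. The gene identifiers, function name, and fixed seed are hypothetical.

```python
import random
from typing import List, Sequence


def random_gene_sets(genes: Sequence[str], set_size: int,
                     n_sets: int = 200, seed: int = 1) -> List[List[str]]:
    """Draw n_sets random gene sets, each of set_size distinct genes."""
    pool = list(genes)
    if set_size > len(pool):
        raise ValueError("set_size exceeds the number of available genes")
    rng = random.Random(seed)  # fixed seed so the draws can be reproduced
    return [rng.sample(pool, set_size) for _ in range(n_sets)]


# Hypothetical identifiers standing in for the 3,170 orthologs.
all_genes = [f"ortholog_{i:04d}" for i in range(3170)]
sets_of_20 = random_gene_sets(all_genes, 20)
sets_of_37 = random_gene_sets(all_genes, 37)
print(len(sets_of_20), len(sets_of_37[0]))
```

Each random set can then be summarized with the same parameters as the selected subsets, giving a reference distribution against which the selected subsets are compared.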

Tests of tree compatibility

IQ-TREE v. 2.1.2⁶⁴ was used to test whether the topology produced by each subset was significantly different from the topology inferred from the Anastrepha dataset. We tested whether the maximum likelihood score of each subset was significantly different from the score of the full dataset using the Shimodaira-Hasegawa test⁸⁴, expected likelihood weights⁸⁵, and the approximately unbiased test⁸⁶, all performed in IQ-TREE v. 2.1.2⁶⁴. A visual comparison of the topology produced by the full Anastrepha dataset and each subset topology was performed in R⁷⁶ using the function cophylo in phytools⁸⁷ to create co-phylogenetic plots and identify incongruences between the topology produced by the full dataset and that of each subset.

Gene mapping

We used Exonerate⁸⁸ to align one sequence for each ortholog cluster against the genome of A. ludens⁸⁹ (GenBank: GCA_028408465.1) using default parameters and the est2genome model to generate a physical map indicating the location of genes on chromosomes. To do so, we chose sequences from A. ludens, whenever available, otherwise we used sequences of A. distincta, which is closely related to A. ludens. The interactive mapping visualization across chromosomes was performed using ChromoMap v0.4.1⁹⁰. These analyses were performed on the full set of markers, as well as on the phylogenetically reduced sets of 20 and 37 markers.

Data availability

Accession codes are presented in Supplementary Table S2.

References

Norrbom, A. L. et al. Anastrepha and Toxotrypana: descriptions, illustrations, and interactive keys, http://delta-intkey.com/ (2012).
Norrbom, A. L., Zucchi, R. A. & Hernández-Ortiz, V. in Fruit Flies (Tephritidae): Phylogeny and Evolution of Behavior (eds Aluja, M. & Norrbom, A. L.) 299–342 (CRC, 2000).
Rodriguez, E. et al. Exceptional larval morphology of nine species of the Anastrepha mucronota species group (Diptera, Tephritidae). ZooKeys 1127 https://doi.org/10.3897/zookeys.1127.84628 (2022).
Rodriguez, E. J. et al. Description of larvae of Anastrepha amplidentata and Anastrepha durantae with review of larval morphology of the fraterculus group (Diptera: Tephritidae). Proc. Entomol. Soc. Wash. 123, 169–189 (2021).
Dutra, V. S., Ronchi-Teles, B., Steck, G. J. & Silva, J. G. Egg morphology of Anastrepha spp. (Diptera: Tephritidae) in the fraterculus group using scanning electron microscopy. Ann. Entomol. Soc. Am. 104, 16–24. https://doi.org/10.1603/AN10105 (2011).
Dias, V. S. et al. An integrative multidisciplinary approach to understanding cryptic divergence in Brazilian species of the Anastrepha fraterculus complex (Diptera: Tephritidae). Biol. J. Linn. Soc. 117, 725–746. https://doi.org/10.1111/bij.12712 (2016).
Silva, J. G. & Barr, N. B. in Proc 7th Int Symp Fruit Flies Econ Importance 10–15 Sept 2006, Salvador Brazil (eds R. L. Sugayama, Roberto A. Zucchi, S. M. Ovruski, & John Sivinski) 13–28 (Color Press, 2008).
Abraham, S., Cladera, J., Goane, L. & Vera, M. T. Factors affecting Anastrepha fraterculus female receptivity modulation by accessory gland products. J. Insect Physiol. 58, 1–6 (2012).
Barr, N. B., Cui, L. W. & Mcpheron, B. A. Molecular systematics of nuclear gene period in genus Anastrepha (Tephritidae). Ann. Entomol. Soc. Am. 98, 173–180 (2005).
Sobrinho, I. S. & de Brito, R. A. Positive and purifying selection influence the evolution of Doublesex in the Anastrepha fraterculus species group. PLoS ONE. 7, e33446. https://doi.org/10.1371/journal.pone.0033446 (2012).
Scally, M. et al. Resolution of inter and intra-species relationships of the West Indian fruit fly Anastrepha obliqua. Mol. Phylogen Evol. 101, 286–293. https://doi.org/10.1016/j.ympev.2016.04.020 (2016).
Mengual, X. et al. Phylogenetic relationships of the tribe Toxotrypanini (Diptera: Tephritidae) based on molecular characters. Mol. Phylogen Evol. 113, 84–112. https://doi.org/10.1016/j.ympev.2017.05.011 (2017).
Congrains, C. et al. Phylogenomic analysis provides diagnostic tools for identification of species complex. Evol. Appl. 16, 1598–1618. https://doi.org/10.1111/eva.13589 (2023).
Congrains, C., Zucchi, R. A. & de Brito, R. A. Phylogenomic approach reveals strong signatures of introgression in the rapid diversification of Neotropical true fruit flies (Anastrepha: Tephritidae). Mol. Phylogen Evol. 162, 107200. https://doi.org/10.1016/j.ympev.2021.107200 (2021).
Díaz, F. et al. Evidence for introgression among three species of the Anastrepha fraterculus group, a radiating species complex of fruit flies. Front. Genet. 9 https://doi.org/10.3389/fgene.2018.00359 (2018).
Lanfear, R. & Hahn, M. W. The meaning and measure of concordance factors in phylogenomics. Mol. Biol. Evol. 41 https://doi.org/10.1093/molbev/msae214 (2024).
Hibbins, M. S. & Hahn, M. W. Phylogenomic approaches to detecting and characterizing introgression. Genetics 220 https://doi.org/10.1093/genetics/iyab173 (2021).
Suvorov, A. et al. Widespread introgression across a phylogeny of 155 Drosophila genomes. Curr. Biol. 32, 111–123e115. https://doi.org/10.1016/j.cub.2021.10.052 (2022).
Noguerales, V. & Ortego, J. Genomic evidence of speciation by fusion in a recent radiation of grasshoppers. Evolution 76, 2618–2633. https://doi.org/10.1111/evo.14508 (2022).
Árnason, Ú., Lammers, F., Kumar, V., Nilsson, M. A. & Janke, A. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Sci. Adv. 4, eaap9873. https://doi.org/10.1126/sciadv.aap9873 (2018).
Thawornwattana, Y., Seixas, F. A., Yang, Z. & Mallet, J. Full-Likelihood genomic analysis clarifies a complex history of species divergence and introgression: the example of the erato-sara group of Heliconius butterflies. Syst. Biol. syac009. https://doi.org/10.1093/sysbio/syac009 (2022).
Edelman, N. B. & Mallet, J. Prevalence and adaptive impact of introgression. Annu. Rev. Genet. 55, 265–283. https://doi.org/10.1146/annurev-genet-021821-020805 (2021).
Daubin, V., Moran, N. A. & Ochman, H. Phylogenetics and the cohesion of bacterial genomes. Science 301, 829–832. https://doi.org/10.1126/science.1086568 (2003).
Selivon, D. et al. Genetical, morphological, behavioral, and ecological traits support the existence of three Brazilian species of the Anastrepha fraterculus complex of cryptic species. Front. Ecol. Evol. 10 https://doi.org/10.3389/fevo.2022.836608 (2022).
Hernández-Ortiz, V., Canal, A., Salas, N. T., Ruíz-Hurtado, J. O. & Dzul-Cauich, J. F. F. M. Taxonomy and phenotypic relationships of the Anastrepha fraterculus complex in the Mesoamerican and Pacific Neotropical dominions (Diptera, Tephritidae). ZooKeys 540, 95–124, (2015). https://doi.org/10.3897/zookeys.540.6027
Hendrichs, J., Vera, M. T., De Meyer, M. & Clarke, A. R. Resolving cryptic species complexes of major tephritid pests. ZooKeys, 5–39, (2015). https://doi.org/10.3897/zookeys.540.9656
Vaníčková, L. et al. Current knowledge of the species complex Anastrepha fraterculus (Diptera, Tephritidae) in Brazil. ZooKeys, 211–237, (2015). https://doi.org/10.3897/zookeys.540.9791
Chen, M. Y., Liang, D. & Zhang, P. Selecting question-specific genes to reduce incongruence in phylogenomics: a case study of jawed vertebrate backbone phylogeny. Syst. Biol. 64, 1104–1120 (2015).
Edwards, S. V. Phylogenomic subsampling: a brief review. Zool. Scr. 45, 63–74 (2016).
Molloy, E. K. & Warnow, T. To include or not to include: the impact of gene filtering on species tree estimation methods. Syst. Biol. 67, 285–303. https://doi.org/10.1093/sysbio/syx077 (2018).
Doyle, V. P., Young, R. E., Naylor, G. J. P. & Brown, J. M. Can we identify genes with increased phylogenetic reliability? Syst. Biol. 64, 824–837 (2015).
Mongiardino Koch, N. Phylogenomic subsampling and the search for phylogenetically reliable loci. Mol. Biol. Evol. 38, 4025–4038. https://doi.org/10.1093/molbev/msab151 (2021).
Fernández, R., Edgecombe, G. D. & Giribet, G. Exploring phylogenetic relationships within Myriapoda and the effects of matrix composition and occupancy on phylogenomic reconstruction. Syst. Biol. 65, 871–889 (2016).
Sharma, P. P. et al. Phylogenomic interrogation of Arachnida reveals systemic conflicts in phylogenetic signal. Mol. Biol. Evol. 31, 2963–2984 (2014).
Steenwyk, J. L., Li, Y., Zhou, X., Shen, X. X. & Rokas, A. Incongruence in the phylogenomics era. Nat. Rev. Genet. https://doi.org/10.1038/s41576-023-00620-x (2023).
Moran, B. M. et al. The genomic consequences of hybridization. eLife 10, e69016, (2021). https://doi.org/10.7554/eLife.69016
Cummins, C. A. & McInerney, J. O. A method for inferring the rate of evolution of homologous characters that can potentially improve phylogenetic inference, resolve deep divergence and correct systematic biases. Syst. Biol. 60, 833–844 (2011).
Li, X. et al. Phylogenomics reveals accelerated Late Cretaceous diversification of bee flies (Diptera: Bombyliidae). Cladistics 37, 276–297 (2021).
Townsend, J. P., Su, Z. & Tekle, Y. I. Phylogenetic signal and noise: predicting the power of a data set to resolve phylogeny. Syst. Biol. 61, 835 (2012).
Alda, F. et al. Resolving deep nodes in an ancient radiation of Neotropical fishes in the presence of conflicting signals from incomplete lineage sorting. Syst. Biol. 68, 573–593 (2019).
Kobert, K., Salichos, L., Rokas, A. & Stamatakis, A. Computing the internode certainty and related measures from partial gene trees. Mol. Biol. Evol. 33, 1606–1617. https://doi.org/10.1093/molbev/msw040 (2016).
Steck, G. J. et al. in Area-wide Management of Fruit Fly Pests. 57–88 (eds Pérez-Staples, D., Díaz‐Fleischer, F., Montoya, J. P. & Vera, M. T.) (CRC, 2019).
Schutze, M. K., Virgilio, M., Norrbom, A. & Clarke, A. R. Tephritid integrative taxonomy: where we are now, with a focus on the resolution of three tropical fruit fly species complexes. Annu. Rev. Entomol. 62, 147–164. https://doi.org/10.1146/annurev-ento-031616-035518 (2017).
Smith, S. A., Brown, J. W. & Walker, J. F. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PLoS ONE. 13, e0197433. https://doi.org/10.1371/journal.pone.0197433 (2018).
He, R. et al. Phylogenomic analysis and molecular identification of true fruit flies. Front. Genet. 15 https://doi.org/10.3389/fgene.2024.1414074 (2024).
Koonin, E. V. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39, 309–338. https://doi.org/10.1146/annurev.genet.39.073003.114725 (2005).
Lafond, M., Meghdari Miardan, M. & Sankoff, D. Accurate prediction of orthologs in the presence of divergence after duplication. Bioinformatics 34, i366–i375. https://doi.org/10.1093/bioinformatics/bty242 (2018).
Mirarab, S., Nakhleh, L. & Warnow, T. Multispecies coalescent: theory and applications in phylogenetics. Ann. Rev. Ecol. Evol. Syst. 52, 247–268. https://doi.org/10.1146/annurev-ecolsys-012121-095340 (2021).
Edwards, S. V. et al. Implementing and testing the multispecies coalescent model: A valuable paradigm for phylogenomics. Mol. Phylogen Evol. 94, 447–462. https://doi.org/10.1016/j.ympev.2015.10.027 (2016).
Smith-Caldas, M. R. B., Mcpheron, B. A., Silva, J. G. & Zucchi, R. A. Phylogenetic relationship among species of the fraterculus group (Anastrepha: Diptera: Tephritidae) inferred from DNA sequences of mitochondrial cytochrome oxidase I gene. Neotrop. Entomol. 30, 565–573 (2001).
Prezotto, L. F., Perondini, A. L. P., Hernández-Ortiz, V., Frías, D. & Selivon, D. What can integrated analysis of morphological and genetic data still reveal about the Anastrepha fraterculus (Diptera: Tephritidae) cryptic species complex? Insects 10 (408). https://doi.org/10.3390/insects10110408 (2019).
Barr, N. B. et al. Identifying Anastrepha (Diptera; Tephritidae) species using DNA barcodes. J. Econ. Entomol. 111, 405–421. https://doi.org/10.1093/jee/tox300 (2017).
Sutton, B. D. et al. Nuclear ribosomal internal transcribed spacer 1 (ITS1) variation in the Anastrepha fraterculus cryptic species complex (Diptera, Tephritidae) of the Andean region. ZooKeys 175–191 https://doi.org/10.3897/zookeys.540.6147 (2015).
Zucchi, R. A. in Moscas-das-frutas no Brasil (ed H.M.L. Souza) 1–10 (Anais Fundação Cargil, 1988).
Zucchi, R. A. in Moscas-das-frutas De Importância Econômica No Brasil: Conhecimento Básico E Aplicado. 41–48 (eds Malavasi, A. & Zucchi, R. A.) (Holos Editora, 2000).
Egan, S. P. et al. Experimental evidence of genome-wide impact of ecological selection during early stages of speciation-with-gene-flow. Ecol. Lett. 18, 817–825. https://doi.org/10.1111/ele.12460 (2015).
Feder, J. L. et al. Genome-Wide congealing and rapid transitions across the speciation continuum during speciation with gene flow. J. Hered. 105, 810–820. https://doi.org/10.1093/jhered/esu038 (2014).
Hedrick, P. W. Adaptive introgression in animals: examples and comparison to new mutation and standing variation as sources of adaptive variation. Mol. Ecol. 22 (2013).
Harrison, R. G. & Larson, E. L. Hybridization, introgression, and the nature of species boundaries. J. Hered. 105, 795–809. https://doi.org/10.1093/jhered/esu033 (2014).
Aluja, M. Bionomics and management of Anastrepha. Annu. Rev. Entomol. 39, 155–178 (1994).
Dupuis, J. R. et al. HiMAP: robust phylogenomics from highly multiplexed amplicon sequencing. Mol. Ecol. Resour. 18, 1000–1019. https://doi.org/10.1111/1755-0998.12783 (2018).
Darriba, D. et al. ModelTest-NG: A new and scalable tool for the selection of DNA and protein evolutionary models. Mol. Biol. Evol. 37, 291–294. https://doi.org/10.1093/molbev/msz189 (2019).
Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455. https://doi.org/10.1093/bioinformatics/btz305 (2019).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534. https://doi.org/10.1093/molbev/msaa015 (2020).
Zhang, C., Rabiee, M., Sayyari, E. & Mirarab, S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinform. 19, 153. https://doi.org/10.1186/s12859-018-2129-y (2018).
Minh, B. Q., Hahn, M. W. & Lanfear, R. New methods to calculate concordance factors for phylogenomic datasets. Mol. Biol. Evol. 37, 2727–2733. https://doi.org/10.1093/molbev/msaa106 (2020).
Sayyari, E. & Mirarab, S. Fast Coalescent-Based computation of local branch support from quartet frequencies. Mol. Biol. Evol. 33, 1654–1668. https://doi.org/10.1093/molbev/msw079 (2016).
Smith, S. A., Moore, M. J., Brown, J. W. & Yang, Y. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evol. Biol. 15, 150. https://doi.org/10.1186/s12862-015-0423-0 (2015).
Lemoine, F. & Gascuel, O. Gotree/Goalign: toolkit and go API to facilitate the development of phylogenetic workflows. NAR Genomics Bioinf. 3 https://doi.org/10.1093/nargab/lqab075 (2021).
Mo, Y. K., Lanfear, R., Hahn, M. W. & Minh, B. Q. Updated site concordance factors minimize effects of homoplasy and taxon sampling. Bioinformatics 39, btac741. https://doi.org/10.1093/bioinformatics/btac741 (2022).
Nei, M. & Kumar, S. Molecular Evolution and Phylogenetics (Oxford University Press, 2000).
San Jose, M. et al. Interspecific gene flow obscures phylogenetic relationships in an important insect pest species complex. Mol. Phylogenet. Evol. 188, 107892. https://doi.org/10.1016/j.ympev.2023.107892 (2023).
Zhang, Y. et al. Phylogenomic resolution of the Ceratitis FARQ complex (Diptera: Tephritidae). Mol. Phylogenet. Evol. 161, 107160. https://doi.org/10.1016/j.ympev.2021.107160 (2021).
Malinsky, M., Matschiner, M. & Svardal, H. Dsuite - Fast D-statistics and related admixture evidence from VCF files. Mol. Ecol. Resour. 21, 584–595. https://doi.org/10.1111/1755-0998.13265 (2021).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093. https://doi.org/10.1534/genetics.112.145037 (2012).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2023).
Telford, M. J. et al. Phylogenomic analysis of echinoderm class relationships supports Asterozoa. Proc. Royal Soc. Lond. B. 281, 20140479. https://doi.org/10.1098/rspb.2014.0479 (2014).
Philippe, H. et al. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 9, e1000602. https://doi.org/10.1371/journal.pbio.1000602 (2011).
Steenwyk, J. L. et al. PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Bioinformatics 37, 2325–2331. https://doi.org/10.1093/bioinformatics/btab096 (2021).
Phillips, M. J. & Penny, D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol. Phylogenet. Evol. 28, 171–185. https://doi.org/10.1016/S1055-7903(03)00057-5 (2003).
Yang, Z. & Bielawski, J. P. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15, 496–503 (2000).
Murrell, B. et al. Gene-wide identification of episodic selection. Mol. Biol. Evol. 32, 1365–1371. https://doi.org/10.1093/molbev/msv035 (2015).
Weaver, S. et al. Datamonkey 2.0: A modern web application for characterizing selective and other evolutionary processes. Mol. Biol. Evol. 35, 773–777. https://doi.org/10.1093/molbev/msx335 (2018).
Shimodaira, H. & Hasegawa, M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16, 1114–1116. https://doi.org/10.1093/oxfordjournals.molbev.a026201 (1999).
Strimmer, K. & Rambaut, A. Inferring confidence sets of possibly misspecified gene trees. Proc. Royal Soc. Lond. B. 269, 137–142. https://doi.org/10.1098/rspb.2001.1862 (2002).
Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508. https://doi.org/10.1080/10635150290069913 (2002).
Revell, L. J. phytools 2.0: an updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ 12, e16505. https://doi.org/10.7717/peerj.16505 (2024).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 6, 31 (2005).
Congrains, C. et al. Chromosome-scale genome of the polyphagous pest Anastrepha ludens (Diptera: Tephritidae) provides insights on sex chromosome evolution in Anastrepha. G3 Genes|Genomes|Genetics. https://doi.org/10.1093/g3journal/jkae239 (2024).
Pfeifer, B. & Kapan, D. D. Estimates of introgression as a function of pairwise distances. BMC Bioinform. 20, 207. https://doi.org/10.1186/s12859-019-2747-z (2019).

Acknowledgements

RAB has been supported by grants #2018/06611-5 and #2022/12583-0 (FAPESP, Brazil) and FAPESP/CNPq INCT #14/50940-2. EMS held a graduate-level fellowship from CAPES (Coordination for the Improvement of Higher Education Personnel). PDF has been supported by the National Council for Scientific and Technological Development (CNPq #309113/2025-3). This research was supported by USDA-NIFA AFRI-FAS (2020-67013-30978) and USDA-NIFA HATCH (Project KY008091) grants to JRD, and by the United States Department of Agriculture (USDA) Plant Protection Act 7721 through a cooperative agreement with the University of Hawaii at Manoa, College of Tropical Agriculture and Human Resources (8130-0565-CA) to CC. Members of the Laboratory of Evolutionary Biology at UFSCar read a previous version of this manuscript and provided valuable input.

Author information

Authors and Affiliations

Departamento de Genética e Evolução, Universidade Federal de São Carlos, São Carlos, SP, Brasil
Reinaldo A. de Brito, Edyane Moraes dos Santos & Patricia Domingues de Freitas
Department of Entomology, University of Kentucky, Lexington, KY, 40546, USA
Julian R. Dupuis
Department of Plant and Environmental Protection Services, University of Hawaii at Manoa, Honolulu, HI, 96822, USA
Carlos Congrains

Authors

Reinaldo A. de Brito
Edyane Moraes dos Santos
Patricia Domingues de Freitas
Julian R. Dupuis
Carlos Congrains

Contributions

R.A.B., E.M.S., J.R.D. and C.C. conceived the experiment(s); E.M.S. generated the new datasets; R.A.B., E.M.S. and C.C. analyzed the results. All authors contributed to and reviewed the manuscript.

Corresponding author

Correspondence to Reinaldo A. de Brito.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

de Brito, R.A., dos Santos, E.M., de Freitas, P.D. et al. Identifying sets of phylogenetically informative markers for Anastrepha (Diptera: Tephritidae). Sci Rep 15, 34258 (2025). https://doi.org/10.1038/s41598-025-16399-2

Received: 31 March 2025
Accepted: 14 August 2025
Published: 01 October 2025
DOI: https://doi.org/10.1038/s41598-025-16399-2