Introduction

Species borders are important to identify and yet difficult to define1,2,3. Because bacteria are asexual organisms, they do not easily adhere to the species concepts developed for sexual organisms, i.e., the biological species concept (BSC) and related definitions4,5. Yet, multiple studies have reported that bacteria can be clustered into cohesive genetic entities and that they do not simply form a continuum of more or less diverged genomes6,7,8,9. A growing body of evidence supports the idea that the vast majority of bacteria do engage in gene flow and that this mechanism may be responsible for the cohesiveness of bacterial species10,11,12,13,14,15,16. Through diverse routes and vessels, bacterial cells are capable of exchanging genetic material, and the transferred DNA may be related enough to sequences of the recipient genome to be swapped through homologous recombination. On the other hand, other transfer events, termed horizontal gene transfers (HGTs), do not require sequence relatedness or can rely on microhomologies for DNA integration, and can introduce a new gene (or a new gene variant) at a different location in the genome, essentially adding to the existing genetic repertoire without replacing the original variant at that position17,18,19. From an evolutionary point of view, homologous recombination promotes the exchange of alleles between highly related homologous sequences, and while bacteria do not undergo sexual reproduction, this process is comparable to gene flow in sexual organisms. In contrast, HGT can lead to the gain of new genes previously absent in the recipient genome and can occur between more distantly related (or unrelated) sequences17,18,19. For the purpose of species definition, we and others have focused primarily on the patterns of gene flow (i.e., homologous recombination) as a force maintaining the genetic cohesiveness of bacterial species12,14,16,20,21. However, gene flow can also have the opposite effect: blurring species boundaries.

Earlier studies relying on multi-locus sequence typing (MLST) first observed clear discrepancies across sequence markers while attempting to classify and type bacterial strains of several lineages22,23,24,25,26,27. Notoriously, strains of the genus Neisseria were found to form “fuzzy” species due to the recombinogenic nature of this lineage26,28,29. Additional cases of species fuzziness were reported in subsequent MLST studies23,24,27, but the advent of full genome sequencing solved the issue of incongruent gene marker phylogenies by building phylogenetic consensuses across hundreds or thousands of genes30,31,32. Nevertheless, these studies provided some of the first evidence that gene flow can be somewhat porous across bacteria and that bacterial species borders may be fuzzy in some lineages33,34.

Gene flow is pervasive between closely related genomes of most species and rarely occurs between genomes showing more than 2–10% nucleotide divergence16. This restriction is thought to reflect the mechanistic constraints of the homologous recombination machinery, which requires the presence of stretches of identical nucleotides shared between the recipient strand and the incoming strand35. However, gene flow can occasionally occur between more distantly related genomes16,26,28,36,37. Such patterns were clearly detailed in a study comparing Campylobacter coli and Campylobacter jejuni, where a significant portion (~20%) of the genome of these two species appears to be engaging in gene flow, while the rest of the genome is substantially diverged (~85% sequence identity)37. Importantly, these exchanges of DNA occur through allelic replacements in the genomic backbone of these species (i.e., the core genome) and do not occur through the transfers and gains of accessory genes via HGT, which are frequently observed between distant lineages. Such patterns of gene flow between distinct species are reminiscent of introgression in sexual organisms. Here, we use the term ‘introgression’ to describe gene flow events between distinct bacterial species that result in the exchange of a homologous DNA fragment to the recipient genome. While ‘introgression’ is traditionally associated with meiotic recombination and hybridization in sexually reproducing organisms, we apply the term here to emphasize gene flow between species and between core genomes, as previously defined in bacteria37. This process is likely responsible for the species fuzziness observed in some bacterial lineages, but the prevalence and the impact of introgression on the delimitation of bacterial species have not been systematically assessed.

Here, we characterized the patterns of introgression across 50 major bacterial lineages. Our results indicate that introgression is common in core genomes across bacteria. However, introgression tends to be highly overestimated in several species because the species borders are inaccurately defined or the definition does not apply. Among the analyzed lineages, the genera EscherichiaShigella and Cronobacter presented the highest levels of introgression. Overall, our analysis indicates that bacterial species present various levels of fuzziness across lineages, and these could correspond to ongoing speciation events. However, most species appear clearly delineated based on core genome phylogenies.

Results

Prevalence of introgression across bacterial lineages

We first quantified the amount of DNA exchanged by gene flow between the core genomes of distinct species (i.e., introgression). We built and analyzed the core genome of 50 bacterial genera and classified all genomes within each genus into ANI-species based on the pairwise ANI of core genomes using a cutoff of 94–96% sequence identity as typically defined38. We generated a maximum-likelihood phylogenomic tree for each genus using the concatenated core genome alignments. The trees segregated the vast majority of the ANI-species into monophyletic groups (i.e., phylogenetic species) (Supplementary Data 1). To make a distinction between gene flow within species and gene flow between species, we refer to gene flow between species as introgression to parallel the processes described in sexual organisms. To assess the prevalence of the patterns of introgression between ANI-species within each of the 50 genera included in this study, we used an approach based on phylogeny and sequence relatedness to detect and quantify introgression (see “Methods” section). Briefly, for a given genus, we inferred introgression events based on the phylogenetic incongruency between gene trees and the core genome tree in a particular ANI-species. A gene sequence was inferred as introgressed between two ANI-species when forming a monophyletic clade that was inconsistent with the unrooted core genome phylogeny (Fig. 1A). To be categorized as introgressed, a core gene sequence was also required to be statistically more similar to the sequence of a different ANI-species than at least one sequence of the genomes of its own species (see “Methods” section). For each ANI-species, levels of introgression were expressed as the fraction of core genes that satisfied these two criteria (Fig. 1).
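The two steps described above (grouping genomes into ANI-species, then expressing introgression as a fraction of core genes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the single-linkage grouping rule, the fixed 95% cutoff, and the toy values are hypothetical and are not taken from the pipeline described in the “Methods” section.

```python
from itertools import combinations

def ani_species(genomes, ani, cutoff=95.0):
    """Group genomes by single linkage: two genomes end up in the same
    cluster when a chain of pairwise ANI values >= cutoff connects them.
    (Single linkage and the 95% default are assumptions of this sketch.)"""
    parent = {g: g for g in genomes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        if ani[frozenset((a, b))] >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), set()).add(g)
    return sorted(sorted(c) for c in clusters.values())

def introgression_level(gene_flags):
    """Fraction of core genes meeting both criteria: (1) gene tree
    incongruent with the core genome tree, and (2) sequence significantly
    more similar to another species than to its own."""
    if not gene_flags:
        return 0.0
    hits = sum(1 for incongruent, more_similar in gene_flags
               if incongruent and more_similar)
    return hits / len(gene_flags)

# Toy pairwise ANI values (hypothetical).
genomes = ["g1", "g2", "g3", "g4"]
ani = {
    frozenset(("g1", "g2")): 98.7,
    frozenset(("g1", "g3")): 91.0,
    frozenset(("g1", "g4")): 90.5,
    frozenset(("g2", "g3")): 91.2,
    frozenset(("g2", "g4")): 90.8,
    frozenset(("g3", "g4")): 96.4,
}
species = ani_species(genomes, ani)

# One (incongruent, more_similar) pair of flags per core gene (hypothetical).
flags = [(True, True), (True, False), (False, False), (False, True)]
level = introgression_level(flags)
```

In this toy case only one of the four genes satisfies both criteria; a gene that is incongruent but not significantly more similar to the other species is not counted.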

Fig. 1: Approach used to infer introgression across lineages.

Genomes within the same genus were considered part of the same species (ANI-species) when sharing at least 94–96% average nucleotide identity (ANI) across the core genome. A Phylogenetic trees were inferred from the concatenated core genome alignment of each genus. B, C Phylogenetic trees were inferred for each core gene of each genus. Introgression events were inferred based on the phylogenetic incongruency between gene trees and the core genome tree in a particular ANI-species. In the example shown in this figure, the topology of tree ‘A’ is congruent with that of tree ‘B’ in all ANI-species. In contrast, the topology of the unrooted tree ‘C’ is incongruent with the topology of the core genome tree ‘A’ for ANI-species 1 and 2. This case represents a potential introgression event from ANI-species 1 to ANI-species 2. D Putative introgression events detected based on tree incongruencies were then retained as introgression events if the relatedness of the introgressed sequence was statistically higher than the relatedness of the core genome of the two species (>2 mean S.D., see “Methods” section).

Using this approach, we observed that the genera studied here present various levels of introgression, with an average of 8.13% (median 2.76%) of introgressed genes across genera, suggesting that this process is rather common in bacteria (main text Fig. 2A, Supplementary Fig. 2, and Supplementary Data 1 and 2).

Fig. 2: Levels of introgression across species and genera.

A Levels of introgression were expressed as the percentage of introgressed core genes between all pairs of species within each genus (2043 species within 41–47 genera). The percentage of introgressed core genes of each genus was estimated by classifying species into ANI-species (red) or by classifying species into BSC-species (green). ANI-species were defined using a cut-off of ≥95% along the core genome. BSC-species were defined as groups of genomes showing significant levels of gene flow based on the signal of homoplasic alleles (see Methods). B Boxplot of the percentage of introgressed core genes across pairs of BSC-species within each genus. Box plots indicate the distribution of values across 2043 species within 47 genera. The center line represents the median, the bounds of the box indicate the 25th and 75th percentiles (interquartile range, IQR), and the whiskers extend to the most extreme data points within 1.5 × IQR of the box. Data points beyond the whiskers represent outliers.

The previous results relied on the ANI as an empirical approach to define species. However, ANI-based methods are not anchored in a solid theoretical framework, and we have previously suggested that gene flow could be used to refine ANI-species borders following similar principles as in the BSC (i.e., BSC-species)16. Therefore, the levels of introgression estimated above may be inflated by the inaccurate identification of species borders (i.e., if two ANI-species in fact represent a single BSC-species). Indeed, our results indicate that most of the introgression events occurred between closely related or sister ANI-species based on both the core genome phylogenies and the relatedness of the core genomes (Supplementary Figs. 1 and 3a, b, and Supplementary Data 3), and it is therefore very likely that slightly adjusting species borders would yield different estimates of introgression levels. To address this issue, we refined the borders of ANI-species based on the patterns of gene flow to generate BSC-species, using the signal of homoplasic alleles relative to non-homoplasic alleles (h/m), as previously described in ref. 16 (and see “Methods” section). Following this approach, we found that most of the closely related ANI-species that shared high levels of introgression were in fact classified into a single BSC-species (Supplementary Data 3 and 4). In most cases, BSC-species would match ANI-species borders by slightly adjusting the ANI threshold used to define species borders (note, however, that we cannot predict exactly how the ANI threshold needs to be adjusted, as it appears to be lineage- or species-specific). For instance, Streptococcus parasanguinis ANI-sp32 presented 33.2% of its core genome introgressed with S. parasanguinis ANI-sp67, but our analyses revealed that these two species form, in fact, a single BSC-species based on the signal of gene flow (these ANI-species were under the ANI threshold used to define ANI-species).
Another example is the genus Pseudomonas, in which some ANI-species shared higher levels of introgression, but our analyses indicate that these ANI-species are part of the same BSC-species (BSC-sp155 and BSC-sp166, respectively) (Supplementary Fig. 1aj and Supplementary Data 24). ANI-sp155, which is composed of P. fragi strains, presents ~35% of its core genome introgressed with ANI-sp107, which is composed of misclassified P. fragi strains (ANI-sp107 represented 9 misclassified genomes sharing 99.5–99.6% sequence identity and submitted as P. fragi on NCBI). However, these two ANI-species are highly related, sharing between 94.3% and 94.6% sequence identity, which is slightly under the threshold of species definition. These results indicate that, in many cases, the higher levels of introgression detected between ANI-species do not correspond to true introgression events once species are redefined based on the signal of gene flow, but rather correspond to frequent homologous recombination events among genomes within the same (biological) species.

Fig. 3: Fraction of core gene sequence length impacted by introgression.

The distribution of the fraction of the gene sequences affected by introgression was inferred using the introgression score (Si). Si was computed for recent introgression events. Introgressed regions were defined as 100 bp fragments that were more similar to the candidate species than at least one of the genomes of the reference species (each pair of BSC-species). Only introgressed fragments presenting 100% sequence identity between the candidate and reference species were considered as introgressed. We normalized the introgression values for each gene sequence by the sequence length.

After redefining ANI-species into BSC-species (Supplementary Data 3), we focused our analysis on the pairs of BSC-species where both BSC-species are composed of at least 15 genomes, to further ascertain that these species are truly distinct (low genome numbers affect our ability to reclassify species based on gene flow). Our results revealed that introgression impacts, on average, 2.16% (median 1.2%) of the core genome of bacterial species. Across genera, introgression ranged from 0% in seven genera to 13.98% (median 14.90%) in EscherichiaShigella (main text Fig. 2A, Supplementary Fig. 2, and Supplementary Data 1 and 2). Within each genus, BSC-species displayed various levels of introgression, where 0–29.5% of core genes were found to be introgressed between BSC-species (Fig. 2B and Supplementary Data 2). These results indicate that a few bacterial species are very fuzzy even after redefining species based on gene flow. Among the analyzed lineages, EscherichiaShigella and Cronobacter presented the highest levels of introgression (main text Fig. 2A, Supplementary Fig. 2, and Supplementary Data 1 and 2) with 13.98% (median 14.9%) and 10.78% (median 14.02%) of their core genome introgressed, respectively.

Because the number of core genes varies substantially across lineages, it is possible that our estimates of introgression are inaccurate in some lineages. However, no correlation was observed between the size of the core genome and the prevalence of introgression across genera (Spearman’s ρ = −0.07, P = 0.66, Supplementary Fig. 4a). If some genera were to present more closely related species than others, this could potentially lead to an overestimation of introgression events in these genera relative to others. We tested whether this was the case by comparing the average sequence identity between species pairs of each genus with their levels of introgression. Overall, we did not find a clear correlation between these two metrics: no significant correlation was observed between the species relatedness of each genus and their introgression levels when analyzing BSC-species (ρ = 0.07, P = 0.63). A weak positive correlation was observed when focusing on ANI-species (ρ = 0.32, P = 0.02) (Supplementary Fig. 4b, c). This last correlation likely arises because species misclassified by ANI thresholds are systematically closely related and therefore yield a signal of high levels of introgression. We also conducted simulations to assess the impact of sub-sampling imposed by the differences in core genome size, and we did not observe any substantial biases in our estimates of introgression (see Supplementary Data 5 and 6). We further reconstructed the core genome for eight pairs of BSC species that displayed high levels of introgression. Rebuilding the core genome for species pairs substantially increased the number of core genes that could be analyzed (although introgression events could only be inferred for each species pair), but these re-estimates of introgression remained similar when computed on these extended core genomes (Supplementary Fig. 5).
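For readers who want to see what the rank correlations above measure, the following Python sketch computes Spearman’s ρ as the Pearson correlation of average ranks. The helper names and the per-genus values are hypothetical and serve only to illustrate the statistic; they are not data from this study, and the sketch does not compute the associated P-values.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-genus values: core genome size and % introgressed genes.
core_genome_size = [500, 1200, 800, 300]
introgression_pct = [2.0, 1.0, 5.0, 0.5]
rho = spearman_rho(core_genome_size, introgression_pct)
```

Because the statistic depends only on ranks, it is insensitive to the large differences in scale between core genome sizes and introgression percentages.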

Our inference of introgression could theoretically be impacted by the number of genomes available for each genus, since additional cases of introgression could be uncovered as additional genomes are available. However, we did not find a significant correlation between the median number of introgressed genes and the total number of genomes available per genus (Rho = 0.08, P = 0.60, Supplementary Fig. 6). This suggests that the number of available genomes across genera does not substantially drive our estimates of introgressed genes. Finally, because individual gene trees can be poorly supported due to the lack of signal, we assessed whether the nodes supporting introgression displayed lower bootstrap supports than those supporting the monophyly of species. Overall, we observed very similar distributions of bootstrap supports for both categories of nodes, where >80% of the nodes displayed bootstrap values of 90 or higher (Supplementary Fig. 7).

Because introgression is mediated by homologous recombination, it is very likely that these sequences are only partially exchanged between species. The fractions of gene sequences impacted by introgression were analyzed by comparing sequence identity between species and within species as previously described in ref. 16. We found that introgressed core genes had, on average, 20% of their sequence recombined, and only 36% of introgressed core genes had more than half of their sequence recombined (Fig. 3). Few of the introgressed core genes presented evidence of recombination of their entire sequence (Fig. 3), indicating that introgression events mostly occur through the replacement of short DNA fragments, as previously described for the mechanism of homologous recombination39,40,41. Note that our approach likely underestimates the fraction of recombined gene sequence, as it likely misses older introgression events. Nevertheless, we observed a similar trend when using different identity thresholds to define introgressed fragments (Supplementary Fig. 8).
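The window-based idea behind this analysis (100 bp fragments that are identical to the candidate species and more similar to it than to at least one genome of the same species, normalized by gene length, as stated in the Fig. 3 legend) can be illustrated with the following Python sketch. It is a simplified reading of that description: non-overlapping windows, gap-free aligned sequences of equal length, and all function and parameter names are assumptions of this sketch, not the authors’ implementation of the score Si.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def introgressed_fraction(focal, candidate, own_species, window=100,
                          min_identity=1.0):
    """Fraction of the focal gene covered by windows flagged as introgressed.

    A window is flagged when the focal sequence reaches min_identity with
    the candidate-species sequence AND is more similar to it than to at
    least one sequence from the focal genome's own species.
    Trailing positions that do not fill a whole window are never flagged.
    """
    flagged_sites = 0
    for start in range(0, len(focal) - window + 1, window):
        end = start + window
        to_candidate = identity(focal[start:end], candidate[start:end])
        lowest_within = min(identity(focal[start:end], s[start:end])
                            for s in own_species)
        if to_candidate >= min_identity and to_candidate > lowest_within:
            flagged_sites += window
    return flagged_sites / len(focal)

# Toy aligned sequences (hypothetical), using a 4 bp window for readability.
focal = "AAAACCCCGGGG"
candidate = "AAAATTTTGGGA"
own_species = ["AAGACCCCGGGG", "AAAACCCCGGGG"]
frac = introgressed_fraction(focal, candidate, own_species, window=4)
```

Here only the first window is flagged: the second window shares no positions with the candidate, and the third falls short of 100% identity.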

Introgression events are not necessarily bidirectional, since a recipient species can gain DNA from a different taxon without providing any DNA in return. However, our results indicate that introgression between most of the BSC-species sharing high levels of introgression is bidirectional, i.e., gene flow has occurred in both directions (Supplementary Fig. 9a). This trend varies across genera: gene flow is largely unidirectional between species within some genera, such as Haemophilus, where H. influenzae (BSC-sp12) displayed 16.69% of its core genes introgressed with H. haemolyticus (BSC-sp10), but only 0.47% of the core genes of H. haemolyticus have been introgressed with H. influenzae (Supplementary Fig. 9b and Supplementary Data 3). Furthermore, we analyzed the origin and diversity of the introgressed genes. We focused on BSC-species with at least 3% of introgressed core genes, and we found that a single species is typically responsible for the majority of the transfers between species. In a few cases, however, a combination of two or three species is responsible for the majority of the transferred genes (Supplementary Fig. 9c, d and Supplementary Data 3). Almost all the cases of introgression from a single donor species involve closely related species as their source of DNA.

Introgression is positively correlated with genome relatedness

We observed that higher levels of introgression typically occur between closely related BSC-species (main text Fig. 4 and Supplementary Fig. 1 and Supplementary Data 3), and we have previously hypothesized that the ability to engage in gene flow is linked to the ability of the recombination machinery to initiate homologous recombination more efficiently when sequences share high identity16. Most of the species within the genus Streptococcus that share the highest levels of introgression are highly related to one another. For instance, S. equinus (BSC-sp99) presents 11% of introgressed core genes shared with S. lutetiensis (BSC-sp3) (these species shared between 91 and 93% ANI). S. pyogenes (BSC-sp65) and S. dysgalactiae (BSC-sp8), which are close relatives (sharing between 89% and 91% ANI), constitute another example within this genus. Both species share over 15% of introgressed core genes with one another (Supplementary Fig. 1ar and Supplementary Data 3). Similarly, within the genus Cronobacter, Cronobacter BSC-sp4 (C. sakazakii) and BSC-sp8 (C. malonaticus) had over 14% of the core genes introgressed with one another (both species shared ~95% of ANI) (Supplementary Fig. 1n and Supplementary Data 3). The same trend was observed in many other genera, such as Acinetobacter, Aeromonas, Bacteroides, and EscherichiaShigella, where species typically show higher levels of introgression with their close relatives.

Fig. 4: Relationship between levels of introgression and core genome sequence identity across all pairs of BSC species.

Although some genera display higher occurrences of introgression, it should be underlined that levels of introgression can vary sharply across species of the same genus (Fig. 2A and Supplementary Data 2). An illustrative example is the genus Vibrio, where species share between 0 and 29.5% of introgressed core genes. Similar trends were observed in several genera, such as EscherichiaShigella, Bifidobacterium, Aeromonas, Haemophilus, and Streptococcus, where species typically share between 0 and 26% of introgressed core genes (Fig. 2A, B and Supplementary Data 2). Since closely related species tend to engage in introgression more frequently than others, it is possible that the differences in introgression levels across genera are partially driven by the higher abundance of closely related species in some genera relative to others. However, a simple correlation between the average identity of BSC-species and their levels of introgression shows that there is no significant relationship between the overall species relatedness within a genus and our estimates of introgression. Although no significant linear correlation was observed, it should be noted that the highest levels of introgression were only observed between more closely related species (sequence identity > 90%).

Introgression and bacterial ecology

Our results suggest that some species are more prone to introgression than others within the same genus. This may be related to the ecological and physical proximity of these species, which would increase opportunities for genetic exchange between them. However, the ecology of most bacteria remains highly speculative, and we do not know the source of many of the isolates. Moreover, strains belonging to the same species often occupy diverse niches, making it impossible to ascribe a specific niche to every species. Nevertheless, we could analyze several BSC species that shared relatively high levels of introgression and that comprise strains that colonize at least two different ecological niches. BSC-species selected within the genera Campylobacter and EscherichiaShigella both comprise agricultural, environmental, and clinical isolates, but BSC-species selected within the genus Vibrio comprise only environmental and clinical isolates. We compared the patterns of introgression within and across different ecological environments (Supplementary Data 7). Overall, we found that within Campylobacter and the group EscherichiaShigella, agricultural isolates shared higher introgression levels with one another than with strains from clinical and environmental sources (Supplementary Data 8). In Vibrio, by contrast, environmental isolates showed more introgressed genes among themselves than between environmental and clinical isolates. However, the introgression levels shared between the environmental and clinical isolates were higher than those between the isolates from the same clinical niches, as also observed in the two other genera above.
The fact that environmental and clinical isolates shared more introgression events than isolates from the same clinical niches could be explained by increased contact opportunities through contaminated water or food, hospital surfaces, and/or medical devices (as is the case for opportunistic pathogens), which can facilitate genetic exchange between environmental and clinical strains. However, these results are based on a small number of species, and no conclusive trend emerges from these limited examples.

Introgression variation by gene function

Because introgression events are potentially adaptive, we analyzed the function of all introgressed genes. All core gene families were classified into COG (Clusters of Orthologous Groups) categories using eggNOG42 (Supplementary Fig. 10). We used a logistic regression to compare the representation of COG categories in the introgressed genes relative to the entire core genome (Supplementary Data 9) among all genera (Fig. 5 and Supplementary Data 10).

Fig. 5: Functions of introgressed genes relative to the entire core genome across all genera.

P-values of COG categories were computed using a logistic regression model, which also included genera as predictors. The data comprise 33,187 unique genes across 24 genera. Statistical significance of coefficients was assessed using two-sided Wald tests, and P-values were adjusted for multiple testing with a Bonferroni correction. Error bars indicate 95% Wald confidence intervals around the estimated regression coefficients (log-odds of introgression). See the legend key for the description of each COG category. COG categories that are significantly more or less frequently introgressed relative to the core genome are colored in blue.

Overall, most functional categories were not significantly introgressed more frequently or less frequently than expected by chance (Fig. 5). Only genes involved in Coenzyme transport and metabolism and in Amino acid transport and metabolism were found to be significantly overrepresented among introgressed genes (P = 0.0006 and P = 0.02, respectively; logistic regression with Bonferroni correction). Interestingly, all the categories in the “Information storage and processing” group were found to be less likely to introgress than expected, but only genes involved in Translation, ribosomal structure and biogenesis were significantly under-represented (P < 10⁻⁵). Translation genes typically evolve under strong purifying selection, and their lower frequency of introgression likely reflects epistatic incompatibilities between these genes. However, there is little difference between the frequencies of the COG categories of the introgressed genes and those of the overall core genome, and this suggests that introgression is rarely adaptive. Nevertheless, introgressed genes most frequently encode metabolism-related functions, and these exchanges could occasionally be beneficial when colonizing a new environment or during fluctuations of resource availability.
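The statistical steps named in the Fig. 5 legend (two-sided Wald tests on regression coefficients, 95% Wald confidence intervals, and a Bonferroni correction) can be illustrated with the short Python sketch below. It starts from coefficients and standard errors that a logistic regression would already have produced; the function names and the numerical inputs are hypothetical and are not results from this study.

```python
import math

def wald_test(coef, se):
    """Two-sided Wald test for a regression coefficient.

    Returns the z statistic, the two-sided P-value under a standard
    normal distribution, and the 95% Wald confidence interval.
    """
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    ci = (coef - 1.96 * se, coef + 1.96 * se)
    return z, p, ci

def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each P-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical log-odds coefficients and standard errors for three COG categories.
estimates = {"cat_1": (0.80, 0.20), "cat_2": (-0.10, 0.25), "cat_3": (0.45, 0.21)}
raw_p = {name: wald_test(coef, se)[1] for name, (coef, se) in estimates.items()}
adjusted = dict(zip(raw_p, bonferroni(list(raw_p.values()))))
```

A category counts as significant only if its adjusted P-value stays below the chosen level, which is why a coefficient with a raw P-value close to 0.05 typically does not survive the correction when many categories are tested.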

Discussion

Overall, our results show that introgression is prevalent across bacterial lineages. Among the 50 genera that we studied, 33 showed evidence of introgression, although some lineages were much more prone to introgression than others. Our results support previous findings: introgression is common in Neisseria16,43, Campylobacter16,36,37,44,45, EscherichiaShigella34, and Streptococcus46,47. Several species are indeed fuzzy, although the delimitation of species is possible, and refining species borders based on gene flow helps with these cases. For instance, frequent introgression within Neisseria makes it difficult to delimit distinct species within this genus. However, inconsistent species delineation has been notoriously problematic in this genus: we classified N. mucosa, N. sicca, N. subflava, N. flavescens, and N. perflava into two different clusters (BSC-sp1 and BSC-sp12), and several closely related strains are currently assigned to different named species (Supplementary Fig. 1ae and Supplementary Data 2). Defining bacterial species using arbitrary ANI thresholds or based on species names can lead to greatly overestimating amounts of introgression. The original motivation of our study was based on the previous observations that bacterial species are often fuzzy, as shown by the frequent incongruencies of gene trees and species trees22,23,24,25. Our results indicate that, after slightly adjusting species borders by taking into account patterns of gene flow, the vast majority of gene trees are actually congruent with species trees.

The core genome phylogeny of several genera is potentially impacted when introgression levels are high. Our results show that very few BSC species display substantial amounts of introgression, but it remains possible that tree topologies may be distorted due to this process. Moreover, introgression events may have more important implications for the reconstruction of species phylogenies and for approaches that rely on rates of sequence evolution across strains (e.g., tests for selection). Indeed, introgression events, albeit rare, can introduce relatively high numbers of new alleles and inflate substitution rates at some loci.

The variability of introgression levels is potentially driven by ecological and geographic factors. Indeed, cohabiting the same ecological niche, such as a human host, likely provides opportunities for frequent genetic exchange between related species. This is the case for the genus Campylobacter, with C. coli (BSC-sp34) and C. jejuni (BSC-sp2) colonizing the gut of humans and livestock, which could explain the high levels of introgression shared between these two species36,37,44 (Supplementary Fig. 1j and Supplementary Data 3). It has been hypothesized that C. coli and C. jejuni are genetically converging through introgression due to a recent ecological change48. Under this model, introgression events represent a source of adaptive alleles transferred between C. jejuni and C. coli48. The group EscherichiaShigella includes species that colonize a broad host range, including the human gut40,49,50. The frequent introgression events observed between these enteric bacterial species may also be driven by ecology. Introgression can potentially be a driver of ecological adaptation, although our analysis did not reveal that some functions are more commonly introgressed than others. However, the ecological information that we could derive from our dataset was very limited, and a robust analysis of bacterial ecology would need to be conducted to decipher the impact of bacterial ecology on introgression. In addition, other factors are likely at play. Higher levels of introgression could be the result of more permissive genetic machineries of recombination and the presence and diversity of vectors capable of mediating gene flow. For instance, Neisseria and Haemophilus are capable of transformation and recognize species-specific DNA motifs51,52,53,54,55,56. These motifs play crucial roles in homologous recombination by facilitating and controlling the process of exchange of genetic material within these genera.
Similarly, Vibrio is capable of transformation and recognizes specific DNA motifs, which are different from those described for other taxa57,58. Natural transformation within the genus Vibrio may play a large part in the frequent homologous recombination occurring between Vibrio species59,60,61. Streptococcus species have been extensively studied for their ability to engage in natural transformation, and their genetic diversity and adaptability could be the result of the frequent introgression that we observed between species within this genus62,63.

DNA transfer is a mechanism that allows rapid adaptation, and introgression can potentially contribute to this process. However, based on the low number of functions preferentially being exchanged by introgression, our results suggest that introgression is not playing a substantial role in bacterial adaptation. Nevertheless, bacterial co-infections and super-infections involving related species may offer opportunities for adaptation mediated by introgression41,63,64,65. Moreover, introgression could potentially help fine-tune housekeeping genes by partially rewriting the sequence of these genes when adapting to a new environment. But despite some potentially beneficial effects, the adaptive advantage of introgression seems much more limited in comparison to the horizontal transfer of accessory genes.

As previously suggested16, our results confirm that introgression is most common between highly related species. This may reflect the fact that homologous recombination requires high sequence relatedness to be efficiently processed35,66. Regardless of species borders, introgression potentially occurs between related bacteria in ecological proximity until genome divergence becomes too substantial to be processed by the recombination machinery. The fact that many related species present high levels of introgression can also represent ongoing bacterial speciation. Indeed, speciation through the reduction of gene flow is a time-dependent process, and it is therefore difficult to draw a line based on sequence thresholds. According to some models, we expect that BSC-species could speciate through the progressive interruption of gene flow12,16,33,34. Under this scenario, the observed patterns of introgression would correspond to decreasing gene flow due to ongoing speciation. However, it can be argued that introgression and ongoing speciation represent two sides of the same coin.

Methods

Datasets

All analyzed genomes were downloaded from the GenBank database (ftp.ncbi.nlm.nih.gov/genomes/). Lineages (n = 51) were selected at the genus level based on the genomes included in our previous study16. All available genomes of each of those genera were downloaded from GenBank and included in the current study. This dataset initially included a total of 40,660 bacterial and archaeal genomes across 2112 named species according to species designations on the NCBI website. We then excluded very closely related strains based on whole genome sequence comparison with a threshold of 99.9% average nucleotide identity (ANI) calculated using FastANI v1.3267. In addition, all the genomes that failed the taxonomic check status on NCBI were removed, and our final dataset resulted in a total of 23,853 genomes across 2044 named species (Supplementary Data 12). Protein-coding genes of each genome were extracted based on the annotations. Note that all species with different names at the time of this study were excluded (for example, 118 species from the genus Bacillus used in the previous study16 have since been renamed, and those species were therefore excluded from this study). In addition, our overall dataset contained a single archaeal genus with 11 named species, and we therefore referred to our dataset as “bacteria” instead of “prokaryotes” to avoid generalizing our results to all prokaryotes. Finally, two genera, Escherichia and Shigella, are highly related39, sharing high sequence identity, and have been shown to be part of the same genus39,49,50. All species from these two genera were therefore grouped into a single bacterial genus (named “EscherichiaShigella”), resulting in a total of 50 genera in this study.
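The removal of near-identical strains can be illustrated with a minimal sketch. The text does not state how a representative was chosen among genomes sharing ≥99.9% ANI, so the greedy, first-come rule below is an assumption, as are the function name `dereplicate` and the toy ANI values; pairs missing from the table are treated as unrelated.

```python
def dereplicate(genomes, ani, threshold=99.9):
    """Greedy dereplication (assumed rule, not necessarily the authors').

    A genome is kept only if its ANI to every already-kept genome is
    below `threshold` (in percent). `ani` maps frozenset({a, b}) -> ANI.
    """
    kept = []
    for g in genomes:
        # Pairs absent from the table are treated as 0% identity.
        if all(ani.get(frozenset((g, k)), 0.0) < threshold for k in kept):
            kept.append(g)
    return kept


# Toy, made-up ANI values (hypothetical genomes A, B, C).
ani = {
    frozenset(("A", "B")): 99.95,
    frozenset(("A", "C")): 98.70,
    frozenset(("B", "C")): 98.65,
}
kept = dereplicate(["A", "B", "C"], ani)
```

With this rule the outcome depends on the input order: here B is discarded because it is within 99.9% ANI of A, which was seen first.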

Definition of core genomes and phylogenetic trees

For each genus, the core genome was built using CoreCruncher as performed previously in ref. 68 with Usearch Global v8.069 and the stringent option. CoreCruncher was used because it can handle large datasets and because it includes a test to exclude potential paralogs and xenologs from the core genome (the inclusion of paralogs and xenologs into orthologous gene families becomes more likely as the number of genomes increases). Orthologs were defined with >70% protein sequence identity and >80% sequence length conservation, and all other parameters were set to default. The core genome was defined as the set of single-copy orthologs found in at least 85% of the genomes within each genus. Protein sequences of each core gene were then aligned using Mafft v7.40770 with default parameters. Protein alignments were then reverse-translated into their corresponding nucleotide sequences. Finally, the nucleotide alignments of all the core genes of each named species within the genus were concatenated into a single large alignment as previously described in ref. 71. Maximum likelihood phylogenomic trees were built from the concatenated alignment of the core genome for each genus using FastTree with the GTR model72. Branch supports were evaluated by generating 1000 bootstrap replicates using the same parameters. The trees were visualized with FigTree V1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).
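As an illustration of the core genome criterion (single-copy orthologs present in at least 85% of the genomes), the following sketch filters precomputed ortholog families. It is not CoreCruncher's algorithm: the input structure, the function name `core_families`, and the choice to discard any family with a duplicated genome are simplifying assumptions.

```python
from collections import Counter


def core_families(families, n_genomes, min_fraction=0.85):
    """Select core families from precomputed ortholog families.

    `families` maps a family id to a list of genome ids, with one entry
    per gene copy. A family is kept if no genome carries more than one
    copy (assumed handling of paralogs) and if it is present in at
    least `min_fraction` of the `n_genomes` genomes.
    """
    core = []
    for fam, members in families.items():
        counts = Counter(members)
        if any(c > 1 for c in counts.values()):
            continue  # multi-copy in at least one genome
        if len(counts) / n_genomes >= min_fraction:
            core.append(fam)
    return core


# Toy example with 20 hypothetical genomes.
genomes = [f"g{i}" for i in range(20)]
families = {
    "famA": genomes[:18],        # present in 90% of genomes, single copy
    "famB": genomes[:16],        # present in 80% of genomes
    "famC": genomes + ["g0"],    # universal, but duplicated in g0
}
core = core_families(families, n_genomes=len(genomes))
```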

Species definition using the ANI of the core genome

The core genome concatenates of each of the 50 genera were used to estimate the ANI of the core genome for all genome pairs. We used this method with a cutoff of 95% ANI, because the ANI of core genes is a slightly more stringent metric, as core genes usually evolve more slowly than accessory genes16,38. Pairwise ANIs of the core genomes were computed using the distmat tool of EMBOSS v.6.6.0.073, which calculates the pairwise nucleotide identities from the alignments38. Then, single linkage clustering was performed as previously described in ref. 38: all genome pairs with a core-genome ANI of 95% or higher were clustered together into a de novo species (ANI-species) (Supplementary Data 13). While a 95% ANI threshold is widely used for bacterial species delineation, numerous studies have demonstrated that this cutoff is not universally applicable across bacterial genera74,75,76. Therefore, in this study, we adjusted ANI thresholds (i.e., 94−96%) for certain genera, including Aeromonas, Bordetella, Cronobacter, Klebsiella, Pseudomonas, Vibrio, Yersinia, Bacillus, and Burkholderia, based on empirical data or taxonomic inconsistencies with the standard 95% ANI cutoff, which can lead to over-splitting or merging distinct species (Supplementary Data 11). These adjustments were supported by clustering patterns, known species taxonomy, and comparative genomic evidence. For example, genera like Streptococcus exhibit high intraspecies diversity, which required slightly relaxed thresholds, while genera such as Bacillus and Klebsiella contain species requiring optimized cutoffs to prevent merging of closely related but phenotypically distinct species. This approach ensured greater alignment with established taxonomies and improved the biological relevance of the resulting species clusters.
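Single linkage clustering at a fixed identity cutoff can be sketched with a union-find structure, as below. The function name `single_linkage`, the input format, and the identity values are illustrative assumptions; only the clustering rule (pairs at or above the cutoff are joined) comes from the text.

```python
def single_linkage(genomes, identities, cutoff=95.0):
    """Cluster genomes by single linkage using union-find.

    `identities` maps (genome_a, genome_b) -> core-genome identity (%).
    Any pair at or above `cutoff` is placed in the same cluster, so
    clusters grow by chaining through intermediate genomes.
    """
    parent = {g: g for g in genomes}

    def find(x):
        # Follow parent links to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ident in identities.items():
        if ident >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())


# Toy, made-up identities for four hypothetical genomes.
identities = {
    ("g1", "g2"): 96.0,
    ("g2", "g3"): 95.5,
    ("g1", "g3"): 94.2,  # below cutoff, but g1 and g3 are linked via g2
    ("g1", "g4"): 90.0,
    ("g2", "g4"): 90.1,
    ("g3", "g4"): 89.8,
}
species = single_linkage(["g1", "g2", "g3", "g4"], identities)
```

The example shows the chaining property of single linkage: g1 and g3 fall in the same cluster even though their direct identity is below 95%.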

Inference of BSC-species

We tested for the presence of gene flow between the core genomes of pairs of ANI-species within each genus. Within each genus, all ANI-species with ≥15 genomes were used as “reference species”. Species with fewer genomes available could not be used as reference species for this analysis. Then, we compared each reference species against one randomly selected genome of each of the other ANI-species from the genus, which we named “candidate species”. For each comparison of a species against another one within the same genus, the core genome concatenate for the reference species + candidate species was re-built with CoreCruncher as described above to infer gene flow using the ConSpeciFix approach21. Each core genome concatenation was used to compute a distance matrix using RAxML version 8.2.12 with the GTR + GAMMA model77. From these distances, the ratio of homoplasic to non-homoplasic alleles (h/m) was computed for (i) the reference species alone and (ii) the reference species + the candidate species. Resampling analyses were conducted as previously described in ref. 21. From this step, graphs and statistics comparing h/m ratios between the genomes of each reference species with and without the candidate species were generated as previously described in ref. 21: the candidate species was inferred as a distinct BSC-species when a significant and substantial reduction of gene flow was detected based on h/m ratios (Wilcoxon test, P < 0.0001). When no clear and significant reduction of gene flow was observed, the reference species and the candidate species were considered as putatively part of the same biological species and further tested for convergent mutations (see the section below).

Convergent mutation test

Because our procedure compares various genomes, some comparisons can occasionally involve species with substantial genomic divergence. As genomes accumulate mutations during divergence, the frequency of convergent mutations increases as well, and this leads to the accumulation of homoplasic alleles that are the result of convergent mutations rather than gene flow. To account for this, we simulated genome sequence evolution with CoreSimul78 for each dataset of reference species + candidate genome as described previously16. The resulting simulated candidate genome sequence obtained for each pair of reference species + candidate species was evolved in silico with mutations but without gene flow, and the simulated genomes were then used to estimate the ratio h/m0 expected to result from convergent mutations alone against the ratios estimated for the reference species (Supplementary Data 4). The estimated values of h/m0 were then compared to the real h/m values obtained between the candidate species and the reference species (h/mcand) as previously described in ref. 16. We considered cases where h/m0 was similar to h/mcand as cases where the signal of gene flow is driven by convergent mutations rather than gene flow (Wilcoxon test, P < 0.0001). In such cases, the reference and candidate species were considered as distinct species. This step was conducted for each pair of reference species and candidate species within each genus (19,357 comparisons).

Inference of introgression events

Introgression events were inferred based on the combination of phylogenetic signal and sequence identity. We used a phylogenetic approach to infer candidate introgression events, where we compared individual gene trees to core genome trees. The two genera with fewer than 100 core genes were excluded from the inference of introgression. Introgression events were defined as sequences of core genes exchanged by homologous recombination between distinct species (the analysis was conducted separately for ANI-species and BSC-species). To infer introgressed core genes, unrooted trees were built for each core gene using RAxML77 with a GTR + GAMMA model for each genus with 100 bootstrap replicates to assess branch support. For each genus, the topology of each unrooted gene tree was compared to the topology of the unrooted genome tree. Each ANI-species and each BSC-species was tested for monophyly in the unrooted core genome tree. A few species were found to be non-monophyletic (n = 50), as inferred from the core genome phylogeny, and those were ignored for the analyses. A putative introgression event of a core gene was inferred when the sequences of the ANI-species (or the BSC-species) were not found to be monophyletic in the unrooted gene tree (see Fig. 1). In practice, each tree was rooted twice, alternately using two different species as an outgroup, and a species’ gene was inferred as monophyletic if found monophyletic in at least one of the two trees. When a putative introgression event was inferred, the two species that engaged in introgression with one another could be determined unambiguously in simple cases (i.e., when the sequence of a single species was nested within the subtree of a different monophyletic species). For more complex cases, the origin of the introgressed sequence was not inferred and was labeled as “unknown”.
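The monophyly test on an unrooted tree can be sketched as a bipartition check: a species is monophyletic if some branch separates exactly its sequences from all others. This is a swapped-in formulation given for illustration, not the double-rooting procedure described above; the nested-tuple tree encoding, the function names, and the toy trees are assumptions.

```python
def clades(tree):
    """Return (leaves of this subtree, leaf sets of all its subtrees).

    A tree is a leaf name (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def is_monophyletic_unrooted(tree, species_leaves):
    """True if one branch splits `species_leaves` from all other leaves.

    Each subtree and its complement are the two sides of a branch, so
    the result does not depend on where the tuple encoding is rooted.
    """
    all_leaves, found = clades(tree)
    target = frozenset(species_leaves)
    return any(c == target or all_leaves - c == target for c in found)


# Hypothetical genome tree: species A, B and C are each monophyletic.
genome_tree = (("a1", "a2"), ("b1", "b2"), ("c1", "c2"))
# Hypothetical gene tree: sequence b1 groups with a2, away from b2.
gene_tree = (("a1", ("a2", "b1")), "b2", ("c1", "c2"))

a_in_genome = is_monophyletic_unrooted(genome_tree, ["a1", "a2"])
a_in_gene = is_monophyletic_unrooted(gene_tree, ["a1", "a2"])
c_in_gene = is_monophyletic_unrooted(gene_tree, ["c1", "c2"])
```

In the toy gene tree, species A is no longer monophyletic while species C still is, which is the kind of conflict flagged as a candidate introgression event.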
To further ascertain the inference of introgression events in our dataset, we restricted our analysis to pairs of BSC-species where both BSC-species have at least 15 genomes because we cannot infer BSC-species with high confidence when fewer than 15 genomes are available (Supplementary Data 3). Because gene flow is not necessarily bidirectional, we inferred introgression events between the two species when the two species were found not to engage in gene flow with each other. Moreover, incomplete lineage sorting (ILS) has been recently described in bacteria, and it has been shown that phylogenetic conflicts among gene trees can be the result of ILS79,80, a mechanism that can be confounded with introgression81. However, introgression is expected to leave a signature distinct from that of ILS: because introgression involves the transfer of a sequence from one species to another, introgression events will theoretically lead to higher sequence identity of the introgressed sequence shared by the two species. In contrast, higher sequence identity is not expected under ILS. Therefore, the candidate introgression events inferred based on phylogenetic incongruencies were confirmed as introgression events when the tested gene displayed substantially higher sequence identity than the overall divergence of the core genome between the two species. For each pair of BSC-species, we calculated the highest identity score for each core gene between the pairs of species, the average identity score of the sequences of that gene between the pairs of BSC-species, and the standard deviation (SD). Similar trends of introgression were inferred between BSC-species for each genus when using a threshold of 2 mean SD, 3 mean SD, and 4 mean SD. The threshold of 2 mean SD was then selected to infer introgression events between BSC-species within each genus (Fig. 1).
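One possible reading of the identity criterion is sketched below: a candidate gene is confirmed when its highest between-species identity exceeds the mean between-species identity of that gene by more than k standard deviations. This interpretation of the "2 mean SD" threshold, the use of the sample SD, the function name `exceeds_threshold`, and the toy values are all assumptions.

```python
from statistics import mean, stdev


def exceeds_threshold(identities, k=2.0):
    """Assumed criterion: max identity > mean + k * SD.

    `identities` holds the pairwise identities (%) of one core gene
    between genomes of two species. The sample SD is used here, which
    is an assumption; at least two values are needed to compute it.
    """
    if len(identities) < 2:
        return False
    return max(identities) > mean(identities) + k * stdev(identities)


# Toy, made-up identities: one sequence pair is nearly identical.
with_outlier = [88.1, 87.9, 88.3, 88.0, 99.6, 88.2, 87.8, 88.1]
# Toy, made-up identities: homogeneous divergence, as expected under ILS.
homogeneous = [88.1, 87.9, 88.3, 88.0, 88.2, 87.8]

confirmed = exceeds_threshold(with_outlier)
rejected = exceeds_threshold(homogeneous)
```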
In addition, to determine the fraction of the gene sequences which was affected by introgression, we analyzed the sequence identity of introgressed genes as previously described in ref. 16. We ran a 100 bp scanning window along each introgressed gene sequence shared between BSC-species and calculated the average, the minimal, and the maximal nucleotide identity for (i) the genomes of the reference species alone and (ii) between the genomes of the candidate species and the genomes of the reference species as previously described in ref. 16. Only introgressed fragments presenting 100% sequence identity between the candidate and reference species were considered as introgressed, but other thresholds were also tested (see Supplementary Fig. 6). Although this analysis provides an estimate of the fraction of introgressed genes that have recombined, it likely underestimates these values since it can miss older introgression events.
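The window-based identity scan can be sketched for a single pair of aligned sequences as follows. The step size is not given in the text, so non-overlapping windows are assumed; the function name `window_identity` and the toy sequences are also illustrative, and gap characters are compared like any other character, which is a simplification.

```python
def window_identity(seq_a, seq_b, window=100, step=100):
    """Percent identity in windows along two aligned sequences.

    Returns a list of (window_start, identity_percent). The default
    step equals the window size (non-overlapping windows, assumed).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y)
        results.append((start, 100.0 * matches / window))
    return results


# Toy 300 bp sequences: the middle window carries 10 substitutions.
seq_a = "ACGT" * 75
middle = list(seq_a[100:200])
for i in range(0, 100, 10):
    middle[i] = "C" if middle[i] == "A" else "A"
seq_b = seq_a[:100] + "".join(middle) + seq_a[200:]

windows = window_identity(seq_a, seq_b)
# Windows at 100% identity, the criterion used for introgressed fragments.
identical = [start for start, ident in windows if ident == 100.0]
```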

Metadata extraction for introgression inference between ecological niches

We obtained genome assemblies from NCBI using the ncbi-datasets_download_script.sh command-line tool, which is part of the NCBI Datasets pipeline (www.ncbi.nlm.nih.gov/datasets). A list of accession numbers of the genomes studied here was provided as input, and the corresponding genome files were downloaded in GenBank Flat File Format (.gbff). Metadata was extracted from each .gbff file using the genbank_info.pl script, a component of the getSequenceInfo toolkit82, to obtain information on the genus, species, strain name, and isolation source of each genome. Based on this metadata, each species was assigned to its corresponding ecological niche. Introgression levels between niches were inferred as described above.

Functional annotation of core genomes into COG categories

To determine whether introgressed genes were biased toward certain functions, the core genes of each genus were annotated and grouped into functional categories by comparison to the COG (Clusters of Orthologous Genes) database using eggNOG-mapper v2.1.442. Within each genus, only the pairs of BSC-species sharing at least ten introgressed genes were analyzed to avoid biases. We also ignored categories A and B, since they were assigned to only 15 genes. We also excluded 385 genes that could not be annotated to any COG. For genes annotated with multiple COG categories, we included a separate observation of these genes for each COG category (i.e., the rows were duplicated for each annotated COG category). These pre-processing steps resulted in 35,351 entries corresponding to 33,187 unique genes across 24 genera. We then applied a logistic regression83 to these data, using the glm function implemented in R. Because bacterial genomes can display different functional compositions across genera, we set the genus as the first predictor variable, which controls for the effect of functional biases across genera on introgression rates before computing the effect of individual COG categories. This model was fit with the genus and COG category variables encoded as sum-to-zero contrasts, which results in coefficients that are relative to the grand mean (i.e., the overall introgression rate), rather than to the introgression rate of an arbitrarily chosen reference level for each categorical variable, which would be the default otherwise. One level of each variable is dropped when running this approach, and the coefficient and standard error of the dropped level were computed afterwards based on the other coefficients of that variable. The coefficient of the dropped level is −1 times the sum of the coefficients of all other levels of the variable, and its standard error is obtained through variance propagation.
We computed 95% confidence intervals as ±1.96 × the standard error for each coefficient. We applied a Bonferroni correction to adjust P-values for multiple testing (i.e., to correct for the 19 COG categories tested). All other BSC-species pairs (and genera composed only of BSC-species pairs sharing <10 introgressed genes) were excluded from the comparison.
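The recovery of the dropped level under sum-to-zero contrasts, together with the confidence interval and the Bonferroni adjustment, can be sketched as follows. The variance propagation shown (the variance of minus the sum of the retained coefficients equals the sum of all entries of their covariance matrix) is the standard result for a linear combination and is assumed to be what was applied; function names and numerical values are illustrative.

```python
import math


def dropped_level(coefs, cov):
    """Coefficient and SE of the level dropped by sum-to-zero coding.

    `coefs` are the fitted coefficients of the retained levels of one
    variable and `cov` is their covariance matrix (list of rows).
    Var(-sum(b_i)) = sum over i, j of Cov(b_i, b_j).
    """
    coef = -sum(coefs)
    variance = sum(sum(row) for row in cov)
    return coef, math.sqrt(variance)


def wald_ci(coef, se, z=1.96):
    """95% confidence interval as coefficient +/- 1.96 x standard error."""
    return coef - z * se, coef + z * se


def bonferroni(p_value, n_tests=19):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy, made-up estimates for a variable with three levels (two retained).
coefs = [0.2, -0.5]
cov = [[0.04, -0.01],
       [-0.01, 0.09]]

coef, se = dropped_level(coefs, cov)
low, high = wald_ci(coef, se)
adjusted = bonferroni(0.001)
```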

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.