Introduction

Phylogenetic trees serve as the cornerstone for studies ranging from estimating the age of lineages1,2 to comparative genomics3 and microbial community ecology4,5, as they recapitulate the evolutionary history of species6,7. Inference of phylogenetic trees relies on identifying phylogenetic markers from homologous sequences that descend vertically from a common ancestor (orthologs)8,9,10. In addition to orthologs, microbes harbor homologous genes that have been exchanged through horizontal gene transfer (HGT)11,12. The deep divergence times of microbes, estimated at around 4 billion years1, have entangled homologs into complex relationships that obscure the precise identification of orthologs8,10. Nonetheless, new tree inference methods have been developed to bypass the identification of orthologs, thereby enabling the usage of all homologous sequences of a gene family as potential markers13,14. Because the downstream applications of phylogeny heavily depend on the tree quality, it is critical to select a combination of markers that yields the most accurate tree.

Metagenome-assembled genomes (MAGs) comprise the major genomic source of microbial diversity15. And yet, the gold standard 16S rRNA marker used for phylogenetic surveys of microbial diversity16 is rarely recovered from shotgun metagenomic sequences17,18. Moreover, 16S rRNA-based trees reflect only the evolution of the gene rather than the set of species19. To improve the tree accuracy, modern phylogenetic surveys have adopted a larger number of markers involved in housekeeping functions, such as ribosomal proteins or aminoacyl-tRNA synthetases20,21,22,23,24. These markers have been selected from a fixed collection of whole genomes spanning the major microbial phyla, thereby biasing the representation of markers toward well-characterized taxa. In contrast, MAGs seldom contain the entire genomic repertoire of a population25,26, and some even lack ribosomal proteins due to assembly errors27, reducing the number of markers available for tree inference. Therefore, no one-size-fits-all set of markers exists. To account for the novel diversity and heterogeneous quality of MAGs, marker selection needs to be flexible and tailored to the input genome collection.

Although the inference of microbial trees has shifted from using a single marker28,29 to multiple markers20,21,22,23,24,30, their selection is restricted to universal orthologous genes, which are commonly defined as being present in 90% of genomes and existing as a single copy in at least 95% of them31,32,33, severely limiting the number of markers considered. However, recent studies have shown that including gene families beyond standard universal orthologous genes leads to improvements in the accuracy of inferred trees13,14. While these studies highlighted the necessity for a comprehensive assessment of the phylogenetic signal from a broader set of genes, they did not attempt to select new markers. Here, we systematically select gene families to serve as markers for deep microbial phylogenomics tailored to the input genome collection. Leveraging recent advances in tree inference methods and genome annotation databases, we show that an expanded selection of markers yields species trees with higher accuracy than previous sets of markers in real-world datasets of whole genomes and MAGs derived from a wide range of environments. In addition to genes involved in replication, translation, and transcription, we found that markers have functional annotations related to metabolism, cellular processes, and environmental information processing, and all of them provide phylogenetic signals for tree inference. We also show that our automated, systematic, free-of-domain expertise, and tailored marker selection is robust against uneven taxon sampling and incomplete MAGs while remaining flexible in the number of markers to select and the choice of annotation database. Overall, we present a new method for Tailored Marker Selection (TMarSel), available as a software tool, that can be applied to modern genomic datasets, providing a foundation for more robust and accurate phylogenomic reconstruction.

Results

A vast yet unexplored gene family space for microbial phylogenomics

We surveyed a collection of 1510 whole reference genomes sampled across the microbial tree of life from the Web of Life 2 (WoL2) dataset to obtain an accurate representation of the gene family distribution in microbes. Specifically, we selected complete bacterial and archaeal genomes, ensuring that each taxonomic group, from phylum to family, had at least 10 taxa (Supplementary Fig. 1). We then annotated open reading frames (ORFs) of genomes into gene families using the KEGG and EggNOG databases, which resulted in 54% and 94% of ORFs annotated, respectively. KEGG gene families are scattered throughout genomes (Fig. 1a), ranging from universal to lineage-specific and from single-copy to multi-copy. The traditional criteria for marker selection are restricted to genes present in at least 90% of genomes and containing one copy in at least 95% of them31,32,33. We observe that only 1% of gene families annotated from the WoL2 genomes fall within the region defined by traditional criteria. This pattern highlights the limited number of gene families used for microbial phylogenomics. The limitation is further exacerbated in gene families annotated from 793 MAGs of the Earth Microbiome Project (EMP), despite a lower annotation rate, as only 47% and 87% of ORFs could be assigned to gene families with KEGG and EggNOG databases, respectively. This lower rate is mainly due to MAGs containing a large fraction of ORFs that cannot be assigned to gene families34. We observe that MAGs do not have gene families that conform to the traditional criteria. And yet, genomes and MAGs harbor, on average, 1289 and 846 gene families, respectively, that might add new phylogenetic signals to the tree inference process. EggNOG gene families share the same characteristics (Supplementary Fig. 2), suggesting that these trends are independent of the annotation database.

Fig. 1: Expanding the set of phylogenetic markers with TMarSel.

a Summary statistics of KEGG gene families annotated from 1510 WoL2 genomes (top) and 793 EMP MAGs (bottom). The left panels show the gene family space defined by the proportion of genomes in which each gene family is present and the proportion of genomes in which the gene family exists as a single copy. Dashed orange lines represent traditional marker selection criteria (i.e., present in at least 90% of genomes and containing a single copy in at least 95% of them), and the orange region highlights the area defined by these criteria. Each dot corresponds to a gene family. The middle panels show the number of gene families outside and within the region defined by traditional criteria. The right panels show the distribution of gene families across genomes and MAGs, with dashed vertical lines indicating the arithmetic mean. b Illustration of the pipeline used for species tree inference (top) and detailed overview of TMarSel’s steps for marker selection (bottom). c A simple toy case with 5 genomes and all 32 combinations of gene presence (black squares) or absence (white squares). TMarSel iteratively selects k = 10 markers (crimson shade) that maximize the generalized mean of the number of selected genes per species, as denoted on top. Source data are provided as a Source Data file.

To assess whether these overlooked gene families can contribute phylogenetic signals to the inference process, we developed a robust yet flexible method, TMarSel, to select gene families annotated with the KEGG and EggNOG databases as markers for deep microbial phylogenomics. TMarSel takes as input a file mapping ORFs to gene families. The user can then control the copy number per genome and gene family using a threshold ranging from 0 to 1, where 0 includes all ORFs and 1 includes only the ORFs with the highest bit scores obtained during genome annotation. Next, TMarSel builds a matrix containing the copy number of gene families across genomes, upon which it employs an algorithm to iteratively select k markers such that the generalized mean number of markers per genome is maximized (see Methods for details; Fig. 1b, c). TMarSel uses a single core, and its runtime scales sublinearly with the number of markers, while the memory footprint remains constant (Supplementary Fig. 3). As a reference, selecting 1000 markers from the WoL2 and EMP datasets takes 10 min and requires 10 GB of memory. The selected markers are fed into a standard pipeline to infer species trees (Fig. 1b). Briefly, for each marker, we generate a multiple sequence alignment, which is used to infer a gene tree. Because gene trees can include all the homologs of a gene family, we used the summary method ASTRAL-Pro2 (refs. 35,36), which takes as input a set of gene trees and all their homologs, to infer a species tree. We then evaluate the quality of the inferred species trees as a direct measure of the goodness of the selected marker sets.
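As a rough illustration of the first two steps described above, the following Python sketch filters ORFs by bit score and builds the copy number matrix. The tuple layout, the function names, and the rule that keeps an ORF when its bit score is at least the threshold times the best score within the same genome and gene family are assumptions made for illustration; the actual TMarSel input format and thresholding rule may differ.

```python
from collections import Counter, defaultdict

def filter_by_bit_score(hits, threshold):
    """Keep ORFs scoring at least `threshold` times the best hit of their group.

    `hits` is a list of (orf_id, genome, family, bit_score) tuples (assumed layout).
    threshold = 0 keeps every ORF; threshold = 1 keeps only the top-scoring ORF(s)
    of each (genome, family) pair. This rule is an illustrative assumption.
    """
    best = defaultdict(float)
    for _, genome, family, score in hits:
        best[(genome, family)] = max(best[(genome, family)], score)
    return [h for h in hits if h[3] >= threshold * best[(h[1], h[2])]]

def copy_number_matrix(hits):
    """Count ORFs per (family, genome) and return families, genomes, and the matrix."""
    counts = Counter((family, genome) for _, genome, family, _ in hits)
    families = sorted({f for f, _ in counts})
    genomes = sorted({g for _, g in counts})
    # Rows are gene families, columns are genomes; absent pairs count as zero.
    matrix = [[counts[(f, g)] for g in genomes] for f in families]
    return families, genomes, matrix

# Hypothetical annotation hits, not taken from the study.
hits = [
    ("orf1", "G1", "K00001", 250.0),
    ("orf2", "G1", "K00001", 120.0),
    ("orf3", "G2", "K00001", 300.0),
    ("orf4", "G2", "K00002", 80.0),
]
families, genomes, all_copies = copy_number_matrix(filter_by_bit_score(hits, 0.0))
_, _, best_copies = copy_number_matrix(filter_by_bit_score(hits, 1.0))
```

With a threshold of 0, genome G1 keeps both copies of K00001; with a threshold of 1, only its highest-scoring copy remains.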

A well-balanced marker selection produces highly accurate trees in simulated data

TMarSel performs an iterative selection of markers tailored to the input genome collection. Its behavior can be customized with two parameters: the total number (k) of markers to select and the exponent p of the generalized mean. Practically, p biases the selection of markers toward families present in genomes with fewer gene families (if p < 0) or toward families present in genomes with more gene families (if p > 0; see Methods). To assess the impact these parameters have on the inferred trees, we first simulated a toy dataset of 50 gene families from 10 genomes across 25 replicates. In each replicate, we built a matrix containing the copy number of gene families across genomes (Supplementary Fig. 4a). We derived a species tree from the matrix using neighbor-joining over the Jaccard distances between genomes, upon which gene trees were also derived (see Methods). We then performed a parameter sweep for k and p and varied the maximum number of copies for each gene family and the noise present in gene trees. Noise refers to the proportion of leaves that have been randomly shuffled. We gauged the error in inferred trees as the normalized Robinson-Foulds (nRF) distance37 between inferred and real trees, where smaller distances indicate fewer errors and vice versa. Our simulations show that selecting a large number of markers reduces the error in the species trees, primarily when noisy gene trees are used for inference (Fig. 2a and standard deviations in Supplementary Fig. 5a). The parameter sweep shows that p ≤ 0 yields the species trees with fewer errors, with p = 0 as an inflection point. Moreover, we observe that having multiple copies of the same marker does not improve the inference process. Instead, they negatively impact quality, as errors increase with the number of copies.
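The nRF distance used above can be sketched in a few lines of Python. The sketch assumes trees are written as nested tuples of leaf names, treats them as unrooted, and normalizes by the total number of non-trivial splits in both trees, which equals 2(n − 3) when both trees are binary; the function names are hypothetical and this is not the implementation used in the study.

```python
def clades(tree, out):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(clades(child, out) for child in tree))
    out.append(leaves)
    return leaves

def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted."""
    out = []
    all_leaves = clades(tree, out)
    ref = min(all_leaves)  # reference leaf used to pick a canonical side
    result = set()
    for clade in out:
        # Represent each split by the side that does not contain the reference leaf.
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result

def nrf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance: 0 for identical topologies, 1 for no shared split."""
    sa, sb = splits(tree_a), splits(tree_b)
    total = len(sa) + len(sb)
    return len(sa ^ sb) / total if total else 0.0

true_tree = ((("A", "B"), "C"), ("D", "E"))
one_swap = ((("A", "C"), "B"), ("D", "E"))
unrelated = ((("A", "D"), "C"), ("B", "E"))
```

Here `nrf(true_tree, one_swap)` is 0.5 because the two trees share one of their two non-trivial splits, whereas `nrf(true_tree, unrelated)` is 1.0.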

Fig. 2: Impact of marker selection parameters on trees inferred from simulated data.

a Contour plots of the mean normalized Robinson-Foulds (nRF) distance between inferred and real trees from simulated datasets of 50 gene families from 10 genomes across n = 25 replicates. b Contour plots of the mean nRF distance between inferred and real (top) and proportion of confident branches in inferred trees (bottom) from simulations of realistic prokaryotic gene families across n = 25 replicates. Marker selection was applied to each combination of the number of markers (k) and exponent (p). Each plot illustrates a scenario with different noise levels in gene trees (a, b) and maximum copies of gene families (only a). Source data are provided as a Source Data file.

We further assessed the impact of parameters on tree quality using a gene family space that resembles real datasets (Supplementary Fig. 4b). We simulated prokaryotic gene families using realistic duplication, transfer, and loss values38 across 25 replicates using the phylogenetic simulator Zombi39. Each replicate yielded a species tree, genomes for each taxon, and gene trees for each gene family. We then built a matrix containing the copy number of gene families across taxa, introduced noise to the gene trees, and performed a parameter sweep for k and p (see Methods). In addition to the nRF, we gauged the proportion of confident branches in the inferred trees as the number of branches with a Local Posterior Probability (LPP) score higher than 0.95 out of the total number of branches, as suggested in ref. 40. A higher proportion is an indicator of a more accurate tree topology. As in the previous simulations, our results indicate that a large number of markers reduces the error and reinforces p = 0 as an inflection point that achieves the lowest error in the inferred trees (Fig. 2b; standard deviations are provided in Supplementary Fig. 5b). The proportion of confident branches shows a similar pattern for k and p and is correlated with the error in the inferred trees. This indicates that trees with fewer errors have a higher proportion of confident branches and vice versa. The simulations also show that a saturation value occurs for k, at which the accuracy of inferred trees does not improve. In light of these results, we choose p = 0, which refers to maximizing the geometric mean number of markers per genome, for all subsequent analyses while varying k until reaching a plateau in quality.

Expanded marker selection improves the accuracy of the microbial tree of life

We next evaluated whether gene families annotated from the genomes of the WoL2 can serve as the foundational genetic elements for an accurate microbial tree of life. Because the simulation results suggested that multiple copies of the same gene family negatively impact the tree inference process (Fig. 2a), we first evaluated how many copies of each gene family should be included in the analyses. Using the bit score threshold assigned during genome annotation with KEGG and EggNOG databases, we controlled the number of copies of gene families (see Methods). Our results indicate that using a low number of copies yields the tree with the highest quality as measured by the proportion of confident branches (i.e., accuracy in topology) and nRF distance to the reference phylogeny (Supplementary Fig. 6). Consequently, we included only the ORFs with the highest bit score assigned to each gene family during genome annotation for marker selection.

We then benchmarked the performance of an expanded selection of markers for inferring the microbial tree of life. Fig. 3a, b show the quality of trees inferred using an increasing number of markers, ranging from k = 10 to 1000. For comparison, we also included four sets of previously proposed universal markers widely adopted for deep microbial phylogenomics1,23,30,41. Trees inferred with our markers exhibit the highest accuracy in topology and lower error relative to the WoL2 reference phylogeny compared to previous markers. However, the 400 universal markers from PhyloPhlAn3 achieve the lowest error, which is expected because the WoL2 phylogeny was reconstructed using the PhyloPhlAn3 markers. And yet, the high quality of trees inferred from either KEGG or EggNOG gene families speaks for the robustness of TMarSel for selecting markers. Moreover, the smooth saturation curve in tree quality suggests that for any given k, our method can select the best set of gene families for tree inference. Filtering out genomes with fewer than 25% of markers leads to trees with slightly lower quality across marker sets. Nonetheless, more markers per genome result in better species placement within the tree (Supplementary Fig. 7). Additionally, trees inferred from all marker sets exhibit clades consistent with the GTDB taxonomy (Supplementary Fig. 8), according to taxonomic accuracy metrics that measure the consistency between taxonomy and phylogeny (see Methods). Altogether, these results suggest that all marker sets recover known relationships among clades, while markers derived from KEGG and EggNOG gene families yield more accurate trees compared to previous sets.

Fig. 3: Evaluating the performance and functional makeup of TMarSel-selected markers on genomes from the WoL2 dataset.

a Topological accuracy, measured as the proportion of confident branches, of inferred trees from 1510 WoL2 genomes. b nRF distance between inferred trees and WoL2 reference phylogeny. Left panels indicate the tree quality of previous marker sets, whereas middle and right panels show the tree quality of KEGG and EggNOG markers, respectively. Each dot corresponds to a tree inferred from different sets of markers (color-coded). Solid lines and filled symbols indicate trees derived from genomes containing at least 25% of markers, while dashed lines and unfilled symbols indicate trees from genomes with at least one marker. c Relative abundance of KEGG higher functional categories (color-coded) across markers selected from KEGG gene families. d Topological accuracy of trees inferred from taxonomically imbalanced datasets generated from the WoL2 genomes across n = 10 replicates and e nRF distance between inferred trees and the WoL2 reference phylogeny. Unfilled bars indicate that trees were inferred from genomes with at least one marker, and their color displays the marker set (color-coded). Data are presented as the mean ± standard deviations. Asterisks show the significance level of two-sided two-sample Mann-Whitney U tests after Benjamini–Hochberg correction between trees inferred from KEGG (left) and EggNOG (right) markers versus previous markers. Significance levels: n. s.: p > 0.05; *: p ≤ 0.05; **: p ≤ 0.01; ***: p ≤ 0.001. Source data and test statistics are provided as a Source Data file.

Because TMarSel iteratively selects more than twice the markers compared to the largest set available (i.e., 400 markers from PhyloPhlAn3), we investigated the functional classification of these new markers. Fig. 3c shows the abundance of higher functional categories of markers selected from KEGG gene families. When 10, 50, and 100 markers are selected, 60%, 66%, and 53% of them, respectively, have functions related to genetic information processing, suggesting that genes involved in replication, transcription, and translation are preferred. Yet, as more markers are added, the selection shifts toward gene families annotated as carrier proteins, shape-determining proteins, and others that are not assigned to higher functional categories (see Supplementary Data 1 for a complete description). When more than 600 markers are selected, 38% of gene families are involved in metabolic functions, while 4% and 5% are involved in cellular processes and environmental information processing, respectively. We also observe an overlap in the functional descriptions of KEGG and EggNOG gene families (Supplementary Data 1), which speaks to the robustness of selection. These results highlight the diverse functional roles of these new markers that contribute to the phylogenetic signal.

The prevalence of mobile genetic elements (MGEs) in genomes from major microbial phyla42 suggests that genome annotations are likely to yield multiple MGE gene families. Therefore, we assessed whether gene families annotated as MGEs were selected as markers by surveying their functional description terms (see Methods). We found that putative transposases, integrases, and competence proteins were selected. Nonetheless, they account for less than 2% of markers and represent less than 13% and 6% of the total number of gene families annotated with MGE descriptions in the KEGG and EggNOG databases, respectively (Supplementary Fig. 9). This showcases the robustness of TMarSel against the over-selection of MGEs.

We also assessed the impact of imbalanced taxonomic distributions on the tree topology. To achieve this, we generated skewed datasets with many poorly represented taxa and a few overrepresented taxa at the phylum level, based on the WoL2 genomes, across 10 replicates (see Methods). We gauged the taxonomic imbalance of a dataset using the Gini index, as suggested in ref. 23. Index values range from 0 to 1, where 1 indicates a highly imbalanced dataset and vice versa. The WoL2 dataset has a Gini index of 0.59, while the generated datasets show an average Gini index of 0.66 ± 0.04, indicating a greater imbalance. For each dataset, we then selected an increasing number of markers from KEGG and EggNOG gene families and inferred trees from genomes containing at least one marker. We also included universal marker sets for comparison. As previously suggested, taxonomic imbalance decreases the quality of tree topology43. However, trees inferred from markers selected with our method show a higher quality than trees from previous sets. Specifically, as the number of KEGG markers increases, the average proportion of confident branches increases from 0.60 ± 0.05 to 0.78 ± 0.05 compared to 0.71 ± 0.08 from the best-performing marker set (two-sided two-sample Mann–Whitney U test, p = 0.121; Fig. 3d). The error relative to the WoL2 reference phylogeny decreases from 0.22 ± 0.02 to 0.18 ± 0.02, matching the error of PhyloPhlAn3 markers at 0.18 ± 0.02 (two-sided two-sample Mann–Whitney U test, p = 0.650; Fig. 3e), from which the WoL2 phylogeny was reconstructed. Trees inferred from EggNOG markers show the same trends. Overall, these results suggest that TMarSel markers mitigate the impact of taxonomic imbalance on tree quality, making it particularly suitable for metagenomic datasets where taxonomic imbalance is prevalent. For instance, the EMP dataset has a Gini index of 0.73, indicating a greater taxonomic imbalance than the WoL2 dataset.
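For readers unfamiliar with the Gini index as a measure of taxonomic imbalance, a minimal sketch follows. It uses the standard rank-based formula on the number of genomes per phylum; whether ref. 23 uses exactly this variant is an assumption, and the example counts are invented.

```python
def gini(counts):
    """Gini index of non-negative counts (0 = perfectly even, near 1 = highly skewed)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Hypothetical numbers of genomes per phylum.
balanced = [5, 5, 5, 5]
skewed = [0, 0, 0, 10]
```

For a finite number of groups n, the maximum attainable value is (n − 1)/n, so `gini(skewed)` evaluates to 0.75 rather than 1.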

Robust marker selection yields accurate trees for MAGs despite incomplete genomic data

Most of the microbial diversity comes from MAGs derived from environmental samples15. Microbial communities from environments as diverse as seawater, soil, sediment, or animal gut sequenced in the EMP dataset have contributed to the genomic corpus of microbial diversity44. To test whether marker selection with TMarSel yields accurate phylogenies for MAGs, we used 793 high-quality MAGs from the EMP dataset annotated with KEGG and EggNOG databases. Because no reference phylogeny exists for the EMP dataset, we evaluated the quality of the inferred trees using only the proportion of confident branches, which measures the accuracy of topology. Similar to the WoL2 results, we found that using the ORFs with the highest bit score for each marker results in more accurate topologies (Supplementary Fig. 10). Although the overall quality of trees is lower than that of trees from the WoL2 dataset, the TMarSel markers selected from KEGG and EggNOG gene families produce trees with the highest accuracy compared to previous sets of universal genes (Fig. 4a). The increase in accuracy is more pronounced when more markers are used in the inference process (k ≥ 400), though a plateau is reached at k = 800. Moreover, filtering out MAGs with fewer than 25% of markers yields trees with ~5% more confident branches. Although the filtering step decreases the number of MAGs across all marker sets, trees inferred with TMarSel retain more MAGs than the next-best set of universal genes from PhyloPhlAn3 (Supplementary Fig. 11). The functional annotation and proportion of MGEs in TMarSel markers are congruent with those selected from the WoL2 dataset (Supplementary Fig. 12a, b).

Fig. 4: Performance of TMarSel-selected markers on MAGs from the EMP dataset.

a Topological accuracy, measured as the proportion of confident branches, of trees inferred from 793 complete MAGs from the EMP dataset. Left panels indicate the tree quality of previous marker sets, whereas middle and right panels show the tree quality of KEGG and EggNOG markers, respectively. Symbols indicate trees inferred from different marker sets (color-coded), while line style and filled or unfilled symbols distinguish trees inferred from MAGs containing at least 25% of markers versus those containing at least one marker. b Mean topological accuracy of trees inferred from simulations of 793 increasingly incomplete MAGs across n = 10 replicates. c nRF distance between trees inferred from incomplete MAGs and trees inferred from complete MAGs. The incompleteness threshold refers to the proportion of ORFs removed from each MAG. Unfilled bars indicate that MAGs contained at least one marker. Data are presented as the mean ± standard deviations. Asterisks show the significance level of two-sided two-sample Mann-Whitney U tests after Benjamini–Hochberg correction between the quality of trees inferred from KEGG versus PhyloPhlAn3 markers. Significance levels: n. s.: p > 0.05; *: p ≤ 0.05; **: p ≤ 0.01; ***: p ≤ 0.001. Source data and test statistics are provided as a Source Data file.

MAGs represent draft microbial genomes, but most MAGs do not capture all the genomic content of a microbe25. To assess whether accurate trees can still be inferred, we simulated increasingly incomplete MAGs across 10 replicates. The incompleteness threshold is defined as the proportion of ORFs removed from a MAG (see Methods). We then selected markers (k = 400) from KEGG-derived gene families, as they produced the most accurate trees. To obtain a comprehensive picture of the impact of incompleteness, MAGs with at least one marker were included. We then measured the accuracy in the topology of trees inferred from incomplete MAGs and the mean nRF distance (error) between trees inferred from complete MAGs and incomplete MAGs. We also included trees inferred with the 400 universal genes from PhyloPhlAn3 for comparison and performed Mann-Whitney U tests with multiple testing corrections to assess whether differences in tree quality were significant. As expected, the accuracy of the trees decreases with the incompleteness threshold because fewer gene families are available for selection. Nonetheless, on average, trees inferred from TMarSel markers have 8% more confident branches than trees inferred from PhyloPhlAn3 universal genes across incompleteness thresholds (Fig. 4b). In addition, the robustness of our markers is evident in the low errors of inferred trees, which increase more slowly at higher thresholds than those of trees inferred from PhyloPhlAn3 markers (Fig. 4c). Overall, these results suggest that a tailored marker selection from MAGs yields more accurate trees compared to previous sets of markers, even when MAGs lack a fraction of ORFs.
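The incompleteness threshold can be illustrated with a short Python sketch that removes a given proportion of ORFs from a MAG. Uniform random removal without replacement, the seed handling, and the function name are assumptions; the exact simulation protocol is the one described in the Methods.

```python
import random

def make_incomplete(orfs, threshold, seed=0):
    """Return a copy of `orfs` with a `threshold` proportion removed at random.

    Uniform sampling without replacement is an illustrative assumption.
    """
    rng = random.Random(seed)          # local generator keeps replicates reproducible
    n_remove = round(threshold * len(orfs))
    removed = set(rng.sample(range(len(orfs)), n_remove))
    return [orf for i, orf in enumerate(orfs) if i not in removed]

# Hypothetical MAG with 100 ORFs and an incompleteness threshold of 0.3.
mag = [f"orf{i}" for i in range(100)]
incomplete = make_incomplete(mag, 0.3, seed=1)
```

Using a different seed per replicate gives independent incomplete versions of the same MAG while preserving the original ORF order.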

Discussion

In this study, we assessed the phylogenetic signal of gene families for microbial phylogenomics using TMarSel, a robust yet flexible method for marker selection. We show that an expanded and tailored selection of markers can improve the accuracy of phylogenetic trees across simulated and real-world datasets of whole genomes as well as incomplete MAGs.

TMarSel provides a systematic exploration of the gene family space because it expands the source of markers to gene families instead of only orthologs. Among the currently available methods that model the evolutionary histories of gene families for tree inference, only ASTRAL-Pro2 scales efficiently with a large number of genomes and genes14, allowing us to assess the impact of different parameter combinations on the quality of the species trees. Although ASTRAL-Pro2 only accounts for gene duplication and loss, its quartet-based approach is robust against HGT45. Moreover, the taxonomic consistency of clades, from the phylum to the genus levels, in trees inferred from TMarSel markers speaks for the robustness of our results despite the pervasiveness of HGT among microbes. Because we select gene families as markers, TMarSel is also compatible with other tree inference methods that actively account for gene duplication, loss, and transfer46,47,48.

While previous marker sets were selected from a pre-defined collection of sequences and had a fixed number of markers, we offer more flexibility in both aspects. First, we rely on functional databases to obtain gene families from the input genomes, upon which TMarSel is applied. Our results show that markers selected from either KEGG or EggNOG yield trees with similar accuracy. This suggests a potential generalization to alternative databases, such as PFAM, UniRef, and MetaCyc. The choice of the database may depend on the focus of the study and downstream applications. For instance, KEGG links gene families to metabolic pathways, chemical reactions, enzymes, and other high-level functions49,50,51, whereas EggNOG, in addition to functional annotations, provides evolutionary details of gene families52,53. Because TMarSel relies on functional annotations, ORFs not assigned to gene families, the so-called microbial dark matter34, are not considered during the selection of markers. Therefore, novel potential marker genes are omitted. The prevalence of the microbial dark matter is more pronounced in MAGs, as only 49% and 87% of ORFs have been annotated with KEGG and EggNOG databases, respectively. However, an increasing number of databases are being developed to identify a greater fraction of the microbial dark matter54,55,56, which TMarSel can leverage to provide a state-of-the-art set of markers. Moreover, since TMarSel requires a file mapping ORFs to gene families, users may also provide gene families inferred de novo. It is worth noting, however, that the quality of families can greatly impact the tree topology. Therefore, we suggest prioritizing the use of sound gene family inference methods.

Regarding the second aspect, rather than setting a fixed number of markers, TMarSel can select the best k markers despite taxonomic imbalance (Fig. 3d, e). We show that increasing the number of markers yields more accurate trees. The increase in accuracy follows a saturation curve that suggests an optimal range for k, which can be identified by selecting an increasing number of markers until reaching a plateau. Since our selection method follows an iterative approach, smaller sets are distilled versions of larger marker sets. The iterative nature of TMarSel also suggests that the first selected markers contribute more phylogenetic signal, as shown in the saturation curves of tree quality (Figs. 3a, b, and 4a). Therefore, if computational resources are limited, especially for tree inference, a small set of markers still yields accurate trees. It has been suggested that at least 30 markers should be used for microbial phylogenomics57. However, the minimum number depends on the input data, as 50 markers yield trees with different proportions of confident branches when genomes or MAGs are used. As a result, the minimum number of markers increases when inferring phylogenies from MAGs.

MAGs comprise a large portion of the prokaryotic diversity, and their representation in genomic databases will only grow as more environmental samples are sequenced15,58. However, the integration of MAGs into phylogenomic pipelines comes with challenges. First, as we have shown, MAGs do not have gene families that conform to the traditional criteria for selection. Second, even high-quality MAGs, gauged as complete by CheckM, do not contain the entire genomic diversity of a microbial population25,26, and some MAGs even lack ribosomal RNA and ribosomal protein genes due to assembly errors27. TMarSel effectively addresses the first challenge, as it can systematically explore the gene family space and select the best k markers tailored to the input set of MAGs. The second challenge is more complex because incomplete MAGs inherently lack genomic data. And yet, we show that it is still possible to select markers with strong phylogenetic signals, yielding trees with high accuracy while allowing for partial recovery of the tree topology.

TMarSel effectively selects the optimal combination of markers from gene families that serve as the foundational genetic elements for inferring accurate phylogenetic trees from a tailored input of whole genomes and MAGs, despite taxonomic imbalance and incomplete genomic data. Since TMarSel primarily relies on the presence-absence patterns of gene families, it remains agnostic to the taxonomy of the input genomes or MAGs, as well as functional constraints, allowing for the selection of markers with diverse functional roles. We also show that marker selection can be agnostic to evolutionary rate, alignment quality, and non-vertical evolution. However, further evaluations of these properties are desirable to identify rogue markers that are detrimental to the inference process. In addition, we have only assessed the performance of our method for taxa spanning across the microbial tree of life. And yet, the robustness of our results indicates a potential generalization to fine-grained taxonomic groups, as well as other annotation databases and tree inference methods.

Methods

Marker gene selection

We represented gene families U = {u1, . . . , um} across genomes V = {v1, . . . , vn} as a 2D matrix Am×n, where an entry is positive if gene family ui exists in genome vj and its value corresponds to the number of times the gene is identified in the genome (i.e., copy number). Gene families absent from a given genome were represented with a zero in A. To select a set of k marker genes G = {g1, . . . , gk}, k < m, we devised an algorithm that, in each of k iterations, selects the gene g that maximizes the objective function \(\arg \max \left(\frac{1}{n}\sum_{j=1}^{n}\left(\mathbf{c}_{k-1}+\mathbf{A}_{i,j}\right)^{p}\right)^{\frac{1}{p}}\), where c1×n is a cumulative vector containing the copy numbers of the genes already selected, which were removed from A (see Appendix 1 in Supplementary Information). Note that our objective function is analogous to the generalized mean with exponent p. Small values of p shift the objective function toward small values, selecting genes present in genomes with fewer gene families. In contrast, large values of p will select genes from genomes with many genes. Because A contains zeroes, our objective function will return zero for p ≤ 0 (Supplementary Fig. 4A). To avoid this issue, we added a pseudocount of 0.1 to A when selecting marker genes for all values of p.
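The greedy selection described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the TMarSel implementation: the function names are hypothetical, the p = 0 case is treated as the geometric mean (the limit of the generalized mean, which is an assumption here), and the pseudocount is added to every entry of A before selection.

```python
import numpy as np

def generalized_mean(x, p):
    """Generalized (power) mean of strictly positive values x with exponent p."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        # Assumption: use the p -> 0 limit, i.e. the geometric mean.
        return float(np.exp(np.mean(np.log(x))))
    # Evaluate in log-space so that large |p| does not overflow.
    logs = p * np.log(x)
    shift = logs.max()
    return float(np.exp((shift + np.log(np.mean(np.exp(logs - shift)))) / p))

def select_markers(A, k, p, pseudocount=0.1):
    """Greedily pick k rows (gene families) of the m x n copy number matrix A."""
    A = np.asarray(A, dtype=float) + pseudocount
    m, n = A.shape
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < m")
    remaining = list(range(m))
    c = np.zeros(n)          # cumulative copy numbers of the selected genes
    selected = []
    for _ in range(k):
        scores = [generalized_mean(c + A[i], p) for i in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        c += A[best]
        remaining.remove(best)  # a selected gene cannot be picked again
    return selected

# Row 0 is single-copy in every genome; row 1 is multi-copy in one genome only.
toy_matrix = [[1, 1, 1, 1],
              [5, 0, 0, 0],
              [0, 0, 1, 1]]
first_pick_p0 = select_markers(toy_matrix, k=1, p=0)
first_pick_p100 = select_markers(toy_matrix, k=1, p=100)
```

In this toy matrix, p = 0 favors the gene family present in every genome, whereas a large positive p favors the family with a high copy number in a single genome.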

To assess the wall clock time and memory usage of our selection method, we simulated multiple copy number matrices with a varying number of gene families (m = 50 to 100,000) and genomes (n = 707 to 50,118) and a maximum copy number of five, following the protocol described in the next section. For each combination of m and n, we selected an increasing number of markers from k = 10 to m − 1 if n < 10,000 or up to k = 10,000 otherwise. All steps were executed on a computer with a Xeon E5-2697 v2 processor, using the time command to record both statistics. The scaling exponent of each line depicting the computational requirements as a function of the number of markers was calculated using the curve_fit function implemented in SciPy v1.11.359. For this, we fitted a power law of the form k^b to the data and took the average exponent across m, where k is the number of markers and b is the scaling exponent (Supplementary Fig. 3).
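As an illustration of the exponent estimation, the sketch below fits a power law with SciPy's curve_fit. Two choices are assumptions rather than details given in the text: a prefactor a is included (cost = a·k^b), and the fit is done on log-transformed values so that the problem becomes linear. The timing values are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_power_law(log_k, log_a, b):
    """log(cost) = log(a) + b * log(k), i.e. cost = a * k**b."""
    return log_a + b * log_k

def fit_scaling_exponent(k, cost):
    """Return the fitted exponent b for cost measured at marker counts k."""
    params, _ = curve_fit(log_power_law, np.log(k), np.log(cost), p0=(0.0, 1.0))
    return float(params[1])

# Synthetic wall clock times generated with a known exponent of 1.5.
k_values = np.array([10, 50, 100, 500, 1000], dtype=float)
synthetic_seconds = 0.002 * k_values ** 1.5
b_hat = fit_scaling_exponent(k_values, synthetic_seconds)
```

Averaging b_hat over the curves obtained for different m would then give a single summary exponent.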

Toy simulations

To assess the impact of parameters p and k, we simulated multiple copy number matrices as follows: (i) Fix the number of gene families (m = 50) and genomes (n = 10). (ii) For each genome, randomly draw a number between zero and one from a uniform distribution to represent the fraction of gene families in the genome. For instance, a value close to zero corresponds to a genome with few gene families and vice versa. (iii) For each genome, generate a presence-absence vector of gene families, where a value of one (i.e., presence) is sampled with a probability equal to the fraction of gene families from step ii. In contrast, a value of zero (i.e., absence) is sampled with a probability of 1 - fraction of gene families. Thus, step iii produces a presence-absence matrix of gene families across genomes. (iv) To add copy numbers over the presence-absence matrix from step iii, we sampled a number between one and a maximum number of copies, set beforehand, for each gene present within a gene family. All steps were implemented using NumPy v1.26.060 while fixing a seed for reproducibility. This procedure was repeated for 25 replicates, yielding 25 different copy number matrices later used to select marker genes. For each replicate, we then performed a parameter sweep for 11 values of k and p ranging from k = 1 to k = m − 1 and from p = −100 to p = 100 (Supplementary Fig. 4a).
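Steps i to iv can be sketched with NumPy as follows. The function name, the default seed, and the uniform sampling of copy numbers between one and the maximum are illustrative assumptions.

```python
import numpy as np

def simulate_copy_number_matrix(m=50, n=10, max_copies=5, seed=0):
    """Simulate an m x n copy number matrix (gene families x genomes)."""
    rng = np.random.default_rng(seed)
    # Step ii: fraction of gene families carried by each genome.
    fractions = rng.uniform(0.0, 1.0, size=n)
    # Step iii: presence (True) with probability equal to the genome's fraction.
    presence = rng.random((m, n)) < fractions
    # Step iv: copy number between 1 and max_copies for every present gene.
    copies = rng.integers(1, max_copies + 1, size=(m, n))
    return presence * copies

toy_A = simulate_copy_number_matrix()
```

Calling the function with 25 different seeds would give 25 replicate matrices.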

To generate a species tree from the copy number matrix, we first calculated the Jaccard distance for each pair of genomes to obtain a distance matrix. We then applied neighbor-joining implemented in Scikit-bio v0.6.2 over the distance matrix to generate a species tree. To obtain gene trees, we took the species tree as a template, and for each gene family, we removed the genomes that did not contain the gene family. Hence, at this stage, gene trees contain only a single copy of each genome. We then added gene copies as sister branches of a given genome. Thus, for example, if a gene family is present in four out of ten genomes with copy numbers ranging from one to five (e.g., 1, 3, 5, 2), the resulting gene tree will have 11 leaves in total from four genomes, representing all the copies that stem from duplication or transfer events. Because we do not infer gene trees from sequences, we introduced noise into gene trees to simulate uncertainties that may arise during multiple sequence alignment and impact the gene tree inference process. Specifically, noise here refers to the proportion of branches that are randomly shuffled. Scikit-bio v0.6.2 was used to manipulate trees.
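The distance step and the leaf count in the example above can be checked with a short sketch. It assumes that the Jaccard distance is computed on the presence-absence pattern of gene families (the copy numbers are binarized first). The neighbor-joining and tree manipulation steps are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def jaccard_distance_matrix(A):
    """Pairwise Jaccard distances between genomes (columns of A)."""
    presence = (np.asarray(A) > 0).T  # genomes as rows, gene families as columns
    return squareform(pdist(presence, metric="jaccard"))

def gene_tree_leaf_count(copy_numbers):
    """Leaves of a gene tree: one per gene copy across the genomes carrying the family."""
    return int(np.sum(copy_numbers))

# Two genomes sharing one of three gene families: distance = 1 - 1/3.
D = jaccard_distance_matrix([[1, 3],
                             [0, 2],
                             [5, 0]])
n_leaves = gene_tree_leaf_count([1, 3, 5, 2])
```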

Realistic simulations

Because the toy simulations are unrealistic, we simulated realistic prokaryotic gene families using Zombi39. We first generated a species tree under a birth-death model for 500 units of time with speciation and extinction rates set to 0.04 and 0.03 per unit of time, respectively, as suggested by Louca et al.61. We then simulated genomes along the branches of the species tree using 10,755 bacterial and 5577 archaeal gene family-wise values of duplication (D), transfer (T), and loss (L) benchmarked in ref. 38 while leaving the rest of the parameters as default. At the end of the simulation, Zombi outputs the real species tree and the genomes of each taxon along with the real gene trees of each gene family. From the genomes, we built the copy number matrix that was later used to select marker genes. We repeated the simulations for 25 replicates while fixing a seed for reproducibility. For each replicate, we then performed a parameter sweep for 11 values of k and p ranging from k = 1 to k = m − 1 and from p = −100 to p = 100. We also introduced noise to the gene trees to simulate uncertainties that may arise during multiple sequence alignment.

Web of Life 2 and Earth Microbiome Project datasets

The Web of Life 2 (WoL2) contains 15,953 prokaryotic species with a single representative high-quality genome and taxonomic annotations from the Genome Taxonomy Database (GTDB) R207, as well as a reference phylogeny62. The WoL2 dataset is publicly available at https://ftp.microbio.me/pub/wol2/. To limit the computational burden, we selected complete reference genomes from the WoL2, ensuring that each taxonomic group, from phylum to family, had at least 10 taxa. This resulted in 1510 genomes spanning the entire microbial diversity, which were used for genome annotation and marker selection.

The Earth Microbiome Project (EMP)44 contains 811 high-quality Metagenome-Assembled Genomes (MAGs) from 32 environments across the globe, which are publicly available at https://www.globus.org/, endpoint emp_500_public. After genome annotation (see below), we inspected whether the number of Open Reading Frames (ORFs) matched the number of gene families. We discarded 18 poorly annotated MAGs with a low number of gene families compared to the number of ORFs. This resulted in 793 MAGs that were used for all subsequent analyses.

Genome annotation

To generate the copy number matrix for marker gene selection, we annotated the genomes and MAGs. First, we used Prodigal v2.6.363, in single-genome mode with the genetic code table specified according to taxonomy, to predict ORFs for genomes of the WoL2. ORFs from MAGs of the EMP dataset were predicted with Prokka v1.14.664. We then used the KEGG Orthology release 102.0+51 and evolutionary genealogy of genes: Non-supervised Orthologous Groups (EggNOG) v5.052 databases to annotate the ORFs into gene families with KOfamScan and EggNOG-mapper v265, respectively. In the KEGG annotation, we selected only prokaryotic KEGG Orthologs (KOs) with an e-value lower than the threshold defined in the database66, thus minimizing false positive assignments. In the EggNOG annotation, we only selected the assignments at the broadest taxonomic level since we are interested in inferring a tree for diverse microbial species. The annotation of ORFs into gene families was then used to generate a copy number matrix Am×n with m gene families and n genomes.

We used the hierarchical classification from KEGG to map KOs to higher functional categories. For this, we counted the number of times a given gene is mapped onto a functional category. The counts in each category were then normalized by the total number of counts and multiplied by 100 to obtain a relative abundance. EggNOG only provides a functional description for each gene family. Therefore, we could not map its gene families to higher functional categories.

In addition, we identified Mobile Genetic Elements from KEGG and EggNOG gene families using the following description terms: baseplate, capsid, excisionase, DUF4102, pf00665, KilA-N, ORF11CD3, phage, portal, tail, terminase, tape, T5orf172, viral, virion, conjugal, conjugation, conjugative, DotD, IV secretory, IV secretion, MobA, mobilization, MobL, DUF955, plasmid, relaxase, TcpE, TraG, TraL, TraM, DDE, pf01609, IS66, IstB, transposase, transposon, transposition, anti-restriction, antirestriction, integrase, integration, K02238, K02242, K02243, K02244, K02245, K02246, K12296, K04096, K06198, K07343, as suggested in ref. 67.

Controlling copy numbers

Since each ORF mapped to a gene family by KOfamScan or EggNOG-mapper comes with summary statistics, we used the bit score to control the number of copies of a gene family present within a genome. We used the bit score rather than the e-value, as the former is independent of the database size. For every genome, we identified the maximum bit score assigned to each gene family. We then excluded ORFs whose bit score fell below a threshold defined relative to that maximum bit score. For example, a threshold of one will only include the best hit of each gene family per genome. In contrast, a threshold of zero will include all the ORFs. Thus, the threshold controls the number of gene copies of each gene family per genome that are included for marker selection.
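One way to read this filter is sketched below. It assumes that the threshold is a fraction of the maximum bit score, so that an ORF is retained when its bit score is at least threshold × maximum for its gene family in its genome. This interpretation, the function name, and the tuple layout of the hits are assumptions for illustration only.

```python
from collections import defaultdict

def filter_hits(hits, threshold):
    """Keep hits whose bit score is >= threshold * best score per (genome, family).

    hits: list of (genome, gene_family, orf_id, bit_score) tuples.
    """
    best = defaultdict(float)
    for genome, family, _, score in hits:
        key = (genome, family)
        best[key] = max(best[key], score)
    return [hit for hit in hits if hit[3] >= threshold * best[(hit[0], hit[1])]]

# Three ORFs of one hypothetical gene family in one genome.
example_hits = [
    ("genome1", "K00001", "orf_a", 100.0),
    ("genome1", "K00001", "orf_b", 80.0),
    ("genome1", "K00001", "orf_c", 40.0),
]
best_only = filter_hits(example_hits, threshold=1.0)
all_copies = filter_hits(example_hits, threshold=0.0)
```

Under this reading, intermediate thresholds retain only the copies that score close to the best hit.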

Impact of taxonomic imbalance on tree topology

We first calculated the Gini index at the phylum level to measure the taxonomic imbalance, as suggested in ref. 23. Index values range from 0 to 1, where 0 indicates a perfectly balanced set and 1 indicates a maximally imbalanced set. We observed that the 1510 genomes from the WoL2 belong to 17 bacterial and archaeal phyla, and have a Gini index of 0.59.
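For intuition, the Gini index of a vector of genome counts per phylum can be computed with the standard rank-based formula, as in the sketch below. This is a generic estimator written for illustration; it is not necessarily the estimator used by the Scikit-bio function mentioned in this section, and without a finite-sample correction its maximum is (n − 1)/n rather than 1.

```python
import numpy as np

def gini_index(counts):
    """Gini index of non-negative counts (0 = perfectly balanced)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x)))

balanced = gini_index([5, 5, 5, 5])      # every phylum has the same number of genomes
imbalanced = gini_index([0, 0, 0, 10])   # all genomes belong to one phylum
```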

To generate datasets with a Gini index greater than that of the WoL2 subset tested in the manuscript, we sampled from the entire WoL2 dataset. We set the maximum number of genomes to 500 and the number of phyla to 17 for each dataset. We then sampled the genome abundance of each phylum from a β-distribution with parameters a = 0.1 and b = 1, ensuring highly right-skewed distributions where a few phyla contain the majority of genomes. The minimum genome abundance was set to four because the ASTRAL-Pro2 inference method is based on quartets. For each abundance, we then randomly sampled a phylum and checked whether the phylum had enough genomes to sample. If so, we randomly sampled genomes according to the abundance. Otherwise, we sampled another phylum until one with enough genomes was found. This procedure was repeated for ten replicates while fixing a seed for reproducibility. The generated datasets have an average Gini index of 0.66 ± 0.04, indicating a greater imbalance than the WoL2 dataset, but lower than the EMP dataset, which has a Gini index of 0.73.

For each dataset, we selected an increasing number of markers (k = 50, 100, 200, and 400) with exponent p = 0 from KEGG and EggNOG gene families with TMarSel. We also extracted previously proposed markers for comparison. Species trees were inferred with the pipeline described below. The quality of trees was then assessed as described below. We also performed two-sided two-sample Mann–Whitney U tests with multiple testing correction using the Benjamini–Hochberg method to test for significant differences in quality. Scikit-bio v0.6.2 was used to calculate the Gini index, while SciPy v1.11.3 was used for statistical analyses.
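The statistical comparison can be sketched as follows, using SciPy's mannwhitneyu for the two-sided test. The Benjamini–Hochberg adjustment is written out by hand here only to make the procedure explicit (recent SciPy releases also provide scipy.stats.false_discovery_control). The quality values are made up for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Hypothetical tree-quality scores for two marker sets across replicates.
quality_set_a = [0.10, 0.12, 0.11, 0.13, 0.14]
quality_set_b = [0.20, 0.22, 0.21, 0.23, 0.24]
raw_p = mannwhitneyu(quality_set_a, quality_set_b, alternative="two-sided").pvalue

adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03])
```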

Pipeline for species tree inference

Once markers had been selected, we retrieved all the ORF sequences associated with them. Depending on the experimental condition, we excluded genomes with fewer than k × 0.25 marker genes. Protein sequences of each marker gene were then aligned using UPP268 with default parameters. UPP2 is a multiple-sequence aligner designed for large datasets containing sequence length heterogeneity that might arise from large insertion or deletion events or from incomplete assembly69,70. In the first stage, UPP2 selects a set of full-length sequences and computes a backbone alignment and an unrooted tree. A collection of profile Hidden Markov Models (pHMMs) is then built for each subset of sequences in the backbone tree, and the remaining sequences are inserted into the backbone alignment. We chose the optimal number of backbone sequences using the backbone query split algorithm from UPP2. Data pipeline errors, such as sequencing, assembly, genome annotation, or alignment errors, substantially impact phylogenetic reconstruction as they increase the noise in the dataset. We used TAPER71 with default parameters to remove such errors within alignments. TAPER is an outlier detection algorithm that removes amino acids based on a divergence score computed along genomic positions and species71.
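The genome exclusion rule can be expressed on the rows of the copy number matrix that correspond to the selected markers. The sketch below is illustrative: the function name is hypothetical, and a marker is counted as present in a genome when its copy number is greater than zero.

```python
import numpy as np

def genomes_to_keep(A_markers, min_fraction=0.25):
    """Boolean mask of genomes carrying at least min_fraction * k of the k markers.

    A_markers: k x n copy number matrix restricted to the selected markers.
    """
    A_markers = np.asarray(A_markers)
    k = A_markers.shape[0]
    markers_per_genome = (A_markers > 0).sum(axis=0)
    return markers_per_genome >= min_fraction * k

# Four markers, three genomes: the first genome carries none of the markers.
mask = genomes_to_keep([[0, 1, 1],
                        [0, 0, 2],
                        [0, 0, 1],
                        [0, 0, 3]])
```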

We used the clean alignments for Maximum Likelihood tree reconstruction with FastTree v2.172 using the Le-Gascuel (LG) model of amino acid substitution, as a previous study showed that the LG model best explains substitutions in a majority of prokaryotic marker genes62. This step produced a set of k gene trees. We used TreeShrink73 to remove pipeline errors that have escaped detection at the sequence level and resurfaced as suspiciously long branches in the phylogeny. TreeShrink removes leaves that inflate the tree diameter, defined as the maximum distance between any two leaves in the tree73. In our case, a leaf corresponds to a protein sequence of a gene within a genome.

We then used ASTRAL-Pro236 to estimate a species tree from the set of gene trees. ASTRAL-Pro2 combines information from gene trees to maximize a measure of quartet similarity between them and the species tree35. The quartet similarity measure is defined as the fraction of partitions between all combinations of four species, with a common ancestor originating from a speciation event that shares the same branching structure (i.e., topology) as the real species tree under gene duplication and loss scenarios35. Because only speciation events count toward the measure, ASTRAL-Pro2 identifies orthologs from gene trees containing all the homologs of a gene family35,36. This flexibility is ideal because the genome annotation step produces gene families that contain orthologs, paralogs, and xenologs.

Sets of previous markers

To benchmark the usefulness of our marker genes for inferring phylogenetic trees, we compiled four datasets of marker genes previously proposed and used for deep microbial phylogenomics: (i) 41 single-copy universal marker genes tested for their phylogenetic signal in multidomain phylogenetic reconstruction23, and (ii) 57 single-copy universal markers, which were used to estimate the age of the Last Universal Common Ancestor1. In the case of the 41 universal markers from Martinez-Gutierrez, the pHMMs of each gene were already provided, whereas, for the 57 universal markers from Moody, we built the pHMMs from the multiple sequence alignments with hmmbuild from HMMER v3.474. We then used hmmsearch to search for homologs of the markers in the genomes of the WoL2 and MAGs from the EMP dataset. For each gene, only the ORF with the top bit score was extracted and used in the pipeline for species tree inference. The other marker sets we benchmarked are (iii) 136 universal markers from AMPHORA230 available in PhyloPhlAn341, and (iv) 400 single-copy universal markers first proposed in refs. 24,75 and part of PhyloPhlAn3. To identify homologs of these last two sets of universal genes in the WoL2 and EMP datasets, we ran PhyloPhlAn3 v3.1.68 with high diversity and fast parameters. We then took the identified ORFs and used them in the pipeline for species tree inference. This approach ensures that species trees from all marker genes were inferred using the same pipeline, thus guaranteeing a fair comparison.

Quality of species trees

To evaluate the quality of species trees inferred with our marker genes, (i) we calculated the Robinson-Foulds (RF) distance between the inferred tree and the WoL2 reference phylogeny. The RF distance counts the bipartitions (i.e., the ways of dividing the set of taxa in two by removing a branch) that are present in one tree but not in the other37. We further normalized the RF distance by the total number of internal branches in the two trees to obtain a number between zero and one, where zero indicates two identical topologies and one indicates that the trees share no bipartition. DendroPy v4.6.176 was used for all tree distance calculations. (ii) We also measured the quality of the inferred trees using the Local Posterior Probability (LPP) scores from ASTRAL-Pro2, which measure the confidence of each branch based on gene tree quartet frequencies. An LPP higher than 0.95 was used to classify a branch as highly confident, as suggested in ref. 40.
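The normalized RF distance can be illustrated without any tree library by representing each tree as nested tuples of leaf names. The sketch below is a didactic reimplementation on unrooted topologies, not the DendroPy routine used for the analyses; the function names and the tuple encoding are assumptions.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    clades = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        clades.add(below)
        return below

    all_leaves = leaves_below(tree)
    canonical = set()
    for side in clades:
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            # Store each split once, represented by its smaller side.
            canonical.add(min(side, other, key=lambda s: (len(s), sorted(s))))
    return canonical

def normalized_rf(tree_a, tree_b):
    """RF distance divided by the total number of internal branches of both trees."""
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0

reference = ((("A", "B"), "C"), ("D", "E"))
one_swap = ((("A", "C"), "B"), ("D", "E"))
unrelated = ((("A", "D"), "B"), ("C", "E"))
```

Here normalized_rf(reference, one_swap) counts one differing bipartition per tree out of two internal branches each.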

(iii) We gauged the taxonomic consistency of clades in a phylogenetic tree using the taxonomic accuracy metrics from ref. 75 and the standard microbial GTDB taxonomy R207. Taxonomic precision captures the notion that phylogenetically closer species must share a common taxonomic label. The precision of a clade is calculated as \(\frac{L_{C}}{L_{\mathrm{LCA}}}\), where \(L_{C}\) is the sum of branch lengths of the members of the clade (C) and \(L_{\mathrm{LCA}}\) is the sum of branch lengths of all the descendants of the Lowest Common Ancestor (LCA) of the clade. Thus, if all the members of the clade form a monophyletic subtree, the precision is one (highest possible). If any member of the clade is scattered, the denominator grows faster than the numerator, thereby reducing the precision. Taxonomic recall, on the other hand, captures whether taxonomically similar taxa are grouped close in the phylogeny. The recall of a clade is calculated as \(\frac{\left|\{c \in C : d(c, \mathrm{LCTS}) \le D_{\mathrm{LCTS}}\}\right|}{|C|}\), where d is the distance between a member c of the clade and the Longest Consistent Taxonomic Subtree (LCTS) and \(D_{\mathrm{LCTS}}\) is the diameter (i.e., the branch length separating the most distant leaves) of the LCTS. The recall is, therefore, the fraction of clade members that lie within one LCTS diameter of the LCTS, and it decreases as more members are placed far from the LCTS. The LCTS is defined as the internal node with the largest number of children that are part of the clade and are monophyletic themselves.

(iv) In addition, we assessed the placement of each taxon from an inferred tree relative to the reference tree. We first calculated a distance matrix from the number of edges (i.e., internal branches) connecting every pair of taxa in the inferred and reference trees. Since each row in the distance matrix represents a vector containing the distance of a taxon to the other taxa in the tree, we calculated the cosine distance between the vectors of the same taxa in inferred and reference trees. Thus, if a taxon has the same placement in the inferred tree as in the reference tree, the distance vectors have the same direction, resulting in a cosine distance of zero. We expect well-placed taxa to have lower cosine distances to the reference tree. To assess the overall trend of placement, we applied a least squares regression implemented in SciPy v1.11.3.
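The per-taxon comparison can be sketched with NumPy, taking the two edge-count distance matrices as given. The function name is hypothetical, and the sketch assumes that rows and columns follow the same taxon order in both matrices.

```python
import numpy as np

def placement_cosine_distances(D_inferred, D_reference):
    """Cosine distance between matching rows of two taxon-by-taxon distance matrices."""
    X = np.asarray(D_inferred, dtype=float)
    Y = np.asarray(D_reference, dtype=float)
    dot = np.sum(X * Y, axis=1)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return 1.0 - dot / norms

# Hypothetical edge counts between three taxa in the reference and inferred trees.
D_ref = [[0, 2, 3],
         [2, 0, 3],
         [3, 3, 0]]
D_inf = [[0, 3, 2],
         [3, 0, 3],
         [2, 3, 0]]
same_placement = placement_cosine_distances(D_ref, D_ref)
changed_placement = placement_cosine_distances(D_inf, D_ref)
```

Rows with identical distance profiles give a cosine distance of zero, and rows whose profiles differ give positive values.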

Simulations of incomplete MAGs

To simulate genome incompleteness, we randomly sampled contigs and sequentially selected ORFs in each contig until satisfying an incompleteness threshold, defined as the proportion of the total number of ORFs in the MAG to be removed. For instance, a threshold of 0.1 in a MAG containing 1000 ORFs will result in 100 ORFs removed. Once all the ORFs to be removed had been selected, we removed them from the genome annotation files that relate ORFs to gene families produced by KOfamScan and EggNOG-mapper. We then used the remaining gene families to build the copy number matrix for marker gene selection. For each incompleteness threshold, we repeated the simulation for ten replicates while fixing a seed for reproducibility. In each replicate, different contigs and, consequently, different ORFs were sampled. Thus, our simulations yield MAGs with different genomic compositions.
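A minimal sketch of the ORF removal step is given below. It assumes that contigs are visited in a random order and that ORFs are taken in their order along each contig until the target number is reached; the function name, the input layout, and the rounding of the target are illustrative assumptions.

```python
import random

def orfs_to_remove(contigs, threshold, seed=0):
    """Pick ORFs to delete from a MAG until a proportion `threshold` is reached.

    contigs: dict mapping contig id -> list of ORF ids in their order on the contig.
    """
    rng = random.Random(seed)
    total = sum(len(orfs) for orfs in contigs.values())
    target = int(round(threshold * total))
    order = list(contigs)
    rng.shuffle(order)  # random contig order; ORFs are then taken sequentially
    removed = []
    for contig in order:
        for orf in contigs[contig]:
            if len(removed) == target:
                return removed
            removed.append(orf)
    return removed

# Hypothetical MAG with 10 contigs of 100 ORFs each (1000 ORFs in total).
toy_mag = {f"contig{i}": [f"contig{i}_orf{j}" for j in range(100)] for i in range(10)}
removed_orfs = orfs_to_remove(toy_mag, threshold=0.1)
```

Changing the seed changes which contigs are drawn first and, therefore, which ORFs are removed.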

For each replicate, we selected markers (k = 400) from the incomplete set of MAGs and inferred a species tree. We also inferred trees using the 400 universal markers from PhyloPhlAn3 for comparison. The trees were then evaluated for the proportion of confident branches. In addition, we calculated the normalized RF distance (nRF) between trees inferred from incomplete MAGs and trees inferred from complete MAGs. To assess whether the differences in quality between our markers and PhyloPhlAn3 markers were statistically significant, we performed Mann–Whitney U tests and corrected the p-values with the Benjamini–Hochberg method implemented in SciPy v1.11.3.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.