Abstract
Coexisting strains of the same species within metagenomic data pose a substantial challenge to inferring transmission of pathogenic and commensal microbes. Here we present TRAnsmision Clustering of Strains (TRACS), a highly accurate algorithm for estimating genetic distances between strains at the level of individual single nucleotide polymorphisms, which is robust to intra-species diversity within the host. Analysis of faecal microbiota transplantation datasets and extensive simulations demonstrates that TRACS outperforms existing methods. We use TRACS to infer transmission networks in patients colonized with multiple strains, including severe acute respiratory syndrome coronavirus 2 amplicon sequencing data, deep population sequencing data of Streptococcus pneumoniae and single-cell genome sequencing data from patients infected with Plasmodium falciparum. Applying TRACS to gut metagenomic samples from a mother–infant cohort revealed species-specific transmission rates and identified increased the persistence of Bifidobacterium breve in infants, a finding previously missed owing to the presence of multiple strains. Our study shows that TRACS can be used across microbial kingdoms to uncover strain dynamics.
Similar content being viewed by others
Main
Host-to-host transmission is a fundamental process that shapes the interactions between humans and microbes. Tracking the spread of pathogens using genomics has become a major tool in public health, helping to prevent the spread of disease at both local and global scales1,2. Beyond pathogens, understanding the transmission and colonization dynamics of commensal microbes would greatly improve our understanding of microbiome assembly and maintenance, and how this is influenced by diet, lifestyle, culture, clinical interventions and social interactions. In addition, the human microbiome contains microbes with therapeutic potential to treat a variety of human diseases. The ability to identify candidate strains during clinical interventions such as faecal microbiota transplantation (FMTs) and live biotherapeutic products would greatly de-risk and accelerate the development of microbiome-based therapeutics.
Whole-genome sequencing (WGS) has transformed our ability to infer transmission chains by detecting single nucleotide polymorphisms (SNPs), enabling precise tracking of slowly evolving pathogens such as methicillin-resistant Staphylococcus aureus2. However, most WGS analyses focus on a representative genome from a single species, overlooking within-host microbial strain diversity3. Recent metagenomic approaches address this by enabling simultaneous analysis of multiple species and strains4,5. Deep population sequencing of specific species can also be achieved through targeted enrichment methods, such as culture or PCR amplicon sequencing, supporting fine-scale analysis of co-colonizing strains6,7,8. Existing gold-standard methods for tracking transmission using metagenomic or deep population sequencing data have typically been developed for curated academic studies5,9,10,11. While these methods offer valuable insights, such as within-host variant calling and identification of selection pressures via dN:dS, they lack the speed and flexibility required for routine public health monitoring. Notably, they often lack the temporal resolution needed to accurately distinguish between strains transmitted recently (weeks to months) and distantly related genomes (years).
Tools that rely on reference marker gene databases including MIDAS11,12 and StrainPhlAn9 consider only a small portion of a species genome (10–200 genes4) and do not attempt to separate within-species diversity. This substantially limits the temporal resolution at which transmission can be inferred. An alternative approach, used by StrainGE10 and inStrain (database mode)5, involves identifying the species found across the entire dataset and building a dataset-specific reference genome database for read alignment. This approach relies heavily on the similarity between the transmitted and reference genome and does not allow for the continuous integration of new samples into an analysis, making it unsuitable for routine genomic surveillance.
Another common approach relies on de novo metagenomic assembly, including the inStrain (assembly mode)5 and STRONG13 pipelines. Assembly requires high sequencing coverage of the transmitted genome10. These methods generally perform best when samples are pooled before assembly, followed by genome binning in a co-assembly workflow. However, to avoid extensive genome deduplication, which effectively turns the process into a reference-based approach, they must be applied to pairs or reduced subsets of samples. This can substantially increase the computational burden. Recombination and shared homology between strains of the same species within a sample is also generally not accounted for by existing algorithms, which can have a major impact on the accuracy of transmission inference when considering metagenomic and population sequencing data. To address these issues, we developed TRAnsmision Clustering of Strains (TRACS), a highly accurate and easy-to-use algorithm for establishing whether two samples are plausibly related by a recent transmission event.
The TRACS algorithm distinguishes the transmission of closely related strains by identifying genetic differences as small as a few SNPs, which is crucial when considering slow-evolving pathogens. The algorithm employs statistical filtering techniques to account for variable sequence coverage, shared homology between strains, and sequencing errors. Critically, TRACS was designed to estimate an accurate but conservative (lower bound) of the SNP distance and considers each reference alignment independently, enabling continuous integration of new samples. This algorithmic approach allows TRACS to function similarly to how SNP distances are used in conventional single-isolate genomic epidemiology studies, making it an ideal tool for accurately identifying putative transmission networks and ruling out transmission events in ongoing public health applications. However, similar to single-isolate studies, discerning fine-scale transmission structure, such as the direction of transmission, typically requires additional epidemiological data, such as contact tracing information.
We demonstrate using comprehensive simulations, and by considering well characterized FMT metagenomic datasets, that TRACS has superior performance to existing metagenomic transmission inference methods and can reliably identify plausible transmission events. We apply TRACS to a diverse set of pathogens including viruses, bacteria and parasites, demonstrating the scalability and versatility of the algorithm.
Results
Algorithm overview
The initial ‘alignment’ stage of TRACS uses the hash-based search algorithm ‘Sourmash’ to identify a set of reference genomes that best represent the species and their strains in a sample14. Unlike alternative approaches, TRACS avoids the error-prone deconvolution of reads into different genome bins and instead aligns the entire read set to each reference genome individually, generating a count of the alleles observed at each site15. This enables the incremental addition of new samples and the integration of additional species into the reference database without requiring the reprocessing existing data.
A set of statistical filtering algorithms are then applied to exclude regions affected by shared sequence homology, multi-mapping reads, poor alignment quality and low sequencing coverage (Fig. 1). This includes a scan statistic, similar to those used in phylogenetics to detect recombination16, that can identify regions with elevated polymorphism rates, often resulting from shared sequence homology between co-colonizing strains or gene duplications. TRACS also includes an empirical Bayes approach to account for regions of a reference genome with insufficient coverage to accurately represent multiple strains present within a sample (Methods). The resulting filtered alignments are converted into reference-based multiple sequence alignments (MSAs) for each reference genome. Finally, TRACS incorporates a fast, International Union of Pure and Applied Chemistry (IUPAC)-aware pairwise SNP distance estimation algorithm.
Left: the reads are aligned to each reference genome identified by Sourmash separately. An empirical Bayes method is used to pinpoint genome regions with insufficient coverage for minority strain identification. Centre: alignments from each sample are transformed into an MSA using IUPAC ambiguity codes to represent multiple alleles at a single site. Rapid pairwise SNP distances are then calculated, excluding potential recombination regions by identifying areas with high SNP density. The TransCluster algorithm can optionally be applied to estimate the expected number of intermediate hosts between two samples. Right: the resulting pairwise transmission distance estimates are clustered using single linkage hierarchical clustering to infer putative transmission clusters. Transmission distance thresholds are inferred using a mixture distribution to separate sample pairs that are known to be distantly related from those that include recent transmissions.
Optionally, TRACS incorporates sampling dates and a known transmission generation time to estimate the expected number of intermediate hosts between samples using an extended version of the TransCluster algorithm17 (Methods). While the alignment module of TRACS is designed for metagenomic and population sequencing data, MSAs produced by alternative tools designed for isolate data, such as Snippy18, can also be used directly as input.
A key challenge in using pairwise distance methods to infer transmission chains is determining an appropriate threshold for when two strains are probably connected by recent transmission. TRACS introduces a method that uses a mixture distribution to distinguish recent transmission from distantly related strains, leveraging both closely related samples (for example, from the same individual) and distantly related samples to derive more specific thresholds than previous approaches based on Youden’s index4,19.
The TRACS algorithm is implemented in python and C++, and is available under the MIT open source licence20.
TRACS shows strong results on simulated pairs
To assess the ability of TRACS to accurately estimate small genetic distances in pairs of samples containing multiple strains of the same species, we simulated WGS reads from mixtures of common Streptococcus pneumoniae strains. Common transmission SNP thresholds for S. pneumoniae are often <10 SNPs21. In each pair we selected one genome to be shared between samples at a specified SNP distance (5, 50 and 500 SNPs) ensuring a minimum average sequencing coverage of 5× for the transmitted strain. The TRACS algorithm was compared with InStrain, StrainGE and StrainPhlAn.
All algorithms except TRACS substantially overestimated genetic distances (Fig. 2a). This bias was larger among low-frequency strains (Extended Data Fig. 1), limiting the ability of these algorithms to reliably rule out recent transmission events, as commonly used SNP distance thresholds for many species fall well below the resulting inflated estimates21. StrainGE was the best-performing algorithm after TRACS, probably owing to its ability to account for minority strains and its use of competitive mapping among genomes within a species. This approach helps mitigate the effects of horizontal gene transfer, as demonstrated in the original StrainGE publication10. However, StrainGE only reported relatively accurate SNP distances at higher values (500 SNPs), which exceeds common thresholds (typically <50 SNPs) used to distinguish transmission in S. pneumoniae22. To determine which filter in the TRACS algorithm had the greatest impact, we evaluated combinations of the statistical filtering methods implemented in TRACS (Extended Data Fig. 2). The coverage filter, including the empirical Bayes algorithm, produced the largest improvement, followed by the recombination filter. Repeating these simulations with in silico Nanopore R10.4.1 reads further demonstrated TRACS’ compatibility with multiple high-accuracy sequencing technologies (Extended Data Fig. 3).
a, The relative error (destimated − dsimulated)/dsimulated, in the estimated pairwise SNP distance across four algorithms, when applied to a simulated mixture of pneumococcal genomes. A single genome is simulated to have been transmitted between each pair of samples with an SNP distance given on the x axis. Ten replicate simulations were performed for each parameter set with box plots indicating the distribution in relative error rates. Points closest to zero are the best performing, with positive values indicating an overestimation of the SNP distance and negative values indicating an underestimation. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR. b, The results of running each algorithm on artificial laboratory mixtures of pneumococcal strains from Knight et al.23. Ten of the resulting pairs of samples included at least one identical strain and thus the correct SNP distance should be zero. Except for TRACS, all algorithms erroneously estimate elevated SNP distances. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR. c, The simulated transmission of individual species was modelled between synthetic gut metagenome samples. Genomes were selected to have average nucleotide identities (ANI) of 100%, 99% and 97% relative to the GTDB representative genomes, which serve as the reference database for the InStrain, StrainGE and TRACS algorithms. StrainPhlan was run using the marker gene database (v4.0.5). Similar to the pneumococcal simulation, each of the ten species was simulated independently at three different SNP distance thresholds: 5, 50 and 500 SNPs. Only a single strain per species was included in each simulation. In addition to running InStrain in reference-based mode using GTDB representative genomes, InStrain was also run with custom reference genomes (InStrain assembly) generated for each sample pair using metaSPAdes v4.2.0. Contigs were ‘binned’ into species by aligning the assemblies back to the simulated genomes. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR.
To validate these results using real sequencing data with a known ground truth, we analysed 13 laboratory mixtures of different S. pneumoniae strains from a previous study23. In each case, at least one identical pneumococcal genome was shared between samples, meaning all algorithms should infer zero SNPs. As shown in Fig. 2b, TRACS was the only algorithm that reliably inferred zero SNPs in all cases. StrainPhlAn correctly identified zero SNPs in some cases; however, its reliance on marker genes can lead it to systematically underestimate SNP distances, as shown in the subsequent analysis of FMT data.
To further explore the reconstruction of transmission chains from metagenomic data involving multiple distinct species, we simulated genome mixtures representing common gut bacteria at varying levels of similarity to the reference database: 100%, 99% and 97% pairwise sequence identity (Methods). In these simulations, a single species identified in a previous study of transmission using metagenomics was simulated to transmit with a specified SNP distance24. A common reference database was used for all methods, with the exception of StrainPhlAn where the default marker gene database v4.0.5 was used. InStrain consistently overestimated SNP distances for most simulated pairs once the transmitted genome diverged by at least 1% from the reference database (Fig. 2c). Alternative strategies that generate study-specific reference databases, such as co-assembly of all samples or individual sample assembly followed by deduplication, could potentially improve these estimates. However, because inStrain relies on competitive read mapping, it is recommended that its reference databases be deduplicated so that any pair of genomes shares no more than 98% sequence identity. Consequently, in studies where multiple strains of the same species are present (a common occurrence in large datasets), strains that diverge from the assembled and/or deduplicated reference genome would still produce errors similar to those shown in Fig. 2c.
As an alternative, we explored generating sample-pair-specific reference genomes. In this mode (‘inStrain assembly’), metaSPAdes was used to assemble each pair of samples independently, producing a unique reference database for each pair (Fig. 2c). While this approach improved SNP distance estimates, its computational cost is prohibitive for larger studies. Excluding the computational cost of assembly, using pair-specific references requires the inStrain alignment and variant-calling steps to be run separately for each pair. This step alone required approximately 12.5 central processing unit (CPU) hours per pair (Extended Data Fig. 4). For a modest dataset of 100 samples, this would translate to 4,950 separate alignment steps and more than 2.5k CPU hours. Furthermore, if multiple strains of the same species were present within a given pair, this approach might still fail to assemble the transmitted strain as the reference genome.
Enhanced transmission estimates across diverse taxa
A major advantage of the TRACS algorithm is its ability to identify putative transmission events in cases where the hosts are colonized with multiple strains of the same species. This is a major concern in many high disease burden settings. Unlike alternative algorithms such as StrainPhlAn, the TRACS algorithm is applicable across diverse taxonomic groups including parasitic, viral and bacterial species. To demonstrate the effectiveness of the TRACS algorithm across a wide range of pathogens we considered well characterized datasets across three different settings.
SARS-CoV-2
Although infection with multiple severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains is relatively rare25, the rates of occurrence can be higher in locations with high disease burdens, such as within hospital wards, particularly when infection control procedures have broken down.
To demonstrate the ability of the TRACS algorithm to account for this challenges, we considered 37 of 1,181 SARS-CoV-2 samples, collected from the East of England in early 2020, which were processed using Illumina deep amplicon sequencing26. These samples were found to contain multiple distinct strains after sequencing each sample in replicate to account for sequencing errors27. These were compared with samples containing a single strain within the mixture. Thus, the smallest SNP distance between samples should be zero. To investigate the utility of applying TRACS to these samples, we compared the inferred SNP distance using TRACS with a consensus-based approach, with and without filtering for problematic sites such as hypermutable loci and regions impacted by the ends of amplicon sequencing reads28 (Fig. 3a).
a, A SARS-CoV-2 transmission network comprising samples that contain multiple distinct strains. Solid lines indicate transmission events observable using both TRACS and the typical consensus-based sequencing, while dashed lines represent additional links identified by TRACS alone. b, The inferred SNP distance between 21 pairs of 39 samples containing the same strain, where at least 1 sample includes multiple distinct strains. The consensus approach considers the dominant allele across the entire SARS-CoV-2 reference genome, whereas the ‘original’ approach excludes sites frequently filtered out to avoid hypervariable or error-prone regions. The TRACS algorithm correctly infers zero SNPs in all pairs. Box plots indicate the distribution of the inferred number of SNPS. The central line represents the median, the edges of the box indicate the IQR and the whiskers extend to 1.5 times the IQR. c, The expected number of intermediate hosts between each pair of 3,761 pneumococcal samples taken from different individuals in the Maela refugee camp versus the geographic distance between their homes within the camp with a LOESS smoothing. The shaded area indicates the corresponding confidence interval. Box plots show the median expected number of intermediate hosts. The edges of each box indicate the IQR, and the whiskers extend to 1.5 times the IQR. Sample pairs with a divergence time outside the establishment of the camp in 1984 were excluded. d, Similar to c, but only distances involving the three most common GPSCs found in 796 samples are shown. A strong geographic signal is absent for the non-multi-drug resistant lineage GPSC 1, which is known to have a longer carriage duration. e,f, The distribution of SNP distances inferred by TRACS for bulk versus single cell (e) and single cell versus single cell (f) samples. The vertical red line indicates the SNP threshold inferred using the TRACS mixture distribution approach. The high number of short SNP distances between single cells from the same sample indicates that TRACS can accurately distinguish closely related genomes within mixed infections of P. falciparum.
When the shared strain was in the minority, a consensus-based approach consistently overestimated the SNP distance between samples despite filtering out problematic regions. By contrast, TRACS correctly inferred 0 SNPs between samples in all cases without requiring manual filtering of problematic sites. Figure 3b indicates the transmission network inferred assuming 0 SNPs between strains for those mixed samples with a geographic location. The dotted edges indicate transmission links that would have been missed using a standard consensus-based framework. This example highlights the ability of TRACS to both account for the presence of multiple strains and to robustly control for the impacts of sequencing errors and hypermutable sites that frequently result in polymorphic variants within samples. TRACS is also likely to be robust to rare cases of recombination between strains within the host. In this case, ignoring polymorphisms could cause consensus methods to overestimate SNP distances, whereas, with sufficient coverage of both parent strains, TRACS would still detect shared strains between samples.
Streptococcus pneumoniae
An added benefit of the TRACS algorithm is its ability to estimate the expected number of intermediate hosts between two samples by incorporating sampling dates. Dates can be used as an additional piece of information to rule out transmission for species or lineages with low sequence diversity17.
To demonstrate this approach, we considered 3,761 nasopharyngeal swabs taken from 468 infants and 145 of their mothers living in a refugee camp in Thailand, where culture-based enrichment and whole pneumococcal population Illumina deep sequencing had been performed as part of a previous study7. TRACS was run on this dataset using a custom database of pneumococcal reference genomes from the Global Pneumococcal Sequencing project29. Consistent with previous studies, we assume a molecular clock rate of 5.3 SNPs per year and a transmission generation time of 2 months7. After running TRACS, the expected number of intermediate hosts between all pairs was compared with the geographic distance separating the households of participants (Fig. 3c,d). Although the original TransCluster algorithm was previously applied to this dataset, it only estimates individual probabilities, such as the probability of direct transmission, without estimating the overall expected number of intermediate hosts.
There was a strong correlation between geographic distance and the expected number of intermediate hosts, despite the small size of the refugee camp (2.4 km2). Interestingly, this correlation was not found for all the major lineages. In particular, common multi-drug-resistant lineages such as Global Pneumococcal Sequencing Cluster (GPSC) 1, had no clear association between transmission and geographic distance within the camp (Fig. 3d). These lineages are known to have longer carriage durations, which can obscure geographic transmission signals30,31. Thus, the observed signals are probably driven by lineages that transmit faster and exhibit shorter carriage durations, such as GPSC 20. Understanding the different dynamics governing the transmission of these lineages is crucial for the design of interventions aimed at reducing pneumococcal disease.
Plasmodium falciparum
The increased genome size of major parasitic pathogens has hampered the adoption of routine WGS in disease surveillance. However, the rapidly decreasing cost of sequencing is leading to the increasing use of WGS to track major parasite populations including the malaria parasite, P. falciparum, which causes over half a million deaths each year32. Multiple strains of P. falciparum are frequently found within people living in endemic areas with high burdens of disease33.
To investigate the ability of the TRACS algorithm to accurately identify shared strains within mixed P. falciparum infections, we considered a dataset involving 49 samples from Chikhwawa, Malawi, that were positive for P. falciparum34. Both Illumina bulk sequencing of mixed populations in addition to single-cell and single-clone enrichment had been performed as part of a previous study leading to 49 mixed WGS samples and 509 single genomes34. Although the sparse sampling makes direct transmission between samples unlikely, the combination of bulk and single-cell sequencing allows us to assess whether the TRACS algorithm can accurately detect shared strains within these mixed samples. Figure 3e,f indicates the distribution of the inferred SNP distances between single-cell genomes and bulk samples from either the same sample or between samples. The clear separation of SNP distances between genomes sequenced from the same sample and between samples indicates that TRACS can accurately identify shared strains within this data. As WGS of malaria parasites becomes more routine in public health settings, the TRACS algorithm can be used to accurately determine if a pair of samples could be related by a recent transmission event. A fine-scale understanding of transmission dynamics in high burden settings will aid in the design of strategies aimed at reducing and eventually eliminating malaria.
Improved engraftment estimates in faecal transplant triads
FMT involves transferring a donor stool sample to a recipient, often via orally-delivered capsules, and has demonstrated patient benefit to treat infections, autoimmune diseases, graft-versus-host diseases and cancers. The development of microbiome drugs often requires the identification of specific bacterial strains with beneficial properties from complex microbiomes containing hundreds to thousands of strains. To further evaluate the efficacy of the TRACS algorithm in metagenomics applications, we analysed an extensively characterized FMT dataset with well-defined transmission (or engraftment) relationships between samples35. The capability of each algorithm to detect engraftment links between samples from donors and recipients was examined by considering 23 previously published FMT triads35. Metagenomic samples were originally collected from donors and recipients before and after transplantation across three patient cohorts, including individuals with Clostridium difficile infections, inflammatory bowel disease and recurrent multi-drug-resistant infections.
The TRACS algorithm was run on each of these samples and the results compared with StrainPhlAn, InStrain and StrainGE (Fig. 4a–c and Extended Data Fig. 5). As StrainGE requires substantial computing resources to consider transmission of every observed genus, we restricted this analysis to bifidobacteria. The inferred number of engraftment events was then inferred, comparing samples from the same patient post-FMT and those from different cohorts (Fig. 4a). Species-specific SNP thresholds were chosen using both the mixture distribution method implemented in TRACS and the Youden index as has been used previously4 (Extended Data Fig. 6). No strain sharing is expected between cohorts, so inferred transmissions across cohorts are probably false positives. By contrast, bacterial strains are expected to persist within a patient’s gut over time, providing a real-world ground truth for method comparison. Strain persistence estimates may still produce false positives at rates similar to those seen between cohorts.
a, The inferred number of plausible transmission pairs (shared strains), inferred by each algorithm, between samples taken as part of study investigating the impact of FMTs. Transmission between samples from unrelated participants (shown in red) are highly likely to be false positives. By contrast, samples from the same participant following FMT (shown in black) are more likely to be true positives, with an error rate proportional to that observed in the unrelated participant results. Species-specific SNP thresholds for identifying recent transmission were inferred using the mixture distribution approach (Methods). An analogous plot using the Youden method, as described by Valles-Colomer et al.4, is given in Extended Data Fig. 6. b, Histograms indicating the distribution of pairwise SNP distances (truncated at 500 SNPs) between donor and recipient (post-transplant) samples for major bifidobacterial species. Vertical black lines indicate the SNP thresholds inferred using the TRACS mixture distribution method. InStrain identified the transmission of B. infantis, which is probably B. longum but mislabelled in the UHGG genome collection database, whereas StrainGE identified no transmission of B. longum. c, An example of multiple strains of B. longum being transmitted between a single donor and multiple recipients was detected exclusively by TRACS. The allele frequencies at polymorphic loci within a segment of the B. longum reference genome are shown for each recipient sample, with colours indicating the two distinct strains. The red strain is dominant in recipient sample SFMT_03_t15 but in the minority in a separate recipient (SFMT_27_t15) who received the same donor stool.
StrainPhlAn had the highest false-positive rate, which increased substantially using the Youden method for selecting SNP thresholds (Fig. 4a and Extended Data Figs. 5 and 6). Consistent with our simulation results, InStrain was the most conservative method, performing well when there is high sequence similarity between the reference and transmitted genomes. However, InStrain frequently overestimated SNP distances in cases of greater sequence divergence from the reference or when multiple strains of the same species were involved, as indicated by our previous simulation-based analysis. In contrast to alternative algorithms, TRACS achieved a high sensitivity, with a low false-positive rate.
To investigate this further, we considered the Bifidobacterium genus, and in particular the transmission of Bifidobacterium longum, a common pioneer colonizer of infant gut microbiota36,37, which is often used as an infant probiotic. Understanding strain persistence and transmission is crucial for developing effective microbiome-based therapeutics. InStrain and StrainPhlAn were run using their default databases, while StrainGE was supplied with a representative set of bifidobacteria reference genomes following the user guide (Methods). TRACS was run using the Genome Taxonomy Database (GTDB) as a reference database38.
TRACS was the most sensitive algorithm at detecting the transmission of B. longum, identifying 31 shared strains, 6 more than the next most sensitive method, StrainPhlAn, which identified 25 (Fig. 4b). InStrain and StrainGE proved to be the most conservative algorithms, while StrainPhlAn identified the highest rate of sharing. This high rate of sharing probably includes many false positives owing to StrainPhlAn’s reliance on a small set of representative genes. Despite having B. longum genomes present in the reference database, StrainGE failed to identify this species amongst the engrafted strains. By contrast, InStrain identified six B. infantis transplanted strains, as B. longum and B. infantis are considered as one species in InStrain’s default database (Unified Human Gastrointestinal Genome, UHGG).
Unlike competing algorithms, TRACS consistently identified cases of engraftment involving multiple strains of B. longum present within a single sample. One example is given in Fig. 4c, where the frequency of the engrafted strains was reversed in two patients who received the same donor stool. TRACS was the only algorithm to correctly identify that these patients shared similar strains.
Strain transmission and persistence in a UK birth cohort
The transmission and subsequent colonization events that drive the development of our gut microbiota can have important implications on our health in childhood and later life. In particular, the impact of perturbations driven by interventions in early childhood, such as caesarean section and antimicrobial treatment, have been associated with health complications including asthma and atopy39,40.
To characterize the transmission and persistence of bacteria at the strain level during early childhood, we analysed faecal samples from 1,288 healthy, full-term infants, sequenced as part of the UK BabyBiome Study36,37. Faecal samples were collected from all babies at least once during the neonatal period (≤1 month), with subsequent sampling from 302 infants, including 29 twins and 1 triplet, during later infancy (8.75 ± 1.98 months). Maternal faecal samples were also taken from 175 mothers, corresponding to 178 neonates. The TRACS algorithm was run on all samples using the GTDB reference database and SNP thresholds were calculated to distinguish recent transmission at the species level using the mixture distribution approach (Methods). Consistent with the low error rate of TRACS and stringent infection control procedures in each hospital, the occurrence of shared strains between unrelated children, both within the same hospital and across different hospitals, was very low (0.78% and 0.73%, respectively) (Fig. 5a). Interestingly, the highest rate of strain sharing occurred between siblings, reflecting their closely shared strain colonization history (from a common reservoir or between each other). Infants born via caesarean section exhibited markedly reduced strain sharing compared with those delivered vaginally (Fig. 5b), aligning with findings from previous studies36,37,41. Bifidobacterium bifidum, Bifidobacterium longum, Phocaeicola vulgatus, Parabacterioides disasonis and Escherichia coli showed some of the largest differences in transmission rates between mothers and infants born vaginally compared with those born via caesarean section.
a, A bar plot indicating the fraction of pairwise relationships that involve a putative recent transmission versus the number of shared strains. SNP distances and species-specific thresholds are inferred using the TRACS algorithm. Relationships between hosts are represented by different colours. The low false-positive rate of transmission is evident, as the vast majority of inter-hospital relationships involve zero transmissions. b, Species-specific transmission rates between 175 mothers and 178 babies. Data are presented as mean estimated transmission probabilities ± 95% confidence intervals. Point estimates and error bars are inferred using a binomial model of transmission if a mother is colonized (Supplementary Methods). Points are coloured by the mode of delivery (as in a), highlighting the species-specific impact of delivery. The total number of possible transmission pairs (mothers colonized) is shown in the accompanying horizontal bar plot. c, The rate of strain persistence by species from birth to 7 days, 21 days and late infancy. Data are presented as mean estimated persistence probabilities ± 95% confidence intervals. Point estimates and error bars are inferred using a binomial model similar to the previous transmission scenario. Persistence rates that differ significantly from day 7 are shown in red (two-sided Chi-squared test). Estimates are displayed only when at least 20 babies, who were originally observed to carry a species, are sampled again at respective time points. d, An example demonstrating the persistence of multiple distinct strains of B. breve in a single infant (ID: B00560) is presented. This example was not identified by StrainPhlAn, which only considers the dominant genotype. The frequencies of 1,000 randomly chosen alleles, found to be at intermediate frequency on day 4, are shown. Initially, the purple strain dominates, but by days 7 and 21, the green strain becomes dominant. e, Plotting of individual allele frequencies of two B. breve strains (from d) relative to a portion of the B. breve reference genome.
In addition to tracking strain transmission between hosts, TRACS can monitor the longitudinal persistence of strains within a single host. This includes observing the strain stability in infants, which was found to vary significantly between species (Fig. 5c). Strains of commonly pathogenic species such as E. coli and E. faecalis were maintained for relatively short periods of time and were generally eliminated or replaced by late infancy. The persistence of the pioneer colonizers in the neonatal gut, including B. longum and B. breve strains37, remained high during the first 3 weeks of infancy but declined markedly in late infancy, probably owing to changes in diet and decreased breastfeeding rates42,43. By contrast, some other maternally transmitted commensal genera, such as B. bifidum and closely related Phocaeicola, Bacteroides and Parabacteroides species, were maintained at higher rates for the entire duration of the study. Critically, the TRACS algorithm identified 40.5% (49/121) additional strain sharing events compared with StrainPhlAn. This indicates instances of repeated co-colonization of infants with multiple strains of B. breve, which were overlooked by algorithms such as StrainPhlAn that consider only the dominant genotype (Extended Data Fig. 7). For instance, Fig. 5d,e illustrates the persistence of two B. breve strains over three time points within a single infant. Initially, the purple strain is dominant with a mean allele frequency of 0.82 (±0.081) but then drops to 0.063 (±0.181) and 0.249 (±0.156), respectively, at later time points. Taken together, these results show that TRACS reliably identifies fine-scale strain transmission and persistence in the human microbiota.
Discussion
TRACS is a versatile, modular algorithm for inferring recent pairwise transmissions from metagenomic and single-species population sequencing. It corrects for sequencing errors, uneven coverage and shared sequence homology between strains, and accounts for multiple strains of the same species within a sample. Using simulations and artificial laboratory mixtures of S. pneumoniae, we show that TRACS has a higher accuracy compared with alternative algorithms.
The scalability and versatility of the TRACS algorithm allows it to be applied to a broad spectrum of pathogens including viruses, bacteria and parasites. This includes contexts where species-specific whole-genome bioinformatics transmission pipelines are yet to be developed. For example, although WGS is not typically used to track the transmission of malaria-causing parasites, we demonstrate using single-cell genomics, that by omitting the complex and error-prone step of deconvolution, the TRACS algorithm effectively identifies shared strains of P. falciparum between samples.
Beyond pathogen surveillance, TRACS can track the colonization and carriage dynamics of the human microbiota. Using metagenomic data from FMT studies, we demonstrate that TRACS identifies instances of minority strain transmission between FMT donors and recipients, as well as longitudinally sampled infants, which are not detected by alternative tools. In addition to confirming that caesarean sections substantially reduce mother-to-baby bacterial transmission, we found species-specific differences in colonization and persistence rates. Notably, E. faecalis and B. cellulosilyticus had higher transmission rates following caesarean delivery. As with other genomic transmission studies, it is essential to consider the broader epidemiology of host interactions. For example, in some cases, both mother and infant may be independently colonized by the same strain from the hospital environment or other unsampled sources, such as family members.
TRACS was developed to identify recent transmission events. However, the algorithm is not intended for identifying high-quality within-host variants, making it unsuitable for detecting selection through dN:dS ratios. Furthermore, as the algorithm provides only a lower bound estimate of transmission or SNP distance, it cannot accurately infer genetic relationships over thousands of SNPs, which represent much longer evolutionary timescales. For these purposes, tools such as inStrain and StrainPhlAn, respectively, are more appropriate. Similar to most minority variant-calling algorithms, TRACS requires sufficient sequencing depth to distinguish true variants from sequencing errors. Consequently, by default, TRACS requires a minimum coverage of 5× coverage to accurately estimate distances between strains. As with other metagenomic transmission inference methods, TRACS does not determine transmission direction. For this, we recommend using TRACS to identify potential transmission events, followed by more computationally intensive phylogenetic methods that incorporate additional epidemiological information8,44.
Metagenomics and deep population sequencing provide a powerful alternative to relying on single reference genomes for tracking microbial transmission and persistence. TRACS enables simultaneous analysis of multiple strains and species, offering insights into how microbes spread, establish and persist within their hosts.
Methods
TRACS algorithm
The TRACS algorithm consists of three central modules that can be used independently (Fig. 1). The ‘alignment’ module generates a polymorphism-aware alignment against each reference genome found within a given sample. Given either a multi-sample reference-based alignment or a traditional MSA, the ‘distance’ module estimates pairwise SNP distances. Optionally, the distance module can also estimate the expected number of intermediate hosts separating pairs of samples using an extended version of the TransCluster algorithm17. The TransCluster algorithm requires estimates of the transmission generation time (transmissions per year) and clock rate (SNPs per genome per year) of the species to be provided. Finally, the ‘cluster’ module estimates putative transmission clusters using single-linkage hierarchical clustering.
Alignment
The alignment module takes raw sequence data as input and produces alignments to multiple reference genomes in FASTA format. IUPAC ambiguity codes are used to represent polymorphic sites in each alignment, enabling the estimation of SNP distances between populations. The alignment module supports a variety of database formats tailored to specific applications. Users can provide a single reference, utilize a Sourmash database that incorporates Genbank genome IDs or opt for a custom TRACS database. The latter includes both reference genomes and a corresponding Sourmash database, which can be constructed using the TRACS ‘build-db’ command.
If a database containing multiple references is provided, the algorithm will first select those represented in a sample using the Sourmash gather command14. Minimap245 is then used to align all reads within a sample to each reference independently and a pileup of alleles identified at each site in the reference is generated using htsbox46.
To control for variable sequencing coverage, we developed an empirical Bayes method. This is applied to each genome alignment independently. The counts of minority variants at each site within a single reference alignment are modelled jointly using a multinomial Dirichlet distribution. This allows for the estimation of the posterior frequency of variants across an individual alignment by considering the average frequency of minority variants at polymorphic sites as follows. Assume ni1 is the read count of the most frequent allele at polymorphic site i in an alignment with ni2, ni3 and ni4 being the counts of the 2nd, 3rd and 4th most frequent allele, respectively. Our goal is to estimate a prior on the expected frequency of each allele at a given site by considering the observed distribution of allele frequencies across all sites in the dataset.
We assume a Dirichlet prior on the joint frequency (pi = (pi1, pi2, pi3, pi4)) of the counts ni, with parameter vector α such that the probability density can be written as
Assuming the resulting read counts at each site are drawn from a multinomial distribution, the marginal likelihood is the Dirichlet-multinomial distribution given by
where D indicates the vector of read counts at each site i: D = {x1, …, xN}, where xi = {ni1, ni2, ni3, ni4}.
Point estimates of the α parameters were made by maximizing the marginal likelihood using the fixed point iteration method47. The maximum a posteriori estimate of the frequency of each allele under this prior at each site can then be estimated as
After estimating posterior frequencies at all variable sites in an alignment, TRACS applies a frequency cut-off which requires a coverage of at least five reads by default. Sites falling below this threshold are then excluded from further consideration in subsequent estimates of SNP distances. Furthermore, to control for major outliers in sequencing coverage, regions with coverage exceeding 1.5 times the interquartile range (IQR) above or below the quartiles, as identified by Tukey’s method, are excluded48.
Distance
Pairwise SNP distances are estimated using a bitset-based algorithm to enhance both the speed and memory efficiency of the process, whilst incorporating IUPAC ambiguity codes. Each sequence in an MSA is represented as a matrix of bits (0s and 1s), where rows indicate the four possible alleles and columns indicate the position in the sequence. Sites where multiple different alleles are observed are then represented by multiple 1s within a single column. A series of bitwise ‘AND’ operations are used per allele, and the results are combined with ‘OR’ operations to rapidly calculate the sites which share common alleles between each pair of samples.
To account for the impacts of recombination and shared homology between strains, we include an optional additional step to identify regions within the pairwise comparison that have an elevated rate of SNPs. This uses a similar scan statistic to that used in popular recombination-aware phylogenetic reconstruction algorithms16. While this step may exclude sites that could be informative for transmission, SNPs introduced by recombination are rarely incorporated into public health genomic pipelines owing to the challenges they pose for phylodynamic analyses. In addition, the noise introduced by cross-mapping from shared homology is likely to outweigh any potential benefits of retaining these sites.
Initially, the rate of SNPs within the alignment is calculated as \(r=\frac{d}{L}\), where d is the total number of SNPs and L is the alignment length. For each pairwise comparison, a window size wl is chosen such that the expected number of SNPs within the window is equal to one: E(n∣p, wl) = 1. The maximum window size is set to 10 kilobases. At each SNP location, i, the number of SNPs, si, within the window centred at i is calculated. Assuming a binomial distribution, the probability of observing at least si SNPs is given by
A P value threshold is then specified to exclude SNPs that are centred within a window that has a significantly high rate of SNPs (P = 0.05 by default). A Bonferroni correction is made to control for multiple testing49. For SNPs that fall within a distance wl/2 of either edge of the alignment, wl is truncated.
Modified TransCluster algorithm
While SNP distances can be useful in distinguishing recent transmission events, they do not incorporate the sampling times of the hosts being considered. To incorporate the sampling times, TRACS includes an extended version of the TransCluster algorithm that both improves its speed and allows for the estimation of the expected number of intermediate hosts separating two samples.
Assuming SNPs accumulate at rate γ and an average epidemiological generation time (the time between transmission events) β, the TransCluster algorithm can be used to estimate the probability of k intermediate hosts occuring along the transmission chain between a pair of sampled hosts17. Given an SNP distance of N and a time separation of δ between the two samples, Stimson et al.17 showed that the distribution of k intermediate hosts could be written as
where h is the evolutionary time separating two samples. In the original implementation of TransCluster, this expression was computed via numerical integration. This method can become computationally prohibitive when considering thousands of pairwise comparisons. We show that this equation can be re-written as a finite sum that can be solved rapidly using efficient caching of intermediate values as follows:
A full derivation is given in Supplementary Methods. Given this derivation, the expected number of intermediate hosts can be written as
While this is an infinite sum, an upper bound on the error after truncating at a given k can be calculated (Supplementary Methods), allowing for the sum to be calculated to a user-specified precision. Again, TRACS makes use of efficient caching strategies to reduce the number of times this equation is calculated.
Clustering
Consistent with other transmission inference pipelines, TRACS implements single-linkage hierarchical clustering to identify putative transmission clusters. The cluster module takes a list of pairwise distance estimates output from the distance module and a user-supplied SNP or transmission distance threshold and outputs a CSV file indicating which genome belongs to each cluster. Guidance on the selection of an appropriate threshold is given below. Single linkage clustering is preferable as it allows for long transmission chains to be grouped into a single cluster.
Inferring a transmission threshold
To determine an appropriate SNP threshold for distinguishing recent transmission, we assume that it is possible to provide two sets of samples where the first includes possible transmission events and the second is unlikely to include any recent transmission. By fitting a mixture distribution to these samples, we can calculate a suitable threshold.
We assume that SNP distances between pairs of samples that are unlikely to be related by recent transmission (du) are distributed according to a negative binomial distribution with parameters n and p such that
Moreover, we assume that the distribution of SNP distances between pairs of SNPs related by recent transmission (dr) follows a Poisson distribution with mean λ
Thus, the distribution of SNP distances in the dataset that contains both recent transmission and more distant transmission (dm) can be modelled as a mixture of these two distributions, where q indicates the probability that any given pair of samples are related by recent transmission,
To estimate these parameters we fit the negative binomial model to the dataset containing exclusively distantly related samples using maximum likelihood. This produces a distribution of the SNP distance between distantly related samples on the basis of the second dataset. Substituting these estimates into the mixture then allows for the estimation of the remaining parameters using maximum likelihood. Note that, during this second estimation, we use the first dataset without the second dataset. Finally, an SNP threshold can be chosen by selecting the probability that a close transmission pair is erroneously excluded.
To select thresholds for the analysis of transmission in the mother and baby dataset36,37, repeated samples of the same baby were used as the ‘close transmission’ dataset and pairs of samples from different infants were used as the ‘distant’ dataset. In the FMT dataset, the ‘close transmission’ dataset consisted of samples from the same patient post-FMT, and the ‘distant’ dataset consisted of pairs of samples taken from unrelated donors. In both cases, a probability of missing a transmission of 0.001 was chosen. To adjust for the fact that the ‘close transmission’ pairs were from the same patient, the estimated thresholds were multiplied by three to allow for twice the evolutionary time to occur in the mother/donor.
Alternative algorithms and databases
In all cases, StrainPhlAn was run using version 4.0.5 with the default database using the command line parameters outlined in the study by Valles-Colomer et al.4.
For the metagenomic simulations, TRACS, InStrain and StrainGE were run using a reference database comprising the GTDB represenative genomes that correspond to the 11 bacterial species identified in a study investigating the transmission of drug-resistant pathogens using metagenomics24.
For the pneumococcal mixed strain simulations, TRACS and StrainGE were run using representative genomes from each sequencing cluster described in the Global Pneumococcal Sequencing project29. As InStrain required a dereplicated dataset, the 19F serotype included in the laboratory mixtures was used to run InStrain on the simulated pneumococcal deep population sequencing data.
The default databases for each tool were used in the analysis of the FMT metagenomic data35. The GTDB r207 bacterial reference database was used38 to run TRACS. InStrain was run using the UHGG reference database as described in the InStrain user manual. As it was computationally expensive to consider all possible species when running StrainGE on the FMT data, we restricted all analyses to the Bifidobacterium genus. Reference genomes were downloaded from Refseq and the database was constructed following the StrainGE user manual.
TRACS used the same pneumococcal reference database to analyse the pneumococcal deep sequencing data from the Maela refugee camp. To analyse the SARS-CoV-2 and P. falciparum deep population sequencing data, the reference genome from each species was used when running TRACS. Similar to the FMT data, TRACS was run using the GTDB r207 bacterial reference database to analyse the UK infant birth cohort.
To compare the resource requirements of each algorithm, we measured CPU time required for all simulated metagenomic samples where one genome was simulated to have been transmitted (Extended Data Fig. 4). Memory was compared on a representative pair of samples. Comparing the results across algorithms is challenging, as each requires different tasks to be performed separately by the user, such as read alignment and SNP distance calculation. To ensure a fair comparison, we used the same tools employed by TRACS for operations not automatically handled by other pipelines. Although the exact differences in speed and memory usage may vary depending on the dataset, TRACS consistently demonstrated the lowest memory and CPU usage in this test, outperforming the next best algorithm by approximately a factor of 2 (Extended Data Fig. 4).
The exact commands used to run StrainPhlAn, inStrain and StrainGE are given in the supplementary code provided in the associated GitHub repository50.
Simulated datasets
Given a user-specified SNP distance and a set of reference genomes, transmission of an individual strain between pairs of metagenomic or population sequencing samples were simulated using a custom python script provided as part of the TRACS package.
The number of reference genomes included in each sample was simulated using a Poisson distribution. The user-specified number of SNPs was then simulated by randomly introducing mutations along one reference genome. This genome was included in both samples in a pair. The remaining reference genomes in each sample were randomly selected from the user-provided list. In cases where both samples shared a reference genome other than the transmitted one, additional SNPs were randomly introduced. The total number of SNPs separating non-transmitted strains was drawn from a Poisson distribution with a mean of 10,000. This value was chosen to ensure that all methods should be able to reliably distinguish the small genetic distances characteristic of recent transmission from much larger distances representing hundreds of years of evolution. While closely related strains can be shared through routes other than transmission, such events cannot be distinguished solely by genetic distance. As a result, they are not considered in detail in these simulations.
Simulated sequencing depths were drawn from a Dirichlet distribution with all parameters equal to 1. A cumulative read depth across all genomes of 500 reads was used, and Illumina reads were generated using the ART synthetic read simulator51, similar to the simulation method used in the development of StrainGE10. This coverage corresponded to an average of 18 million and 7 million reads in the metagenomic and pneumococcal simulations, respectively. Simulated Nanopore reads were generated using Badread v0.4.152. A full set of parameters and commands can be found in the supplementary code provided in the GitHub repository50.
To simulate transmission between gut metagenomes, 11 bacterial species were selected from a study investigating the transmission of drug-resistant pathogens using metagenomics24. A series of high-quality reference genomes were then selected from the full GTDB r207 database that averaged at 100%, 99% and 97% in pairwise sequence identity from the GTDB r207 representative genomes for each species. Simulated transmission pairs were then generated at each identity level. To increase the complexity of the metagenomic simulations, and in line with previous simulation approaches9,10, real metagenomic samples were appended to the in silico simulated data. A pair of samples were chosen from the FMT dataset for this purpose, which were found not to share strains by all algorithms (accessions ERR9707885 and ERR9709032).
To simulate transmission among deeply sequenced populations of pneumococci, we randomly selected 28 pneumococcal genomes from a prior study53. A full list of all commands used along with the reference genomes is provided in the associated GitHub repository50.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
SARS-CoV-2 sequencing reads were taken from a recent study investigating the within-host evolution of the virus27. The European Nucleotide Archive (ENA) IDs of the 37 SARS-CoV-2 samples found to contain multiple distinct lineages are available via GitHub at https://github.com/gtonkinhill/tracs_manuscript (ref. 50). All sequencing reads from the original study are available in the ENA under accession number ERP126512. The S. pneumoniae and P. falciparum sequencing reads are available from the ENA under project codes PRJEB22771 and PRJNA482776, respectively7,34. The deep metagenomic sequencing reads of gut samples from mothers and their children are available from the ENA under the accession number ERP115334 (refs. 36,37). The FMT triad samples included sequencing data published in the ENA under project code PRJEB47909 (ref. 35). The sample accession codes used in all analyses are available via GitHub at https://github.com/gtonkinhill/tracs_manuscript (ref. 50).
Code availability
The TRACS code and user documentation is available (under an MIT open source licence) via GitHub at https://github.com/gtonkinhill/tracs/ (ref. 20). Supplementary code used to generate the analysis and figures in this article is available via GitHub at https://github.com/gtonkinhill/tracs_manuscript (ref. 50).
References
Hodcroft, E. B. et al. Spread of a SARS-CoV-2 variant through europe in the summer of 2020. Nature 595, 707–712 (2021).
Harris, S. R. et al. Evolution of MRSA during hospital transmission and intercontinental spread. Science 327, 469–474 (2010).
White, R. T. et al. Genomic epidemiology reveals geographical clustering of multidrug-resistant Escherichia coli ST131 associated with bacteraemia in Wales. Nat. Commun. 15, 1371 (2024).
Valles-Colomer, M. et al. The person-to-person transmission landscape of the gut and oral microbiomes. Nature 614, 125–135 (2023).
Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
Lieberman, T. D. et al. Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat. Genet. 46, 82–87 (2014).
Tonkin-Hill, G. et al. Pneumococcal within-host diversity during colonization, transmission and treatment. Nat. Microbiol. 7, 1791–1804 (2022).
Wymant, C. et al. PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol. Biol. Evol. 35, 719–733 (2017).
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
van Dijk, L. R. et al. StrainGE: a toolkit to track and characterize low-abundance strains in complex microbial communities. Genome Biol. 23, 74 (2022).
Zhao, C., Dimitrov, B., Goldman, M., Nayfach, S. & Pollard, K. S. MIDAS2: metagenomic intra-species diversity analysis system. Bioinformatics 39, btac713 (2023).
Nayfach, S., Rodriguez-Mueller, B., Garud, N. & Pollard, K. S. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 26, 1612–1625 (2016).
Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).
Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Large-scale sequence comparisons with sourmash. F1000Res. 8, 1006 (2019).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2014).
Stimson, J. et al. Beyond the SNP threshold: identifying outbreak clusters using inferred transmissions. Mol. Biol. Evol. 36, 587–603 (2019).
Seemann, T. Snippy v4.6.0: rapid haploid variant calling and core genome alignment. GitHub https://github.com/tseemann/snippy (2015).
Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
Tonkin-Hill, G. et al. TRACS code and user documentation. GitHub https://github.com/gtonkinhill/tracs/ (2026).
Campbell, F., Strang, C., Ferguson, N., Cori, A. & Jombart, T. When are pathogen genome sequences informative of transmission events?. PLoS Pathog. 14, e1006885 (2018).
Senghore, M. et al. Widespread sharing of pneumococcal strains in a rural African setting: proximate villages are more likely to share similar strains that are carried at multiple timepoints. Microb. Genom. 8, 000732 (2022).
Knight, J. R. et al. Determining the serotype composition of mixed samples of pneumococcus using whole-genome sequencing. Microb. Genom. 7, mgen000494 (2021).
Mu, A. et al. Reconstruction of the genomes of drug-resistant pathogens for outbreak investigation through metagenomic sequencing. mSphere 4, e00529–18 (2019).
Lythgoe, K. A. et al. SARS-CoV-2 within-host diversity and transmission. Science 372, 6539 (2021).
DNA Pipelines R&D et al. COVID-19 ARTIC v3 Illumina library construction and sequencing protocol c4. protocols.io https://doi.org/10.17504/protocols.io.bgxjjxkn (2020).
Tonkin-Hill, G. et al. Patterns of within-host genetic diversity in SARS-CoV-2. eLife 10, e66857 (2021).
De Maio, N. et al. Issues with SARS-CoV-2 sequencing data. Virological https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 (2020).
Gladstone, R. A. et al. International genomic definition of pneumococcal lineages, to contextualise disease, antibiotic resistance and vaccine impact. eBioMedicine 43, 338–346 (2019).
Lees, J. A. et al. Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. eLife 6, e26255 (2017).
Belman, S., Pesonen, H., Croucher, N. J., Bentley, S. D. & Corander, J. Estimating between country migration in pneumococcal populations. G3 14, jkae058 (2024).
World Malaria Report 2023 (World Health Organization, 2023).
Zhu, S. J. et al. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. eLife 8, e40845 (2019).
Nkhoma, S. C. et al. Co-transmission of related malaria parasite lineages shapes within-host parasite diversity. Cell Host Microbe 27, 93–103 (2020).
Ianiro, G. et al. Variability of strain engraftment and predictability of microbiome composition after fecal microbiota transplantation across different diseases. Nat. Med. 28, 1913–1923 (2022).
Shao, Y. et al. Stunted microbiota and opportunistic pathogen colonization in caesarean-section birth. Nature 574, 117–121 (2019).
Shao, Y. et al. Primary succession of bifidobacteria drives pathogen resistance in neonatal microbiota assembly. Nat. Microbiol. 9, 2570–2582 (2024).
Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
Fujimura, K. E. et al. Neonatal gut microbiota associates with childhood multisensitized atopy and T cell differentiation. Nat. Med. 22, 1187–1191 (2016).
Stokholm, J. et al. Maturation of the gut microbiome and risk of asthma in childhood. Nat. Commun. 9, 141 (2018).
Dubois, L. et al. Paternal and induced gut microbiota seeding complement mother-to-infant transmission. Cell Host Microbe 32, 1011–1024 (2024).
Ennis, D., Shmorak, S., Jantscher-Krenn, E. & Yassour, M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat. Commun. 15, 894 (2024).
Selma-Royo, M. et al. Birthmode and environment-dependent microbiota transmission dynamics are complemented by breastfeeding during the first year. Cell Host Microbe 32, 996–1010 (2024).
De Maio, N., Worby, C. J., Wilson, D. J. & Stoesser, N. Bayesian reconstruction of transmission within outbreaks using genomic variants. PLoS Comput. Biol. 14, e1006117 (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. htsbox, r312. GitHub https://github.com/lh3/htsbox (2013).
Minka, T. P. Estimating a Dirichlet distribution. Microsoft https://www.microsoft.com/en-us/research/publication/estimating-dirichlet-distribution/ (2000).
Tukey, J. W. Exploratory Data Analysis (Addison-Wesley, 1977).
Bonferroni, C. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8, 3–62 (1936).
Tonkin-Hill, G. et al. The TRACS algorithm and supplementary code. GitHub https://github.com/gtonkinhill/tracs_manuscript (2026).
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Wick, R. Badread: simulation of error-prone long reads. J. Open Source Softw. 4, 1316 (2019).
Chewapreecha, C. et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat. Genet. 46, 305–309 (2014).
Acknowledgements
Funding was provided by the Australian Research Council (grant no. DE240100316 to G.T.-H.), the National Health and Medical Research Council (grant no. GNT2025515 to G.T.-H., GNT2013831 to O.X. and GNT2043549 to M.R.D.), The Wellcome Sanger Institute is supported by core funding from the Wellcome Trust (grant no. 206194 to Y.S., S.D.B. and T.D.L.), the Norwegian Research Council FRIPRO (grant no. 299941 to G.T.-H. and J.C.) and the ERC (grant no. 742158 to J.C.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the paper.
Author information
Authors and Affiliations
Contributions
G.T.-H. worked on study conception and design, method development, dataset construction, analysis, paper writing and editing. Y.S. worked on analysis. A.E.Z. and S.M. worked on method development and analysis. O.X., T.M. and H.A.T. worked on analysis. M.R.D., S.D.B. and T.D.L. worked on supervision. J.C. worked on supervision, study conception, design and funding acquisition. All authors read and approved the final paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Microbiology thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Accuracy of pairwise SNP distance estimation across varying strain frequencies and algorithms.
The relative error (destimated − dsimulated)/dsimulated, in the estimated pairwise SNP distance across four algorithms, when applied to a simulated mixture of pneumococcal genomes. Here, the relative SNP distance error is shown as a function of the minimum frequency of the transmitted strain within each sample pair. TRACS maintained consistent accuracy across a range of frequencies, whereas alternative algorithms exhibited higher error rates, particularly at lower strain frequencies. Only strains with an average sequence coverage of at least 5 × were considered. A single genome is simulated to have been transmitted between each pair of samples with a SNP distance given by the title of each facet of the plot. Ten replicate simulations were performed for each parameter set. Points closest to zero are the best performing, with positive values indicating an overestimation of the SNP distance and negative values indicating an underestimation.
Extended Data Fig. 2 Impact of filtering strategies on the accuracy of TRACS pairwise SNP distance estimation.
The relative error (destimated − dsimulated)/dsimulated, in the estimated pairwise SNP distance of the TRACS algorithm with different filtering strategies. A simulated SNP distance of 5 is used, with 10 replicates for each parameter set. ‘All filters’ refers to the full TRACS algorithm, while ‘no recombination’ and ‘no coverage’ indicate that the recombination scan statistic and the empirical Bayes coverage filters are turned off, respectively.
Extended Data Fig. 3 Consistency of TRACS SNP distance estimation across simulated short-read and long-read sequencing technologies.
The inferred versus simulated distance of running TRACS on simulated mixtures of pneumococcal strains, where each sample pair is simulated to have a single strain transmitted at the given SNP distance. Here, both Illumina (red) and Nanopore R10.4.1 (blue) sequencing reads have been simulated using the ART and Badread simulators, respectively. Boxplots indicate the distribution of the inferred number of SNPS. The central line represents the median, the edges of the box indicate the interquartile range (IQR), and the whiskers extend to 1.5 times the IQR. This indicates that TRACS can be used with high-accuracy long read sequencing technology.
Extended Data Fig. 4 Computational resource requirements and scalability of metagenomic SNP distance estimation pipelines.
A) The average CPU required for running all components required for each pipeline on a representative simulated pair of metagenomic samples where a single genome was simulated to have been transmitted. Here, InStrain (assembly) refers to the pipeline which relies on metaSpades to generate custom reference databases for each pair of samples. The time taken to run metaSpades is shaded and is only included as a representative example. Alternative assembly algorithms would have different resource requirements. In the remaining figures only the InStrain portion of the resource requirements is shown. B) The average CPU required for estimating SNP distances from pre-processed samples in each pipeline. As any pairwise SNP distance algorithm can be used on the MSAs generated by StrainPhlAn it has been excluded. C) The projected CPU time required for each pipeline if all pairs in an analysis were considered. This is the result of the individual sample preprocessing times in addition to the CPU required for the distance calculations multiplied by the number of pairwise comparisons. Here, we have excluded the assembly time in the InStrain (assembly) pipeline. Consequently, these runtimes represent any version of the InStrain pipeline that relies on pair-specific references. D) The memory required to run each pipeline. All resource comparisons were run on the same compute server, an Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20GHz with 112 CPUs and 3Tb of memory. These results offer a general indication of each algorithm’s performance. However, the exact differences in speed and memory usage will vary depending on the characteristics of each specific dataset.
Extended Data Fig. 5 Distribution of inferred SNP distances between related and unrelated participant pairs.
The number of SNPs for each species pair, inferred by each algorithm, between samples taken as part of a study investigating the impact of Faecal Microbiota Transplants (FMTs) (Ianiro et al., 2022). Distances are truncated at 3,000 SNPs. Transmission events inferred between samples from unrelated participants (shown in black) are highly likely to be false positives. In contrast, samples from the same participant following FMT (shown in red) are more likely to represent true positives, with an error rate proportional to that observed among unrelated participants. Both TRACS and InStrain show clear separation between the red and black distributions. However, due to its reliance on marker genes, StrainPhlAn often infers short SNP distances between unrelated sample pairs which makes distinguishing recent transmission difficult.
Extended Data Fig. 6 Impact of SNP thresholding strategies on transmission inference.
The inferred number of transmission pairs, inferred by each algorithm, between samples taken as part of a study investigating the impact of Faecal Microbiota Transplants (FMTs) (Ianiro et al., 2022). Transmission between samples from unrelated participants (shown in red) are highly likely to be false positives. In contrast, samples from the same participant following FMT (shown in black) are more likely to be true positives, with an error rate proportional to that observed in the unrelated participant results. Species specific SNP thresholds for identifying recent transmission were inferred using the mixture distribution approach (solid colour). Additional transmission pairs identified using the Youden cut-off method proposed in Valles-Collomer et al., 2023, are shown in a lighter colour with a dotted outline. This indicates that the Youden approach is substantially more relaxed and is likely to result in a high rate of false positives.
Extended Data Fig. 7 Comparison of species persistence rates inferred by TRACS and StrainPhlAn algorithms.
Species found to have higher rates of persistence when analysed using TRACS (blue) compared to the StrainPhlAn (red) algorithm as originally used in the initial analysis of this cohort (Shao et al., 2019). Due to the substantially high rate of false positives associated with StrainPhlan it is likely that additional species may experience frequent colonization by multiple strains which are not shown here.
Supplementary information
Supplementary Information (download PDF )
Supplementary Methods.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tonkin-Hill, G., Shao, Y., Zarebski, A.E. et al. Strain-level transmission inference across multi-kingdom metagenomic data using TRACS. Nat Microbiol (2026). https://doi.org/10.1038/s41564-026-02339-x
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41564-026-02339-x







