Main

Host-to-host transmission is a fundamental process that shapes the interactions between humans and microbes. Tracking the spread of pathogens using genomics has become a major tool in public health, helping to prevent the spread of disease at both local and global scales1,2. Beyond pathogens, understanding the transmission and colonization dynamics of commensal microbes would greatly improve our understanding of microbiome assembly and maintenance, and how this is influenced by diet, lifestyle, culture, clinical interventions and social interactions. In addition, the human microbiome contains microbes with therapeutic potential to treat a variety of human diseases. The ability to identify candidate strains during clinical interventions such as faecal microbiota transplantation (FMTs) and live biotherapeutic products would greatly de-risk and accelerate the development of microbiome-based therapeutics.

Whole-genome sequencing (WGS) has transformed our ability to infer transmission chains by detecting single nucleotide polymorphisms (SNPs), enabling precise tracking of slowly evolving pathogens such as methicillin-resistant Staphylococcus aureus2. However, most WGS analyses focus on a representative genome from a single species, overlooking within-host microbial strain diversity3. Recent metagenomic approaches address this by enabling simultaneous analysis of multiple species and strains4,5. Deep population sequencing of specific species can also be achieved through targeted enrichment methods, such as culture or PCR amplicon sequencing, supporting fine-scale analysis of co-colonizing strains6,7,8. Existing gold-standard methods for tracking transmission using metagenomic or deep population sequencing data have typically been developed for curated academic studies5,9,10,11. While these methods offer valuable insights, such as within-host variant calling and identification of selection pressures via dN:dS, they lack the speed and flexibility required for routine public health monitoring. Notably, they often lack the temporal resolution needed to accurately distinguish between strains transmitted recently (weeks to months) and distantly related genomes (years).

Tools that rely on reference marker gene databases including MIDAS11,12 and StrainPhlAn9 consider only a small portion of a species genome (10–200 genes4) and do not attempt to separate within-species diversity. This substantially limits the temporal resolution at which transmission can be inferred. An alternative approach, used by StrainGE10 and inStrain (database mode)5, involves identifying the species found across the entire dataset and building a dataset-specific reference genome database for read alignment. This approach relies heavily on the similarity between the transmitted and reference genome and does not allow for the continuous integration of new samples into an analysis, making it unsuitable for routine genomic surveillance.

Another common approach relies on de novo metagenomic assembly, including the inStrain (assembly mode)5 and STRONG13 pipelines. Assembly requires high sequencing coverage of the transmitted genome10. These methods generally perform best when samples are pooled before assembly, followed by genome binning in a co-assembly workflow. However, to avoid extensive genome deduplication, which effectively turns the process into a reference-based approach, they must be applied to pairs or reduced subsets of samples. This can substantially increase the computational burden. Recombination and shared homology between strains of the same species within a sample is also generally not accounted for by existing algorithms, which can have a major impact on the accuracy of transmission inference when considering metagenomic and population sequencing data. To address these issues, we developed TRAnsmision Clustering of Strains (TRACS), a highly accurate and easy-to-use algorithm for establishing whether two samples are plausibly related by a recent transmission event.

The TRACS algorithm distinguishes the transmission of closely related strains by identifying genetic differences as small as a few SNPs, which is crucial when considering slow-evolving pathogens. The algorithm employs statistical filtering techniques to account for variable sequence coverage, shared homology between strains, and sequencing errors. Critically, TRACS was designed to estimate an accurate but conservative (lower bound) of the SNP distance and considers each reference alignment independently, enabling continuous integration of new samples. This algorithmic approach allows TRACS to function similarly to how SNP distances are used in conventional single-isolate genomic epidemiology studies, making it an ideal tool for accurately identifying putative transmission networks and ruling out transmission events in ongoing public health applications. However, similar to single-isolate studies, discerning fine-scale transmission structure, such as the direction of transmission, typically requires additional epidemiological data, such as contact tracing information.

We demonstrate using comprehensive simulations, and by considering well characterized FMT metagenomic datasets, that TRACS has superior performance to existing metagenomic transmission inference methods and can reliably identify plausible transmission events. We apply TRACS to a diverse set of pathogens including viruses, bacteria and parasites, demonstrating the scalability and versatility of the algorithm.

Results

Algorithm overview

The initial ‘alignment’ stage of TRACS uses the hash-based search algorithm ‘Sourmash’ to identify a set of reference genomes that best represent the species and their strains in a sample14. Unlike alternative approaches, TRACS avoids the error-prone deconvolution of reads into different genome bins and instead aligns the entire read set to each reference genome individually, generating a count of the alleles observed at each site15. This enables the incremental addition of new samples and the integration of additional species into the reference database without requiring the reprocessing existing data.

A set of statistical filtering algorithms are then applied to exclude regions affected by shared sequence homology, multi-mapping reads, poor alignment quality and low sequencing coverage (Fig. 1). This includes a scan statistic, similar to those used in phylogenetics to detect recombination16, that can identify regions with elevated polymorphism rates, often resulting from shared sequence homology between co-colonizing strains or gene duplications. TRACS also includes an empirical Bayes approach to account for regions of a reference genome with insufficient coverage to accurately represent multiple strains present within a sample (Methods). The resulting filtered alignments are converted into reference-based multiple sequence alignments (MSAs) for each reference genome. Finally, TRACS incorporates a fast, International Union of Pure and Applied Chemistry (IUPAC)-aware pairwise SNP distance estimation algorithm.

Fig. 1: A schematic illustrating the key components of the TRACS algorithm.
Fig. 1: A schematic illustrating the key components of the TRACS algorithm.The alternative text for this image may have been generated using AI.
Full size image

Left: the reads are aligned to each reference genome identified by Sourmash separately. An empirical Bayes method is used to pinpoint genome regions with insufficient coverage for minority strain identification. Centre: alignments from each sample are transformed into an MSA using IUPAC ambiguity codes to represent multiple alleles at a single site. Rapid pairwise SNP distances are then calculated, excluding potential recombination regions by identifying areas with high SNP density. The TransCluster algorithm can optionally be applied to estimate the expected number of intermediate hosts between two samples. Right: the resulting pairwise transmission distance estimates are clustered using single linkage hierarchical clustering to infer putative transmission clusters. Transmission distance thresholds are inferred using a mixture distribution to separate sample pairs that are known to be distantly related from those that include recent transmissions.

Optionally, TRACS incorporates sampling dates and a known transmission generation time to estimate the expected number of intermediate hosts between samples using an extended version of the TransCluster algorithm17 (Methods). While the alignment module of TRACS is designed for metagenomic and population sequencing data, MSAs produced by alternative tools designed for isolate data, such as Snippy18, can also be used directly as input.

A key challenge in using pairwise distance methods to infer transmission chains is determining an appropriate threshold for when two strains are probably connected by recent transmission. TRACS introduces a method that uses a mixture distribution to distinguish recent transmission from distantly related strains, leveraging both closely related samples (for example, from the same individual) and distantly related samples to derive more specific thresholds than previous approaches based on Youden’s index4,19.

The TRACS algorithm is implemented in python and C++, and is available under the MIT open source licence20.

TRACS shows strong results on simulated pairs

To assess the ability of TRACS to accurately estimate small genetic distances in pairs of samples containing multiple strains of the same species, we simulated WGS reads from mixtures of common Streptococcus pneumoniae strains. Common transmission SNP thresholds for S. pneumoniae are often <10 SNPs21. In each pair we selected one genome to be shared between samples at a specified SNP distance (5, 50 and 500 SNPs) ensuring a minimum average sequencing coverage of 5× for the transmitted strain. The TRACS algorithm was compared with InStrain, StrainGE and StrainPhlAn.

All algorithms except TRACS substantially overestimated genetic distances (Fig. 2a). This bias was larger among low-frequency strains (Extended Data Fig. 1), limiting the ability of these algorithms to reliably rule out recent transmission events, as commonly used SNP distance thresholds for many species fall well below the resulting inflated estimates21. StrainGE was the best-performing algorithm after TRACS, probably owing to its ability to account for minority strains and its use of competitive mapping among genomes within a species. This approach helps mitigate the effects of horizontal gene transfer, as demonstrated in the original StrainGE publication10. However, StrainGE only reported relatively accurate SNP distances at higher values (500 SNPs), which exceeds common thresholds (typically <50 SNPs) used to distinguish transmission in S. pneumoniae22. To determine which filter in the TRACS algorithm had the greatest impact, we evaluated combinations of the statistical filtering methods implemented in TRACS (Extended Data Fig. 2). The coverage filter, including the empirical Bayes algorithm, produced the largest improvement, followed by the recombination filter. Repeating these simulations with in silico Nanopore R10.4.1 reads further demonstrated TRACS’ compatibility with multiple high-accuracy sequencing technologies (Extended Data Fig. 3).

Fig. 2: Accuracy of SNP distance estimation algorithms on simulated and laboratory mixtures.
Fig. 2: Accuracy of SNP distance estimation algorithms on simulated and laboratory mixtures.The alternative text for this image may have been generated using AI.
Full size image

a, The relative error (destimateddsimulated)/dsimulated, in the estimated pairwise SNP distance across four algorithms, when applied to a simulated mixture of pneumococcal genomes. A single genome is simulated to have been transmitted between each pair of samples with an SNP distance given on the x axis. Ten replicate simulations were performed for each parameter set with box plots indicating the distribution in relative error rates. Points closest to zero are the best performing, with positive values indicating an overestimation of the SNP distance and negative values indicating an underestimation. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR. b, The results of running each algorithm on artificial laboratory mixtures of pneumococcal strains from Knight et al.23. Ten of the resulting pairs of samples included at least one identical strain and thus the correct SNP distance should be zero. Except for TRACS, all algorithms erroneously estimate elevated SNP distances. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR. c, The simulated transmission of individual species was modelled between synthetic gut metagenome samples. Genomes were selected to have average nucleotide identities (ANI) of 100%, 99% and 97% relative to the GTDB representative genomes, which serve as the reference database for the InStrain, StrainGE and TRACS algorithms. StrainPhlan was run using the marker gene database (v4.0.5). Similar to the pneumococcal simulation, each of the ten species was simulated independently at three different SNP distance thresholds: 5, 50 and 500 SNPs. Only a single strain per species was included in each simulation. In addition to running InStrain in reference-based mode using GTDB representative genomes, InStrain was also run with custom reference genomes (InStrain assembly) generated for each sample pair using metaSPAdes v4.2.0. Contigs were ‘binned’ into species by aligning the assemblies back to the simulated genomes. The central line represents the median, the edges of each box indicate the IQR and the whiskers extend to 1.5 times the IQR.

To validate these results using real sequencing data with a known ground truth, we analysed 13 laboratory mixtures of different S. pneumoniae strains from a previous study23. In each case, at least one identical pneumococcal genome was shared between samples, meaning all algorithms should infer zero SNPs. As shown in Fig. 2b, TRACS was the only algorithm that reliably inferred zero SNPs in all cases. StrainPhlAn correctly identified zero SNPs in some cases; however, its reliance on marker genes can lead it to systematically underestimate SNP distances, as shown in the subsequent analysis of FMT data.

To further explore the reconstruction of transmission chains from metagenomic data involving multiple distinct species, we simulated genome mixtures representing common gut bacteria at varying levels of similarity to the reference database: 100%, 99% and 97% pairwise sequence identity (Methods). In these simulations, a single species identified in a previous study of transmission using metagenomics was simulated to transmit with a specified SNP distance24. A common reference database was used for all methods, with the exception of StrainPhlAn where the default marker gene database v4.0.5 was used. InStrain consistently overestimated SNP distances for most simulated pairs once the transmitted genome diverged by at least 1% from the reference database (Fig. 2c). Alternative strategies that generate study-specific reference databases, such as co-assembly of all samples or individual sample assembly followed by deduplication, could potentially improve these estimates. However, because inStrain relies on competitive read mapping, it is recommended that its reference databases be deduplicated so that any pair of genomes shares no more than 98% sequence identity. Consequently, in studies where multiple strains of the same species are present (a common occurrence in large datasets), strains that diverge from the assembled and/or deduplicated reference genome would still produce errors similar to those shown in Fig. 2c.

As an alternative, we explored generating sample-pair-specific reference genomes. In this mode (‘inStrain assembly’), metaSPAdes was used to assemble each pair of samples independently, producing a unique reference database for each pair (Fig. 2c). While this approach improved SNP distance estimates, its computational cost is prohibitive for larger studies. Excluding the computational cost of assembly, using pair-specific references requires the inStrain alignment and variant-calling steps to be run separately for each pair. This step alone required approximately 12.5 central processing unit (CPU) hours per pair (Extended Data Fig. 4). For a modest dataset of 100 samples, this would translate to 4,950 separate alignment steps and more than 2.5k CPU hours. Furthermore, if multiple strains of the same species were present within a given pair, this approach might still fail to assemble the transmitted strain as the reference genome.

Enhanced transmission estimates across diverse taxa

A major advantage of the TRACS algorithm is its ability to identify putative transmission events in cases where the hosts are colonized with multiple strains of the same species. This is a major concern in many high disease burden settings. Unlike alternative algorithms such as StrainPhlAn, the TRACS algorithm is applicable across diverse taxonomic groups including parasitic, viral and bacterial species. To demonstrate the effectiveness of the TRACS algorithm across a wide range of pathogens we considered well characterized datasets across three different settings.

SARS-CoV-2

Although infection with multiple severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains is relatively rare25, the rates of occurrence can be higher in locations with high disease burdens, such as within hospital wards, particularly when infection control procedures have broken down.

To demonstrate the ability of the TRACS algorithm to account for this challenges, we considered 37 of 1,181 SARS-CoV-2 samples, collected from the East of England in early 2020, which were processed using Illumina deep amplicon sequencing26. These samples were found to contain multiple distinct strains after sequencing each sample in replicate to account for sequencing errors27. These were compared with samples containing a single strain within the mixture. Thus, the smallest SNP distance between samples should be zero. To investigate the utility of applying TRACS to these samples, we compared the inferred SNP distance using TRACS with a consensus-based approach, with and without filtering for problematic sites such as hypermutable loci and regions impacted by the ends of amplicon sequencing reads28 (Fig. 3a).

Fig. 3: Transmission inference across diverse taxonomic groups using TRACS.
Fig. 3: Transmission inference across diverse taxonomic groups using TRACS.The alternative text for this image may have been generated using AI.
Full size image

a, A SARS-CoV-2 transmission network comprising samples that contain multiple distinct strains. Solid lines indicate transmission events observable using both TRACS and the typical consensus-based sequencing, while dashed lines represent additional links identified by TRACS alone. b, The inferred SNP distance between 21 pairs of 39 samples containing the same strain, where at least 1 sample includes multiple distinct strains. The consensus approach considers the dominant allele across the entire SARS-CoV-2 reference genome, whereas the ‘original’ approach excludes sites frequently filtered out to avoid hypervariable or error-prone regions. The TRACS algorithm correctly infers zero SNPs in all pairs. Box plots indicate the distribution of the inferred number of SNPS. The central line represents the median, the edges of the box indicate the IQR and the whiskers extend to 1.5 times the IQR. c, The expected number of intermediate hosts between each pair of 3,761 pneumococcal samples taken from different individuals in the Maela refugee camp versus the geographic distance between their homes within the camp with a LOESS smoothing. The shaded area indicates the corresponding confidence interval. Box plots show the median expected number of intermediate hosts. The edges of each box indicate the IQR, and the whiskers extend to 1.5 times the IQR. Sample pairs with a divergence time outside the establishment of the camp in 1984 were excluded. d, Similar to c, but only distances involving the three most common GPSCs found in 796 samples are shown. A strong geographic signal is absent for the non-multi-drug resistant lineage GPSC 1, which is known to have a longer carriage duration. e,f, The distribution of SNP distances inferred by TRACS for bulk versus single cell (e) and single cell versus single cell (f) samples. The vertical red line indicates the SNP threshold inferred using the TRACS mixture distribution approach. The high number of short SNP distances between single cells from the same sample indicates that TRACS can accurately distinguish closely related genomes within mixed infections of P. falciparum.

When the shared strain was in the minority, a consensus-based approach consistently overestimated the SNP distance between samples despite filtering out problematic regions. By contrast, TRACS correctly inferred 0 SNPs between samples in all cases without requiring manual filtering of problematic sites. Figure 3b indicates the transmission network inferred assuming 0 SNPs between strains for those mixed samples with a geographic location. The dotted edges indicate transmission links that would have been missed using a standard consensus-based framework. This example highlights the ability of TRACS to both account for the presence of multiple strains and to robustly control for the impacts of sequencing errors and hypermutable sites that frequently result in polymorphic variants within samples. TRACS is also likely to be robust to rare cases of recombination between strains within the host. In this case, ignoring polymorphisms could cause consensus methods to overestimate SNP distances, whereas, with sufficient coverage of both parent strains, TRACS would still detect shared strains between samples.

Streptococcus pneumoniae

An added benefit of the TRACS algorithm is its ability to estimate the expected number of intermediate hosts between two samples by incorporating sampling dates. Dates can be used as an additional piece of information to rule out transmission for species or lineages with low sequence diversity17.

To demonstrate this approach, we considered 3,761 nasopharyngeal swabs taken from 468 infants and 145 of their mothers living in a refugee camp in Thailand, where culture-based enrichment and whole pneumococcal population Illumina deep sequencing had been performed as part of a previous study7. TRACS was run on this dataset using a custom database of pneumococcal reference genomes from the Global Pneumococcal Sequencing project29. Consistent with previous studies, we assume a molecular clock rate of 5.3 SNPs per year and a transmission generation time of 2 months7. After running TRACS, the expected number of intermediate hosts between all pairs was compared with the geographic distance separating the households of participants (Fig. 3c,d). Although the original TransCluster algorithm was previously applied to this dataset, it only estimates individual probabilities, such as the probability of direct transmission, without estimating the overall expected number of intermediate hosts.

There was a strong correlation between geographic distance and the expected number of intermediate hosts, despite the small size of the refugee camp (2.4 km2). Interestingly, this correlation was not found for all the major lineages. In particular, common multi-drug-resistant lineages such as Global Pneumococcal Sequencing Cluster (GPSC) 1, had no clear association between transmission and geographic distance within the camp (Fig. 3d). These lineages are known to have longer carriage durations, which can obscure geographic transmission signals30,31. Thus, the observed signals are probably driven by lineages that transmit faster and exhibit shorter carriage durations, such as GPSC 20. Understanding the different dynamics governing the transmission of these lineages is crucial for the design of interventions aimed at reducing pneumococcal disease.

Plasmodium falciparum

The increased genome size of major parasitic pathogens has hampered the adoption of routine WGS in disease surveillance. However, the rapidly decreasing cost of sequencing is leading to the increasing use of WGS to track major parasite populations including the malaria parasite, P. falciparum, which causes over half a million deaths each year32. Multiple strains of P. falciparum are frequently found within people living in endemic areas with high burdens of disease33.

To investigate the ability of the TRACS algorithm to accurately identify shared strains within mixed P. falciparum infections, we considered a dataset involving 49 samples from Chikhwawa, Malawi, that were positive for P. falciparum34. Both Illumina bulk sequencing of mixed populations in addition to single-cell and single-clone enrichment had been performed as part of a previous study leading to 49 mixed WGS samples and 509 single genomes34. Although the sparse sampling makes direct transmission between samples unlikely, the combination of bulk and single-cell sequencing allows us to assess whether the TRACS algorithm can accurately detect shared strains within these mixed samples. Figure 3e,f indicates the distribution of the inferred SNP distances between single-cell genomes and bulk samples from either the same sample or between samples. The clear separation of SNP distances between genomes sequenced from the same sample and between samples indicates that TRACS can accurately identify shared strains within this data. As WGS of malaria parasites becomes more routine in public health settings, the TRACS algorithm can be used to accurately determine if a pair of samples could be related by a recent transmission event. A fine-scale understanding of transmission dynamics in high burden settings will aid in the design of strategies aimed at reducing and eventually eliminating malaria.

Improved engraftment estimates in faecal transplant triads

FMT involves transferring a donor stool sample to a recipient, often via orally-delivered capsules, and has demonstrated patient benefit to treat infections, autoimmune diseases, graft-versus-host diseases and cancers. The development of microbiome drugs often requires the identification of specific bacterial strains with beneficial properties from complex microbiomes containing hundreds to thousands of strains. To further evaluate the efficacy of the TRACS algorithm in metagenomics applications, we analysed an extensively characterized FMT dataset with well-defined transmission (or engraftment) relationships between samples35. The capability of each algorithm to detect engraftment links between samples from donors and recipients was examined by considering 23 previously published FMT triads35. Metagenomic samples were originally collected from donors and recipients before and after transplantation across three patient cohorts, including individuals with Clostridium difficile infections, inflammatory bowel disease and recurrent multi-drug-resistant infections.

The TRACS algorithm was run on each of these samples and the results compared with StrainPhlAn, InStrain and StrainGE (Fig. 4a–c and Extended Data Fig. 5). As StrainGE requires substantial computing resources to consider transmission of every observed genus, we restricted this analysis to bifidobacteria. The inferred number of engraftment events was then inferred, comparing samples from the same patient post-FMT and those from different cohorts (Fig. 4a). Species-specific SNP thresholds were chosen using both the mixture distribution method implemented in TRACS and the Youden index as has been used previously4 (Extended Data Fig. 6). No strain sharing is expected between cohorts, so inferred transmissions across cohorts are probably false positives. By contrast, bacterial strains are expected to persist within a patient’s gut over time, providing a real-world ground truth for method comparison. Strain persistence estimates may still produce false positives at rates similar to those seen between cohorts.

Fig. 4: Estimates of engraftment in FMT triads.
Fig. 4: Estimates of engraftment in FMT triads.The alternative text for this image may have been generated using AI.
Full size image

a, The inferred number of plausible transmission pairs (shared strains), inferred by each algorithm, between samples taken as part of study investigating the impact of FMTs. Transmission between samples from unrelated participants (shown in red) are highly likely to be false positives. By contrast, samples from the same participant following FMT (shown in black) are more likely to be true positives, with an error rate proportional to that observed in the unrelated participant results. Species-specific SNP thresholds for identifying recent transmission were inferred using the mixture distribution approach (Methods). An analogous plot using the Youden method, as described by Valles-Colomer et al.4, is given in Extended Data Fig. 6. b, Histograms indicating the distribution of pairwise SNP distances (truncated at 500 SNPs) between donor and recipient (post-transplant) samples for major bifidobacterial species. Vertical black lines indicate the SNP thresholds inferred using the TRACS mixture distribution method. InStrain identified the transmission of B. infantis, which is probably B. longum but mislabelled in the UHGG genome collection database, whereas StrainGE identified no transmission of B. longum. c, An example of multiple strains of B. longum being transmitted between a single donor and multiple recipients was detected exclusively by TRACS. The allele frequencies at polymorphic loci within a segment of the B. longum reference genome are shown for each recipient sample, with colours indicating the two distinct strains. The red strain is dominant in recipient sample SFMT_03_t15 but in the minority in a separate recipient (SFMT_27_t15) who received the same donor stool.

StrainPhlAn had the highest false-positive rate, which increased substantially using the Youden method for selecting SNP thresholds (Fig. 4a and Extended Data Figs. 5 and 6). Consistent with our simulation results, InStrain was the most conservative method, performing well when there is high sequence similarity between the reference and transmitted genomes. However, InStrain frequently overestimated SNP distances in cases of greater sequence divergence from the reference or when multiple strains of the same species were involved, as indicated by our previous simulation-based analysis. In contrast to alternative algorithms, TRACS achieved a high sensitivity, with a low false-positive rate.

To investigate this further, we considered the Bifidobacterium genus, and in particular the transmission of Bifidobacterium longum, a common pioneer colonizer of infant gut microbiota36,37, which is often used as an infant probiotic. Understanding strain persistence and transmission is crucial for developing effective microbiome-based therapeutics. InStrain and StrainPhlAn were run using their default databases, while StrainGE was supplied with a representative set of bifidobacteria reference genomes following the user guide (Methods). TRACS was run using the Genome Taxonomy Database (GTDB) as a reference database38.

TRACS was the most sensitive algorithm at detecting the transmission of B. longum, identifying 31 shared strains, 6 more than the next most sensitive method, StrainPhlAn, which identified 25 (Fig. 4b). InStrain and StrainGE proved to be the most conservative algorithms, while StrainPhlAn identified the highest rate of sharing. This high rate of sharing probably includes many false positives owing to StrainPhlAn’s reliance on a small set of representative genes. Despite having B. longum genomes present in the reference database, StrainGE failed to identify this species amongst the engrafted strains. By contrast, InStrain identified six B. infantis transplanted strains, as B. longum and B. infantis are considered as one species in InStrain’s default database (Unified Human Gastrointestinal Genome, UHGG).

Unlike competing algorithms, TRACS consistently identified cases of engraftment involving multiple strains of B. longum present within a single sample. One example is given in Fig. 4c, where the frequency of the engrafted strains was reversed in two patients who received the same donor stool. TRACS was the only algorithm to correctly identify that these patients shared similar strains.

Strain transmission and persistence in a UK birth cohort

The transmission and subsequent colonization events that drive the development of our gut microbiota can have important implications on our health in childhood and later life. In particular, the impact of perturbations driven by interventions in early childhood, such as caesarean section and antimicrobial treatment, have been associated with health complications including asthma and atopy39,40.

To characterize the transmission and persistence of bacteria at the strain level during early childhood, we analysed faecal samples from 1,288 healthy, full-term infants, sequenced as part of the UK BabyBiome Study36,37. Faecal samples were collected from all babies at least once during the neonatal period (≤1 month), with subsequent sampling from 302 infants, including 29 twins and 1 triplet, during later infancy (8.75 ± 1.98 months). Maternal faecal samples were also taken from 175 mothers, corresponding to 178 neonates. The TRACS algorithm was run on all samples using the GTDB reference database and SNP thresholds were calculated to distinguish recent transmission at the species level using the mixture distribution approach (Methods). Consistent with the low error rate of TRACS and stringent infection control procedures in each hospital, the occurrence of shared strains between unrelated children, both within the same hospital and across different hospitals, was very low (0.78% and 0.73%, respectively) (Fig. 5a). Interestingly, the highest rate of strain sharing occurred between siblings, reflecting their closely shared strain colonization history (from a common reservoir or between each other). Infants born via caesarean section exhibited markedly reduced strain sharing compared with those delivered vaginally (Fig. 5b), aligning with findings from previous studies36,37,41. Bifidobacterium bifidum, Bifidobacterium longum, Phocaeicola vulgatus, Parabacterioides disasonis and Escherichia coli showed some of the largest differences in transmission rates between mothers and infants born vaginally compared with those born via caesarean section.

Fig. 5: Maternal strain transmission and persistence across a large UK infant birth cohort.
Fig. 5: Maternal strain transmission and persistence across a large UK infant birth cohort.The alternative text for this image may have been generated using AI.
Full size image

a, A bar plot indicating the fraction of pairwise relationships that involve a putative recent transmission versus the number of shared strains. SNP distances and species-specific thresholds are inferred using the TRACS algorithm. Relationships between hosts are represented by different colours. The low false-positive rate of transmission is evident, as the vast majority of inter-hospital relationships involve zero transmissions. b, Species-specific transmission rates between 175 mothers and 178 babies. Data are presented as mean estimated transmission probabilities ± 95% confidence intervals. Point estimates and error bars are inferred using a binomial model of transmission if a mother is colonized (Supplementary Methods). Points are coloured by the mode of delivery (as in a), highlighting the species-specific impact of delivery. The total number of possible transmission pairs (mothers colonized) is shown in the accompanying horizontal bar plot. c, The rate of strain persistence by species from birth to 7 days, 21 days and late infancy. Data are presented as mean estimated persistence probabilities ± 95% confidence intervals. Point estimates and error bars are inferred using a binomial model similar to the previous transmission scenario. Persistence rates that differ significantly from day 7 are shown in red (two-sided Chi-squared test). Estimates are displayed only when at least 20 babies, who were originally observed to carry a species, are sampled again at respective time points. d, An example demonstrating the persistence of multiple distinct strains of B. breve in a single infant (ID: B00560) is presented. This example was not identified by StrainPhlAn, which only considers the dominant genotype. The frequencies of 1,000 randomly chosen alleles, found to be at intermediate frequency on day 4, are shown. Initially, the purple strain dominates, but by days 7 and 21, the green strain becomes dominant. e, Plotting of individual allele frequencies of two B. breve strains (from d) relative to a portion of the B. breve reference genome.

In addition to tracking strain transmission between hosts, TRACS can monitor the longitudinal persistence of strains within a single host. This includes observing the strain stability in infants, which was found to vary significantly between species (Fig. 5c). Strains of commonly pathogenic species such as E. coli and E. faecalis were maintained for relatively short periods of time and were generally eliminated or replaced by late infancy. The persistence of the pioneer colonizers in the neonatal gut, including B. longum and B. breve strains37, remained high during the first 3 weeks of infancy but declined markedly in late infancy, probably owing to changes in diet and decreased breastfeeding rates42,43. By contrast, some other maternally transmitted commensal genera, such as B. bifidum and closely related Phocaeicola, Bacteroides and Parabacteroides species, were maintained at higher rates for the entire duration of the study. Critically, the TRACS algorithm identified 40.5% (49/121) additional strain sharing events compared with StrainPhlAn. This indicates instances of repeated co-colonization of infants with multiple strains of B. breve, which were overlooked by algorithms such as StrainPhlAn that consider only the dominant genotype (Extended Data Fig. 7). For instance, Fig. 5d,e illustrates the persistence of two B. breve strains over three time points within a single infant. Initially, the purple strain is dominant with a mean allele frequency of 0.82 (±0.081) but then drops to 0.063 (±0.181) and 0.249 (±0.156), respectively, at later time points. Taken together, these results show that TRACS reliably identifies fine-scale strain transmission and persistence in the human microbiota.

Discussion

TRACS is a versatile, modular algorithm for inferring recent pairwise transmissions from metagenomic and single-species population sequencing. It corrects for sequencing errors, uneven coverage and shared sequence homology between strains, and accounts for multiple strains of the same species within a sample. Using simulations and artificial laboratory mixtures of S. pneumoniae, we show that TRACS has a higher accuracy compared with alternative algorithms.

The scalability and versatility of the TRACS algorithm allows it to be applied to a broad spectrum of pathogens including viruses, bacteria and parasites. This includes contexts where species-specific whole-genome bioinformatics transmission pipelines are yet to be developed. For example, although WGS is not typically used to track the transmission of malaria-causing parasites, we demonstrate using single-cell genomics, that by omitting the complex and error-prone step of deconvolution, the TRACS algorithm effectively identifies shared strains of P. falciparum between samples.

Beyond pathogen surveillance, TRACS can track the colonization and carriage dynamics of the human microbiota. Using metagenomic data from FMT studies, we demonstrate that TRACS identifies instances of minority strain transmission between FMT donors and recipients, as well as longitudinally sampled infants, which are not detected by alternative tools. In addition to confirming that caesarean sections substantially reduce mother-to-baby bacterial transmission, we found species-specific differences in colonization and persistence rates. Notably, E. faecalis and B. cellulosilyticus had higher transmission rates following caesarean delivery. As with other genomic transmission studies, it is essential to consider the broader epidemiology of host interactions. For example, in some cases, both mother and infant may be independently colonized by the same strain from the hospital environment or other unsampled sources, such as family members.

TRACS was developed to identify recent transmission events. However, the algorithm is not intended for identifying high-quality within-host variants, making it unsuitable for detecting selection through dN:dS ratios. Furthermore, as the algorithm provides only a lower bound estimate of transmission or SNP distance, it cannot accurately infer genetic relationships over thousands of SNPs, which represent much longer evolutionary timescales. For these purposes, tools such as inStrain and StrainPhlAn, respectively, are more appropriate. Similar to most minority variant-calling algorithms, TRACS requires sufficient sequencing depth to distinguish true variants from sequencing errors. Consequently, by default, TRACS requires a minimum coverage of 5× coverage to accurately estimate distances between strains. As with other metagenomic transmission inference methods, TRACS does not determine transmission direction. For this, we recommend using TRACS to identify potential transmission events, followed by more computationally intensive phylogenetic methods that incorporate additional epidemiological information8,44.

Metagenomics and deep population sequencing provide a powerful alternative to relying on single reference genomes for tracking microbial transmission and persistence. TRACS enables simultaneous analysis of multiple strains and species, offering insights into how microbes spread, establish and persist within their hosts.

Methods

TRACS algorithm

The TRACS algorithm consists of three central modules that can be used independently (Fig. 1). The ‘alignment’ module generates a polymorphism-aware alignment against each reference genome found within a given sample. Given either a multi-sample reference-based alignment or a traditional MSA, the ‘distance’ module estimates pairwise SNP distances. Optionally, the distance module can also estimate the expected number of intermediate hosts separating pairs of samples using an extended version of the TransCluster algorithm17. The TransCluster algorithm requires estimates of the transmission generation time (transmissions per year) and clock rate (SNPs per genome per year) of the species to be provided. Finally, the ‘cluster’ module estimates putative transmission clusters using single-linkage hierarchical clustering.

Alignment

The alignment module takes raw sequence data as input and produces alignments to multiple reference genomes in FASTA format. IUPAC ambiguity codes are used to represent polymorphic sites in each alignment, enabling the estimation of SNP distances between populations. The alignment module supports a variety of database formats tailored to specific applications. Users can provide a single reference, utilize a Sourmash database that incorporates Genbank genome IDs or opt for a custom TRACS database. The latter includes both reference genomes and a corresponding Sourmash database, which can be constructed using the TRACS ‘build-db’ command.

If a database containing multiple references is provided, the algorithm will first select those represented in a sample using the Sourmash gather command14. Minimap245 is then used to align all reads within a sample to each reference independently and a pileup of alleles identified at each site in the reference is generated using htsbox46.

To control for variable sequencing coverage, we developed an empirical Bayes method. This is applied to each genome alignment independently. The counts of minority variants at each site within a single reference alignment are modelled jointly using a multinomial Dirichlet distribution. This allows for the estimation of the posterior frequency of variants across an individual alignment by considering the average frequency of minority variants at polymorphic sites as follows. Assume ni1 is the read count of the most frequent allele at polymorphic site i in an alignment with ni2, ni3 and ni4 being the counts of the 2nd, 3rd and 4th most frequent allele, respectively. Our goal is to estimate a prior on the expected frequency of each allele at a given site by considering the observed distribution of allele frequencies across all sites in the dataset.

We assume a Dirichlet prior on the joint frequency (pi = (pi1, pi2, pi3, pi4)) of the counts ni, with parameter vector α such that the probability density can be written as

$$p({\bf{p}}) \sim D({{\alpha }}_{1},{{\alpha }}_{2},{{\alpha }}_{3},{{\alpha }}_{4})=\frac{\varGamma ({\sum }_{k}{{\alpha }}_{k})}{{\prod }_{k}\varGamma ({{\alpha }}_{k})}\mathop{\prod }\limits_{k}{p}_{k}^{{{\alpha }}_{k}-1}.$$

Assuming the resulting read counts at each site are drawn from a multinomial distribution, the marginal likelihood is the Dirichlet-multinomial distribution given by

$$p({\bf{D}}| {\mathbf{\upalpha }})=\mathop{\prod }\limits_{i}\left(\frac{\varGamma ({\sum }_{k}{{\alpha }}_{k})}{\varGamma ({\sum }_{k}{n}_{ik}+{\sum }_{k}{{\alpha }}_{k})}\mathop{\prod }\limits_{k}\frac{\varGamma ({n}_{ik}+{{\alpha }}_{k})}{\varGamma ({{\alpha }}_{k})}\right),$$

where D indicates the vector of read counts at each site i: D = {x1, …, xN}, where xi = {ni1, ni2, ni3, ni4}.

Point estimates of the α parameters were made by maximizing the marginal likelihood using the fixed point iteration method47. The maximum a posteriori estimate of the frequency of each allele under this prior at each site can then be estimated as

$${f}_{ij}=\frac{{{\alpha }}_{j}+{n}_{ij}}{{\sum }_{k}{n}_{ik}+{\sum }_{k}{{\alpha }}_{k}}.$$

After estimating posterior frequencies at all variable sites in an alignment, TRACS applies a frequency cut-off which requires a coverage of at least five reads by default. Sites falling below this threshold are then excluded from further consideration in subsequent estimates of SNP distances. Furthermore, to control for major outliers in sequencing coverage, regions with coverage exceeding 1.5 times the interquartile range (IQR) above or below the quartiles, as identified by Tukey’s method, are excluded48.

Distance

Pairwise SNP distances are estimated using a bitset-based algorithm to enhance both the speed and memory efficiency of the process, whilst incorporating IUPAC ambiguity codes. Each sequence in an MSA is represented as a matrix of bits (0s and 1s), where rows indicate the four possible alleles and columns indicate the position in the sequence. Sites where multiple different alleles are observed are then represented by multiple 1s within a single column. A series of bitwise ‘AND’ operations are used per allele, and the results are combined with ‘OR’ operations to rapidly calculate the sites which share common alleles between each pair of samples.

To account for the impacts of recombination and shared homology between strains, we include an optional additional step to identify regions within the pairwise comparison that have an elevated rate of SNPs. This uses a similar scan statistic to that used in popular recombination-aware phylogenetic reconstruction algorithms16. While this step may exclude sites that could be informative for transmission, SNPs introduced by recombination are rarely incorporated into public health genomic pipelines owing to the challenges they pose for phylodynamic analyses. In addition, the noise introduced by cross-mapping from shared homology is likely to outweigh any potential benefits of retaining these sites.

Initially, the rate of SNPs within the alignment is calculated as \(r=\frac{d}{L}\), where d is the total number of SNPs and L is the alignment length. For each pairwise comparison, a window size wl is chosen such that the expected number of SNPs within the window is equal to one: E(np, wl) = 1. The maximum window size is set to 10 kilobases. At each SNP location, i, the number of SNPs, si, within the window centred at i is calculated. Assuming a binomial distribution, the probability of observing at least si SNPs is given by

$$P(S\ge {s}_{i})=1-\mathop{\sum }\limits_{k=0}^{{s}_{i}}\left(\begin{array}{l}{w}_{l}\\ k\end{array}\right){p}^{k}{(1-p)}^{{w}_{l}-k}.$$

A P value threshold is then specified to exclude SNPs that are centred within a window that has a significantly high rate of SNPs (P = 0.05 by default). A Bonferroni correction is made to control for multiple testing49. For SNPs that fall within a distance wl/2 of either edge of the alignment, wl is truncated.

Modified TransCluster algorithm

While SNP distances can be useful in distinguishing recent transmission events, they do not incorporate the sampling times of the hosts being considered. To incorporate the sampling times, TRACS includes an extended version of the TransCluster algorithm that both improves its speed and allows for the estimation of the expected number of intermediate hosts separating two samples.

Assuming SNPs accumulate at rate γ and an average epidemiological generation time (the time between transmission events) β, the TransCluster algorithm can be used to estimate the probability of k intermediate hosts occuring along the transmission chain between a pair of sampled hosts17. Given an SNP distance of N and a time separation of δ between the two samples, Stimson et al.17 showed that the distribution of k intermediate hosts could be written as

$$\begin{array}{c}P(k| h,\delta )=\frac{{{\rm{e}}}^{-\delta (\lambda +\beta )}{\lambda }^{N+1}{\beta }^{k}}{k!F(N)}{\int }_{h=0}^{\infty }{{\rm{e}}}^{-h(\lambda +\beta )}{(h+\delta )}^{k}\mathop{\sum }\limits_{i=0}^{N}{h}^{i}\left(\frac{{\delta }^{N-i}}{i!(N-i)!}\right){\rm{d}}h,\end{array}$$

where h is the evolutionary time separating two samples. In the original implementation of TransCluster, this expression was computed via numerical integration. This method can become computationally prohibitive when considering thousands of pairwise comparisons. We show that this equation can be re-written as a finite sum that can be solved rapidly using efficient caching of intermediate values as follows:

$$P(k| h,\delta )=\frac{{\lambda }^{N+1}{\beta }^{k}(N+k)!}{{{\rm{e}}}^{\delta \beta }N!k!{\sum }_{i=0}^{N}\frac{{(\lambda \delta )}^{i}}{i!}}\mathop{\sum }\limits_{i=0}^{N+k}\frac{{\delta }^{N+k-i}}{(N+k-i)!{(\lambda +\beta )}^{i+1}}.$$

A full derivation is given in Supplementary Methods. Given this derivation, the expected number of intermediate hosts can be written as

$$\begin{array}{rcl}E(K| h,\delta ) & = & \mathop{\sum }\limits_{k=0}^{\infty }kP(k| h,\delta )\\ & = & \mathop{\sum }\limits_{k=0}^{\infty }k\frac{{\lambda }^{N+1}{\beta }^{k}(N+k)!}{{{\rm{e}}}^{\delta \beta }N!k!{\sum }_{i=0}^{N}\frac{{(\lambda \delta )}^{i}}{i!}}\mathop{\sum }\limits_{i=0}^{N+k}\frac{{\delta }^{N+k-i}}{(N+k-i)!{(\lambda +\beta )}^{i+1}}\end{array}.$$

While this is an infinite sum, an upper bound on the error after truncating at a given k can be calculated (Supplementary Methods), allowing for the sum to be calculated to a user-specified precision. Again, TRACS makes use of efficient caching strategies to reduce the number of times this equation is calculated.

Clustering

Consistent with other transmission inference pipelines, TRACS implements single-linkage hierarchical clustering to identify putative transmission clusters. The cluster module takes a list of pairwise distance estimates output from the distance module and a user-supplied SNP or transmission distance threshold and outputs a CSV file indicating which genome belongs to each cluster. Guidance on the selection of an appropriate threshold is given below. Single linkage clustering is preferable as it allows for long transmission chains to be grouped into a single cluster.

Inferring a transmission threshold

To determine an appropriate SNP threshold for distinguishing recent transmission, we assume that it is possible to provide two sets of samples where the first includes possible transmission events and the second is unlikely to include any recent transmission. By fitting a mixture distribution to these samples, we can calculate a suitable threshold.

We assume that SNP distances between pairs of samples that are unlikely to be related by recent transmission (du) are distributed according to a negative binomial distribution with parameters n and p such that

$$p({d}_{{\rm{u}}}=x)=\frac{\varGamma (x+n)}{\varGamma (n)x!}{p}^{n}{(1-p)}^{x}.$$

Moreover, we assume that the distribution of SNP distances between pairs of SNPs related by recent transmission (dr) follows a Poisson distribution with mean λ

$$p({d}_{{\rm{r}}}=x)=\frac{{\lambda }^{x}{{\rm{e}}}^{-\lambda }}{x!}.$$

Thus, the distribution of SNP distances in the dataset that contains both recent transmission and more distant transmission (dm) can be modelled as a mixture of these two distributions, where q indicates the probability that any given pair of samples are related by recent transmission,

$$p({d}_{{\rm{m}}}=x)=q* p({d}_{{\rm{r}}}=x)+(1-q)* p({d}_{{\rm{u}}}=x).$$

To estimate these parameters we fit the negative binomial model to the dataset containing exclusively distantly related samples using maximum likelihood. This produces a distribution of the SNP distance between distantly related samples on the basis of the second dataset. Substituting these estimates into the mixture then allows for the estimation of the remaining parameters using maximum likelihood. Note that, during this second estimation, we use the first dataset without the second dataset. Finally, an SNP threshold can be chosen by selecting the probability that a close transmission pair is erroneously excluded.

To select thresholds for the analysis of transmission in the mother and baby dataset36,37, repeated samples of the same baby were used as the ‘close transmission’ dataset and pairs of samples from different infants were used as the ‘distant’ dataset. In the FMT dataset, the ‘close transmission’ dataset consisted of samples from the same patient post-FMT, and the ‘distant’ dataset consisted of pairs of samples taken from unrelated donors. In both cases, a probability of missing a transmission of 0.001 was chosen. To adjust for the fact that the ‘close transmission’ pairs were from the same patient, the estimated thresholds were multiplied by three to allow for twice the evolutionary time to occur in the mother/donor.

Alternative algorithms and databases

In all cases, StrainPhlAn was run using version 4.0.5 with the default database using the command line parameters outlined in the study by Valles-Colomer et al.4.

For the metagenomic simulations, TRACS, InStrain and StrainGE were run using a reference database comprising the GTDB represenative genomes that correspond to the 11 bacterial species identified in a study investigating the transmission of drug-resistant pathogens using metagenomics24.

For the pneumococcal mixed strain simulations, TRACS and StrainGE were run using representative genomes from each sequencing cluster described in the Global Pneumococcal Sequencing project29. As InStrain required a dereplicated dataset, the 19F serotype included in the laboratory mixtures was used to run InStrain on the simulated pneumococcal deep population sequencing data.

The default databases for each tool were used in the analysis of the FMT metagenomic data35. The GTDB r207 bacterial reference database was used38 to run TRACS. InStrain was run using the UHGG reference database as described in the InStrain user manual. As it was computationally expensive to consider all possible species when running StrainGE on the FMT data, we restricted all analyses to the Bifidobacterium genus. Reference genomes were downloaded from Refseq and the database was constructed following the StrainGE user manual.

TRACS used the same pneumococcal reference database to analyse the pneumococcal deep sequencing data from the Maela refugee camp. To analyse the SARS-CoV-2 and P. falciparum deep population sequencing data, the reference genome from each species was used when running TRACS. Similar to the FMT data, TRACS was run using the GTDB r207 bacterial reference database to analyse the UK infant birth cohort.

To compare the resource requirements of each algorithm, we measured CPU time required for all simulated metagenomic samples where one genome was simulated to have been transmitted (Extended Data Fig. 4). Memory was compared on a representative pair of samples. Comparing the results across algorithms is challenging, as each requires different tasks to be performed separately by the user, such as read alignment and SNP distance calculation. To ensure a fair comparison, we used the same tools employed by TRACS for operations not automatically handled by other pipelines. Although the exact differences in speed and memory usage may vary depending on the dataset, TRACS consistently demonstrated the lowest memory and CPU usage in this test, outperforming the next best algorithm by approximately a factor of 2 (Extended Data Fig. 4).

The exact commands used to run StrainPhlAn, inStrain and StrainGE are given in the supplementary code provided in the associated GitHub repository50.

Simulated datasets

Given a user-specified SNP distance and a set of reference genomes, transmission of an individual strain between pairs of metagenomic or population sequencing samples were simulated using a custom python script provided as part of the TRACS package.

The number of reference genomes included in each sample was simulated using a Poisson distribution. The user-specified number of SNPs was then simulated by randomly introducing mutations along one reference genome. This genome was included in both samples in a pair. The remaining reference genomes in each sample were randomly selected from the user-provided list. In cases where both samples shared a reference genome other than the transmitted one, additional SNPs were randomly introduced. The total number of SNPs separating non-transmitted strains was drawn from a Poisson distribution with a mean of 10,000. This value was chosen to ensure that all methods should be able to reliably distinguish the small genetic distances characteristic of recent transmission from much larger distances representing hundreds of years of evolution. While closely related strains can be shared through routes other than transmission, such events cannot be distinguished solely by genetic distance. As a result, they are not considered in detail in these simulations.

Simulated sequencing depths were drawn from a Dirichlet distribution with all parameters equal to 1. A cumulative read depth across all genomes of 500 reads was used, and Illumina reads were generated using the ART synthetic read simulator51, similar to the simulation method used in the development of StrainGE10. This coverage corresponded to an average of 18 million and 7 million reads in the metagenomic and pneumococcal simulations, respectively. Simulated Nanopore reads were generated using Badread v0.4.152. A full set of parameters and commands can be found in the supplementary code provided in the GitHub repository50.

To simulate transmission between gut metagenomes, 11 bacterial species were selected from a study investigating the transmission of drug-resistant pathogens using metagenomics24. A series of high-quality reference genomes were then selected from the full GTDB r207 database that averaged at 100%, 99% and 97% in pairwise sequence identity from the GTDB r207 representative genomes for each species. Simulated transmission pairs were then generated at each identity level. To increase the complexity of the metagenomic simulations, and in line with previous simulation approaches9,10, real metagenomic samples were appended to the in silico simulated data. A pair of samples were chosen from the FMT dataset for this purpose, which were found not to share strains by all algorithms (accessions ERR9707885 and ERR9709032).

To simulate transmission among deeply sequenced populations of pneumococci, we randomly selected 28 pneumococcal genomes from a prior study53. A full list of all commands used along with the reference genomes is provided in the associated GitHub repository50.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.