Extended Data Fig. 4: Computational resource requirements and scalability of metagenomic SNP distance estimation pipelines.
From: Strain-level transmission inference across multi-kingdom metagenomic data using TRACS

A) The average CPU time required to run all components of each pipeline on a representative simulated pair of metagenomic samples in which a single genome was simulated to have been transmitted. Here, InStrain (assembly) refers to the pipeline that relies on metaSPAdes to generate custom reference databases for each pair of samples. The time taken to run metaSPAdes is shaded and is included only as a representative example; alternative assembly algorithms would have different resource requirements. In the remaining panels, only the InStrain portion of the resource requirements is shown. B) The average CPU time required to estimate SNP distances from pre-processed samples in each pipeline. StrainPhlAn has been excluded because any pairwise SNP distance algorithm can be used on the MSAs it generates. C) The projected CPU time required for each pipeline if all pairs in an analysis were considered. This is the sum of the individual sample preprocessing times and the CPU time required for the distance calculations multiplied by the number of pairwise comparisons. Here, we have excluded the assembly time in the InStrain (assembly) pipeline. Consequently, these runtimes represent any version of the InStrain pipeline that relies on pair-specific references. D) The memory required to run each pipeline. All resource comparisons were run on the same compute server, an Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20GHz with 112 CPUs and 3 TB of memory. These results offer a general indication of each algorithm’s performance; the exact differences in speed and memory usage will vary with the characteristics of each dataset.
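The projection described for panel C can be sketched as a short calculation. This is a minimal illustration, not the authors' code. It assumes that "all pairs" means the n(n-1)/2 unordered pairs among n samples, and that preprocessing is paid once per sample while the distance calculation is paid once per pair. The function name, parameter names and timings are hypothetical and are not measurements from the figure.

```python
from math import comb


def projected_cpu_time(n_samples: int,
                       preprocess_per_sample: float,
                       distance_per_pair: float) -> float:
    """Project total CPU time (in the units of the inputs) for an all-pairs analysis.

    Assumption: preprocessing runs once per sample, and the distance
    step runs once per unordered pair of samples.
    """
    n_pairs = comb(n_samples, 2)  # n * (n - 1) / 2 unordered pairs
    return n_samples * preprocess_per_sample + n_pairs * distance_per_pair


if __name__ == "__main__":
    # Hypothetical timings in CPU seconds, chosen only for illustration.
    for n in (10, 100, 1000):
        total = projected_cpu_time(n, preprocess_per_sample=60.0,
                                   distance_per_pair=2.0)
        print(f"{n} samples: {total:.0f} CPU seconds")
```

Under these assumptions the per-pair term grows quadratically with the number of samples, so the cost of the distance step comes to dominate the projection as the analysis grows, even when preprocessing dominates for a single pair.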