Table 3. Popular computational tools designed to mitigate contamination in sequence data
From: Guidelines for preventing and reporting contamination in low-biomass microbiome studies
Tool | Principle/approach | Strengths | Limitations/considerations |
---|---|---|---|
decontam (prevalence mode) [43] | Principle: Contaminating taxa are more prevalent in negative controls than in true samples. Approach: Prevalence mode uses chi-square or Fisher’s exact tests to compare the presence–absence of each taxon between samples and negative controls. Contaminating taxa are completely removed from the dataset. | Requires no previous knowledge of contaminant sources; provides an alternative test threshold that allows identification of non-contaminant taxa when significant contaminants are expected, such as in extremely low-biomass samples | The model does not account for situations in which a taxon is both a contaminant and a genuine community member; reduced sensitivity for contaminants present in only a few samples or when fewer negative controls are available |
decontam (frequency mode) [43] | Principle: Contaminating taxa have higher frequencies in samples with lower input microbial DNA/biomass (inverse correlation). Approach: Frequency mode fits the log-transformed frequency of each taxon against log-transformed total DNA (or another biomass proxy) and compares a contaminant model with a slope of −1 against a non-contaminant model with a slope of zero. Contaminating taxa are completely removed from the dataset. | Requires no previous knowledge of contaminant sources; can be applied even when negative controls are unavailable or insufficient | Limited performance when contaminants comprise a major fraction of sequences; requires per-sample measurements of microbial DNA or biomass; model assumptions are violated if microbial biomass differs systematically between sample groups |
SourceTracker [45] | Principle: Contaminating taxa in the sample are introduced from diverse external sources. Approach: SourceTracker uses a Bayesian approach to determine the proportion of a sample community that is consistent with known contaminating source communities. | Allows estimation of the contribution of possible sources to contamination in a sample; provides modelling of uncertainty regarding known and unknown source environments | Limited utility when there is insufficient information regarding the community composition of the contaminating source; does not specify contaminating taxa; not applicable in cases of cross-contamination |
microDecon [86] | Principle: Taxa from a common contamination source will be introduced together to the samples at similar proportions. Approach: microDecon proportionally removes taxa present in negative controls from the samples, using the ratio in the controls between each taxon and an anchor contaminant (a taxon shared by controls and samples and determined to be the most probable contaminant by multiple linear regressions). | Requires no previous knowledge of contaminant sources; allows partial removal of taxa genuinely present in both the sample environment and the contaminating sources | Processes one sample at a time, ignoring information shared across samples; limited performance when control communities show substantial variability; partial removal of reads from taxa commonly present in negative controls can have unpredictable consequences for subsequent analyses |
SCRuB [44] | Principle: Contamination from a common source will be introduced at similar proportions across samples. Approach: SCRuB uses a probabilistic approach to model the observed data likelihood by estimating the sample composition, shared contamination sources, the proportion of each sample from contamination and the spatial position of samples during processing. Contaminating taxa are proportionally removed from the samples to maximize the data likelihood. | Requires no previous knowledge of contaminant sources; model includes cross-contamination among samples and the spatial location (for example, location on a 96-well plate) of a sample during processing; allows partial removal of taxa genuinely present in both the sample environment and the contaminating sources; allows accounting for multiple contamination sources, including across different batches | Maximizing utility requires sufficient information regarding batches and well locations; partial removal of reads from taxa commonly present in negative controls can have unpredictable consequences for subsequent analyses |
Squeegee [82] | Principle: Taxa from a common contamination source will be introduced together to the samples at similar proportions. Approach: Squeegee predicts contaminants in taxonomically classified reads by evaluating taxa prevalence across samples, pairwise similarity between samples and coverage of the reference genome of each candidate contaminant species. Contaminating taxa are completely removed from the dataset. | Requires no previous knowledge of contaminant sources; can be applied even when negative controls are unavailable or insufficient; designed for decontaminating metagenomic datasets; may allow the detection of batch-specific contaminants or cross-contaminants by analysing individual sample batches independently | Limited performance for contaminating taxa that are not abundant; requires multiple sample groups with highly dissimilar communities that are exposed to the same potential contaminants; the model does not account for situations in which a taxon is both a contaminant and a genuine community member |
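
The prevalence principle in the first row can be illustrated with a small worked example. The sketch below is not the decontam implementation (decontam is an R package with its own scoring). It only shows the underlying idea of testing whether a taxon is present in a larger share of negative controls than of true samples, using a one-sided Fisher's exact test. The function names, the toy presence–absence calls and the 0.05 cut-off are all assumptions made for illustration.

```python
from math import comb


def fisher_one_sided(ctrl_present, ctrl_total, samp_present, samp_total):
    """One-sided Fisher's exact p-value that a taxon is enriched in controls."""
    total = ctrl_total + samp_total
    present = ctrl_present + samp_present
    # Hypergeometric upper tail: chance that at least `ctrl_present` controls
    # contain the taxon if presence were unrelated to library type.
    upper = min(ctrl_total, present)
    tail = sum(
        comb(present, k) * comb(total - present, ctrl_total - k)
        for k in range(ctrl_present, upper + 1)
    )
    return tail / comb(total, ctrl_total)


def flag_prevalence_contaminants(presence, is_control, alpha=0.05):
    """Return taxa whose prevalence is significantly higher in controls.

    `presence` maps a taxon name to per-library presence calls (bools);
    `is_control` marks which libraries are negative controls.
    `alpha` is an illustrative cut-off, not a recommended default.
    """
    ctrl_total = sum(is_control)
    samp_total = len(is_control) - ctrl_total
    flagged = []
    for taxon, calls in presence.items():
        ctrl_present = sum(1 for c, ctl in zip(calls, is_control) if c and ctl)
        samp_present = sum(1 for c, ctl in zip(calls, is_control) if c and not ctl)
        p = fisher_one_sided(ctrl_present, ctrl_total, samp_present, samp_total)
        if p < alpha:
            flagged.append(taxon)
    return flagged


# Hypothetical data: 8 true samples followed by 4 negative controls.
is_control = [False] * 8 + [True] * 4
presence = {
    "taxon_A": [True] + [False] * 7 + [True] * 4,  # 1/8 samples, 4/4 controls
    "taxon_B": [True] * 8 + [False] * 4,           # 8/8 samples, 0/4 controls
}
flagged = flag_prevalence_contaminants(presence, is_control)
print(flagged)
```

With so few controls the test has little power, which mirrors the limitation noted in the table: a contaminant seen in only a few libraries, or a run with only one or two negative controls, may not reach any reasonable threshold.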