Table 3 Popular computational tools designed to mitigate contamination in sequence data

From: Guidelines for preventing and reporting contamination in low-biomass microbiome studies

Each entry below gives the tool, its underlying principle and approach, its strengths, and its limitations/considerations.
decontam (prevalence mode) [43]

Principle: Contaminating taxa are more prevalent in negative controls than in true samples.

Approach: decontam's prevalence mode applies chi-square or Fisher's exact tests to the presence–absence pattern of each taxon across samples and negative controls; taxa classified as contaminants are removed entirely from the dataset (see the sketch after this entry).

Strengths:
- Requires no prior knowledge of contaminant sources.
- Provides an alternative test threshold that allows non-contaminant taxa to be identified when substantial contamination is expected, such as in extremely low-biomass samples.

Limitations/considerations:
- The model does not account for situations in which a taxon is both a contaminant and a genuine community member.
- Sensitivity for detecting contaminants is reduced when they occur in only a few samples or when few negative controls are available.
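
decontam itself is an R/Bioconductor package; the Python sketch below only illustrates the prevalence-test idea with a one-sided Fisher's exact test on hypothetical presence–absence counts. The function name and numbers are invented for illustration; the 0.1 cutoff matches decontam's default classification threshold.

```python
# Minimal sketch of the prevalence-mode idea; this is NOT decontam's code
# (decontam is an R/Bioconductor package). All numbers are hypothetical.
from scipy.stats import fisher_exact

def prevalence_pvalue(present_in_samples, n_samples,
                      present_in_controls, n_controls):
    """One-sided Fisher's exact test asking whether a taxon is more
    prevalent in negative controls than in true samples."""
    table = [
        [present_in_controls, n_controls - present_in_controls],
        [present_in_samples, n_samples - present_in_samples],
    ]
    # alternative="greater" tests enrichment of presence in the control row
    _, p = fisher_exact(table, alternative="greater")
    return p

# A taxon seen in 7 of 8 negative controls but in only 5 of 40 true samples
p = prevalence_pvalue(present_in_samples=5, n_samples=40,
                      present_in_controls=7, n_controls=8)
flag_as_contaminant = p < 0.1  # 0.1 matches decontam's default threshold
```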

decontam (frequency mode) [43]

Principle: Contaminating taxa occur at higher frequencies in samples with lower input microbial DNA/biomass (an inverse correlation).

Approach: decontam's frequency mode fits the log-transformed frequency of each taxon against log-transformed total DNA (or another biomass proxy) and compares a contaminant model with a slope of −1 to a non-contaminant model with a slope of 0; taxa classified as contaminants are removed entirely from the dataset (see the sketch after this entry).

Strengths:
- Requires no prior knowledge of contaminant sources.
- Can be applied even when negative controls are unavailable or insufficient.

Limitations/considerations:
- Limited performance when contaminants comprise a major fraction of the sequences.
- Requires per-sample measurements of microbial DNA or biomass.
- Model assumptions are violated if microbial biomass differs systematically between sample groups.
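
Again, decontam's frequency mode is R code; the sketch below only illustrates the underlying model comparison, scoring how well a taxon's log frequency follows a slope of −1 against log DNA concentration versus a constant (slope 0) model. decontam converts this comparison into a P value; here the score is simply a residual ratio, and all numbers are invented.

```python
# Sketch of the frequency-mode comparison (not decontam's implementation).
# Contaminant model: log(freq) = -log(DNA) + b; non-contaminant model:
# log(freq) = constant. A ratio below 1 favours the contaminant model.
import numpy as np

def frequency_score(freq, dna_conc):
    mask = (freq > 0) & (dna_conc > 0)
    logf, logc = np.log(freq[mask]), np.log(dna_conc[mask])
    # Residuals around the best-fitting line with slope fixed at -1
    ss_contam = np.sum((logf + logc - np.mean(logf + logc)) ** 2)
    # Residuals around the best-fitting line with slope fixed at 0
    ss_noncontam = np.sum((logf - np.mean(logf)) ** 2)
    return ss_contam / ss_noncontam

freq = np.array([0.08, 0.04, 0.01, 0.005])  # taxon relative abundance
dna = np.array([1.0, 2.0, 8.0, 16.0])       # DNA concentration proxy
contaminant_like = frequency_score(freq, dna) < 1  # frequency ~ 1/DNA
```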

SourceTracker [45]

Principle: Contaminating taxa in the sample are introduced from diverse external sources.

Approach: SourceTracker uses a Bayesian approach to determine the proportion of a sample community that is consistent with known contaminating source communities (a simplified sketch follows this entry).

Strengths:
- Allows estimation of the contribution of possible sources to contamination in a sample.
- Provides modelling of uncertainty regarding known and unknown source environments.

Limitations/considerations:
- Limited utility when there is insufficient information regarding the community composition of the contaminating source.
- Does not specify contaminating taxa.
- Not applicable in cases of cross-contamination.
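
SourceTracker estimates source proportions with a Bayesian Gibbs-sampling mixture model; the sketch below swaps that for a much cruder stand-in, non-negative least squares against known source profiles, with any unexplained mass treated as an "unknown" contribution. The profiles, names and numbers are hypothetical.

```python
# Crude stand-in for source apportionment (SourceTracker itself uses a
# Bayesian Gibbs-sampling mixture model, not least squares).
import numpy as np
from scipy.optimize import nnls

def source_proportions(sink, sources):
    """sink: relative-abundance vector over taxa; sources: one row per
    known source community, columns matching the sink's taxa."""
    weights, _ = nnls(sources.T, sink)       # non-negative mixing weights
    unknown = max(0.0, 1.0 - weights.sum())  # mass unexplained by known sources
    return weights, unknown

sources = np.array([[0.70, 0.20, 0.10, 0.00],   # hypothetical reagent profile
                    [0.00, 0.10, 0.30, 0.60]])  # hypothetical skin profile
sink = np.array([0.35, 0.15, 0.20, 0.30])       # the sample of interest
weights, unknown = source_proportions(sink, sources)
```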

microDecon [86]

Principle: Taxa from a common contamination source are introduced into samples together, at similar proportions.

Approach: microDecon proportionally removes taxa present in negative controls from the samples, scaling by their ratio to an anchor contaminant in the controls (a taxon shared by controls and samples that is determined to be the most probable contaminant by multiple linear regression); see the sketch after this entry.

Strengths:
- Requires no prior knowledge of contaminant sources.
- Allows partial removal of taxa genuinely present in both the sample environment and the contaminating sources.

Limitations/considerations:
- Processes one sample at a time, ignoring information shared across samples.
- Limited performance when control communities show substantial variability.
- Partial removal of reads from taxa commonly present in negative controls can have unpredictable consequences for subsequent analyses.
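
microDecon is an R package; the sketch below is a simplified illustration of the scaling-and-subtraction step only, skipping microDecon's regression-based anchor selection and its handling of proportions. The taxa, counts and anchor choice are hypothetical.

```python
# Simplified illustration of anchor-scaled blank subtraction (not the
# microDecon R package, which also selects the anchor by regression and
# works on proportions rather than raw counts).
import numpy as np

def subtract_blank(sample, blank, anchor_idx):
    """Scale the blank profile to the sample via the anchor taxon, which is
    assumed to be pure contamination, then subtract it from the sample."""
    scale = sample[anchor_idx] / blank[anchor_idx]  # per-sample scaling ratio
    corrected = sample - scale * blank              # proportional removal
    return np.clip(corrected, 0, None)              # no negative counts

blank = np.array([500, 100, 0, 50])     # reads in the negative control
sample = np.array([250, 900, 400, 60])  # reads in one true sample
cleaned = subtract_blank(sample, blank, anchor_idx=0)  # taxon 0 as anchor
```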

SCRuB [44]

Principle: Contamination from a common source will be introduced at similar proportions across samples.

Approach: SCRuB uses a probabilistic approach to model the likelihood of the observed data by estimating the sample composition, shared contamination sources, the proportion of each sample derived from contamination and the spatial position of samples during processing; contaminating taxa are proportionally removed from the samples to maximize the data likelihood (see the sketch after this entry).

Strengths:
- Requires no prior knowledge of contaminant sources.
- The model includes cross-contamination among samples and the spatial location (for example, position on a 96-well plate) of a sample during processing.
- Allows partial removal of taxa genuinely present in both the sample environment and the contaminating sources.
- Accounts for multiple contamination sources, including across different batches.

Limitations/considerations:
- Maximizing utility requires sufficient information regarding batches and well locations.
- Partial removal of reads from taxa commonly present in negative controls can have unpredictable consequences for subsequent analyses.
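
SCRuB jointly infers true compositions, contamination sources, per-sample contamination fractions and spatial effects by maximizing the data likelihood; the sketch below is far simpler, grid-searching only the contamination fraction for a single sample under a multinomial mixture of two profiles that are assumed known. All profiles and counts are invented.

```python
# Toy version of the mixture idea behind SCRuB: the observed reads are
# modelled as a multinomial draw from (1 - p) * true + p * contaminant.
# SCRuB estimates everything jointly; here only the contamination
# fraction p is fitted, and both profiles are assumed known.
import numpy as np
from scipy.stats import multinomial

def estimate_contamination_fraction(counts, true_profile, contam_profile):
    best_p, best_ll = 0.0, -np.inf
    for p in np.linspace(0.0, 1.0, 101):
        mix = (1 - p) * true_profile + p * contam_profile
        ll = multinomial.logpmf(counts, n=counts.sum(), p=mix)
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p

true_profile = np.array([0.6, 0.3, 0.1, 0.0])    # hypothetical clean profile
contam_profile = np.array([0.0, 0.1, 0.2, 0.7])  # hypothetical contaminant profile
counts = np.array([420, 250, 120, 210])          # observed reads in one sample
p_hat = estimate_contamination_fraction(counts, true_profile, contam_profile)
```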

Squeegee [82]

Principle: Taxa from a common contamination source are introduced into samples together, at similar proportions.

Approach: Squeegee predicts contaminants among taxonomically classified reads by evaluating taxon prevalence across samples, pairwise similarity between samples and coverage of the reference genomes of candidate contaminant species; taxa classified as contaminants are removed entirely from the dataset (see the sketch after this entry).

Strengths:
- Requires no prior knowledge of contaminant sources.
- Can be applied even when negative controls are unavailable or insufficient.
- Designed for decontaminating metagenomic datasets.
- May allow detection of batch-specific contaminants or cross-contaminants when individual sample batches are analysed independently.

Limitations/considerations:
- Limited performance for contaminating taxa that are not abundant.
- Requires multiple sample groups with highly dissimilar communities that are exposed to the same potential contaminants.
- The model does not account for situations in which a taxon is both a contaminant and a genuine community member.
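
Squeegee combines several lines of evidence (prevalence across samples, sample-to-sample similarity and read coverage over candidate reference genomes); the sketch below illustrates only the prevalence heuristic, flagging taxa that appear in most samples of every otherwise dissimilar group. The matrix, group labels and 0.8 cutoff are hypothetical, and this is not Squeegee's implementation.

```python
# Illustration of one ingredient of Squeegee-style contaminant prediction:
# taxa that are highly prevalent across dissimilar sample groups are
# candidate contaminants. (Squeegee itself also weighs sample similarity
# and reference-genome coverage; this is not its implementation.)
import numpy as np

def candidate_contaminants(presence, groups, prevalence_cutoff=0.8):
    """presence: boolean (samples x taxa) matrix; groups: group label per
    sample. A taxon is flagged if it exceeds the cutoff in every group."""
    flagged = []
    for taxon in range(presence.shape[1]):
        prevalences = [presence[groups == g, taxon].mean()
                       for g in np.unique(groups)]
        if min(prevalences) >= prevalence_cutoff:
            flagged.append(taxon)
    return flagged

presence = np.array([[1, 1, 0, 1],
                     [1, 0, 0, 1],
                     [1, 1, 1, 1],
                     [1, 0, 1, 1]], dtype=bool)
groups = np.array(["gut", "gut", "soil", "soil"])  # dissimilar communities
candidates = candidate_contaminants(presence, groups)  # flags taxa 0 and 3
```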