Fig. 2: Predicted effects of observed amino acids using an IND model (neglecting epistasis) or a DCA model (incorporating pairwise epistasis).
From: Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes

a Rank of the native amino acid in the reference strain among all 20 possible amino acids. The DCA model (red) outperforms the IND model (yellow), predicting twice as many native amino acids to be the best possible. b DCA rank of the major and minor alleles, among all 20 possible amino acids, for all sites that are polymorphic at a >5% threshold. Major alleles (alleles at frequencies >50%, in red) have better ranks than minor alleles (alleles at frequencies between 5 and 50%, in pink). The distribution of major alleles peaks at the first rank (46.2% of polymorphic sites have the major allele ranking first and 17.6% have it ranking second), while the distribution of minor alleles peaks at the second rank (13.3% rank first against 17.6% that rank second). c IND rank of the major and minor alleles, among all 20 possible amino acids, for all sites that are polymorphic at a >5% threshold. As with DCA, major alleles (in orange) have better ranks than minor alleles (in yellow), and the distribution of major alleles peaks at the first rank. However, the distribution is spread towards greater ranks than with the DCA ranking: only 24.1% of polymorphic sites have the major allele ranking first and 15.5% have it ranking second; similarly, minor alleles rank first in 9.6% and second in 13.3% of polymorphic sites. d Distribution of DCA scores of non-synonymous polymorphisms observed at frequencies >5% across the >60,000 strains (blue), compared to mutations sampled from an IND model (yellow) or to random mutations (gray). A large number of possible mutations are predicted to be highly deleterious (positive scores), whereas naturally occurring polymorphisms tend to be neutral (blue distribution centered on zero). Polymorphisms predicted from IND are slightly deleterious once epistasis is taken into account (yellow distribution shifted towards positive values).
Boxplot center lines represent medians, box limits are upper and lower quartiles, whiskers extend to show the rest of the distribution within a 1.5 × interquartile range, outliers are represented with points; sample size is 3477 mutations for each of the three groups.
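
The ranks shown in panels a to c can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that lower scores are more favorable (consistent with positive scores being deleterious in panel d), and it uses a hypothetical site-independent score, the negative log of a pseudocounted amino acid frequency in a single alignment column, as a stand-in for an IND model. The names `ind_scores` and `rank_of`, the pseudocount and the toy column are all illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def ind_scores(column, pseudocount=1.0):
    """Hypothetical IND-style score for one alignment column.

    Assumption: score = -log(pseudocounted frequency), so frequent
    amino acids get low (favorable) scores. Gaps and other symbols
    are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {
        aa: -math.log((counts[aa] + pseudocount) / total)
        for aa in AMINO_ACIDS
    }


def rank_of(allele, scores):
    """Rank of `allele` among all scored amino acids (1 = best).

    Ties share the best available rank.
    """
    return 1 + sum(1 for s in scores.values() if s < scores[allele])


# Toy column: L is the major allele (60%), I a minor allele (30%).
column = "LLLLLLIIIV"
scores = ind_scores(column)
print(rank_of("L", scores))  # 1
print(rank_of("I", scores))  # 2
```

The same `rank_of` function would apply to scores from a model with pairwise couplings. Only the score dictionary would change, since it would then depend on the rest of the sequence rather than on the column alone.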