Supplementary Figure 6: Outliers in ProteomeHD and their impact on coexpression metrics.
From: Co-regulation map of the human proteome enables identification of protein functions

(a) Co-regulated protein pairs in ProteomeHD were divided into those detected by treeClust but not by Pearson's correlation coefficient (PCC), and vice versa. Separate comparisons were made for pairs detected by treeClust but not rho, and by treeClust but not bicor. The pairs in the resulting groups were classified using Reactome annotations as known, biologically relevant interactions (true positives) or as pairs unlikely to have any biological association (false positives). Note that treeClust-specific pairs tend to be true positives, whereas correlation-specific pairs tend to be false positives. (b) This panel complements Fig. 2f. Outliers were detected in ProteomeHD via their Mahalanobis distance, i.e., these outliers are located far from the bulk of the data but can be close to the regression line. The boxplots show that Mahalanobis outliers are more frequent in protein pairs detected specifically by rho or bicor than in pairs detected specifically by treeClust. The number of protein pairs shown corresponds to n for each group as indicated in (a). (c) Removing these Mahalanobis outliers has little impact on the PCC of treeClust-, rho- or bicor-specific protein pairs, in contrast to what was observed for Pearson’s correlation (see Fig. 2g). For the number of protein pairs shown, see panel (a). (d) A second type of outlier, the regression outlier, was detected in ProteomeHD via studentized residuals. These outliers are located far from the regression line and decrease correlation coefficients. An example of a true association is shown in which regression outliers affect the resulting correlation. Fold-changes have been scaled to lie between 0 and 1. (e) The percentage of regression outliers is very similar in all six groups. See panel (a) for the number of protein pairs shown. (f) Removing regression outliers increases the PCC of protein pairs that were previously detected only by treeClust, suggesting that PCC missed some of these pairs because of regression outliers. This is not the case for pairs missed by rho or bicor. See panel (a) for the number of protein pairs shown.
For boxplots, lower and upper hinges correspond to the first and third quartiles, and lower and upper whiskers extend to the smallest or largest value no further than 1.5 × IQR (interquartile range) from the hinge, respectively. Notches give roughly a 95% confidence interval for comparing medians.
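The two outlier types in panels (b) to (f) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the data points are invented, the helper names (`mahalanobis_sq`, `studentized_residuals`, `without`) are hypothetical, and the choices of flagging the single most distant point and of a cutoff of |t| > 2 are assumptions, since the legend does not state the thresholds that were used. The residuals here are internally studentized, which is also an assumption.

```python
import math


def _centered_sums(x, y):
    """Means plus centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def pearson(x, y):
    """Pearson's correlation coefficient (PCC)."""
    _, _, sxx, syy, sxy = _centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each point from the bivariate mean."""
    n = len(x)
    mx, my, sxx, syy, sxy = _centered_sums(x, y)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 scatter matrix
    return [
        (n - 1) * (syy * (a - mx) ** 2
                   - 2 * sxy * (a - mx) * (b - my)
                   + sxx * (b - my) ** 2) / det
        for a, b in zip(x, y)
    ]


def studentized_residuals(x, y):
    """Internally studentized residuals of the least-squares line y ~ x."""
    n = len(x)
    mx, my, sxx, _, sxy = _centered_sums(x, y)
    slope = sxy / sxx
    resid = [b - (my + slope * (a - mx)) for a, b in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))  # residual std. error
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [r / (s * math.sqrt(1 - h)) for r, h in zip(resid, leverage)]


def without(values, dropped):
    """Copy of `values` with the indices in `dropped` removed."""
    return [v for i, v in enumerate(values) if i not in dropped]


# Invented pair 1: an uncorrelated bulk plus one point far away from it.
x1 = [0.1, 0.1, 0.3, 0.3, 0.2, 1.0]
y1 = [0.1, 0.3, 0.1, 0.3, 0.2, 1.0]
d2 = mahalanobis_sq(x1, y1)
far = max(range(len(d2)), key=d2.__getitem__)  # most distant point (assumed rule)
pcc_with_far = pearson(x1, y1)
pcc_without_far = pearson(without(x1, {far}), without(y1, {far}))

# Invented pair 2: a true linear association plus one point off the line.
x2 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.45]
y2 = [0.12, 0.18, 0.33, 0.37, 0.52, 0.58, 0.73, 0.77, 0.95]
t = studentized_residuals(x2, y2)
flagged = [i for i, v in enumerate(t) if abs(v) > 2]  # cutoff of 2 is an assumption
pcc_raw = pearson(x2, y2)
pcc_clean = pearson(without(x2, set(flagged)), without(y2, set(flagged)))

print(f"Mahalanobis outlier {far}: PCC {pcc_with_far:.2f} -> {pcc_without_far:.2f}")
print(f"Regression outliers {flagged}: PCC {pcc_raw:.2f} -> {pcc_clean:.2f}")
```

In the first invented pair, a single distant point that lies close to the fitted line produces a high PCC in an otherwise uncorrelated bulk, so removing it lowers the PCC. In the second, a single point far from the line depresses the PCC of a genuine association, so removing it raises the PCC. This mirrors the direction of the effects described in the legend but says nothing about their size in ProteomeHD.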