Fig. 1: Development of PICNIC (Proteins Involved in CoNdensates In Cells) algorithm. | Nature Communications


From: PICNIC accurately predicts condensate-forming proteins regardless of their structural disorder across organisms


a To construct the training dataset, we annotated the known condensate-forming proteins from CD-CODE (ref. 34) (positive dataset, members of biomolecular condensates) on the protein-protein interaction (PPI) network and excluded their first connections (proteins that interact with condensate proteins). The remaining proteins comprised the negative dataset. A gradient boosting machine was used to distinguish the two classes of proteins: members of biomolecular condensates and proteins that are not involved in any type of biomolecular condensate. b Sequence-, structure- and function-based features of PICNIC. Sequence-based features included sequence complexity, disorder score (IUPred) and features based on amino acid co-occurrences. Structure-based features derived from AlphaFold2 models included the pLDDT score, a per-residue measure of local confidence on a scale from 0 to 100 (colored on the structure). We annotated secondary structure elements (SSE) from the 3D protein structures using STRIDE and calculated all possible triads of the form (AA, SSE, pLDDT). c Amino acid occurrences in the features of the PICNIC model show that leucine and lysine contribute most to the model predictions. d Feature importance of PICNIC is consistent across folds (N = 10). The boxes show the quartiles of the dataset: the first black horizontal line of the box is the first quartile (25%), the second is the second quartile (median) and the third is the third quartile (75%). The whiskers extend to points that lie within 1.5 IQRs (interquartile ranges) of the lower and upper quartiles; outliers are displayed as circles. Features fall into four groups: those based on AlphaFold2 structures (light blue), disorder (pink), complexity (dark red) and amino acid co-occurrences (blue). Source data are provided as a Source Data file.
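
The dataset construction in panel a can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `build_training_sets`, the edge-list input format and the toy protein identifiers are hypothetical, and the sketch assumes that known condensate proteins absent from the PPI network are simply dropped from the positive set.

```python
from typing import Dict, Iterable, Set, Tuple


def build_training_sets(
    edges: Iterable[Tuple[str, str]],
    condensate_proteins: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split the proteins of a PPI network into positive and negative sets.

    Hypothetical helper: positives are known condensate proteins found on
    the network; negatives are all other proteins except the direct
    interactors (first connections) of any positive.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors: Dict[str, Set[str]] = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    all_proteins = set(neighbors)

    # Assumption: only condensate proteins present on the network are kept.
    positives = condensate_proteins & all_proteins

    # First connections: proteins interacting directly with a positive.
    first_connections: Set[str] = set()
    for protein in positives:
        first_connections |= neighbors[protein]
    first_connections -= positives

    # Everything else on the network forms the negative set.
    negatives = all_proteins - positives - first_connections
    return positives, negatives


# Toy network with made-up identifiers: P1 is the only known condensate protein.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]
positives, negatives = build_training_sets(toy_edges, {"P1"})
```

In the toy network, P2 interacts directly with the condensate protein P1, so it is excluded from both sets. Excluding first connections in this way keeps likely but unannotated condensate members out of the negative class.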
