Fig. 2: Definition of a nuclear reference proteome.

a Protein correlations with histones H3.1 and H4 across cancer cell lines. The scatter plot shows the Pearson correlation coefficients (R) of proteins vs. H3.1 (x axis) and H4 (y axis). Each dot represents a protein covered by the CCLE proteome dataset. Proteins that correlate strongly with both histones (R > 0.5) are colored in red. b Gene ontology cellular component analysis of top histone correlating proteins from (a). Significantly over-represented cellular components were sorted by the number of top histone correlators localized to the respective component. The blue bars show the number of top histone correlators covered by each component. The red bars show the cumulative sum of the top histone correlators covered. c Mutual correlations between top histone correlating proteins from (a). The heatmap represents Pearson correlation coefficients of protein levels, organized by hierarchical clustering. An enlarged version with protein labels is available in Supplementary Fig. 2. d Correlation of protein vs. mRNA levels across cancer cell lines. The y axis indicates the Pearson correlation coefficient of each protein in the CCLE proteomics dataset vs. its encoding mRNA. The x axis indicates the mean Pearson correlation coefficient of each protein vs. histones H3.1 and H4. Top histone correlating proteins from (a) are colored in red. The red line represents the quantile regression. e Principal component analysis of top histone correlator co-expression. Pearson correlation coefficients from (c) were simplified by PCA. The PCA correlation plot shows the first two principal component correlations. Each dot represents one top histone correlating protein. Dot positions reflect the clustering behavior in (c). Colors indicate the protein classification by complex and function. f Selection of proteins for the representative nuclear index. The y axis indicates the protein-mRNA correlation (same as the y axis in (d)). The x axis indicates the first principal component of top histone correlator co-expression (same as the x axis in (e)). Proteins with low protein-mRNA correlation (R < 0.3) and a similar co-expression spectrum (PC1 > 0.5), marked in red, were included in the nuclear index. g Components of the representative nuclear index. Proteins selected in (f) were clustered by STRING and manually grouped into 10 categories with Cytoscape. h Lineage variation of nuclear index components. Mean levels of each protein were calculated by cell lineage. The relative variation between lineages is represented as the standard deviation of lineage means divided by the mean of lineage means. i Scheme for the calculation of the nuclear index. To ensure a balanced contribution of all nuclear index protein categories, the nuclear index is calculated in two steps. First, individual protein expression values are converted to representative category values. Second, the representative category values are integrated. This method eliminates the effect of potential large expression changes of individual proteins by using median values across diverse protein categories. Cell lines covered by both CCLE RNA-Seq and proteomics datasets (N = 373) were used. Protein expression data were lineage-centered.
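The two-step calculation in (i) and the relative variation in (h) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that the median is used at both steps (the legend states only that median values across categories are used) and that the standard deviation is the sample standard deviation. The protein names, category names, and values are hypothetical placeholders.

```python
from statistics import mean, median, stdev


def nuclear_index(expression, categories):
    """Two-step summary for one cell line: proteins -> categories -> index.

    expression: dict mapping protein name to its lineage-centered level
    categories: dict mapping category name to a list of protein names
    Assumption: the median is the summary statistic at both steps.
    """
    category_values = []
    for proteins in categories.values():
        observed = [expression[p] for p in proteins if p in expression]
        if observed:  # skip categories with no quantified protein
            category_values.append(median(observed))
    if not category_values:
        raise ValueError("no nuclear index protein was quantified")
    return median(category_values)


def lineage_variation(lineage_means):
    """Standard deviation of lineage means divided by their mean.

    Assumption: sample standard deviation, applied to non-centered
    levels whose mean is not zero.
    """
    return stdev(lineage_means) / mean(lineage_means)


# Hypothetical data: P3 is a strong outlier within its category.
expression = {"P1": 0.2, "P2": 0.4, "P3": 5.0, "P4": -0.1, "P5": 0.1, "P6": 0.3}
categories = {"cat_A": ["P1", "P2", "P3"], "cat_B": ["P4", "P5"], "cat_C": ["P6"]}

index = nuclear_index(expression, categories)
variation = lineage_variation([1.0, 2.0, 3.0])
print(index, variation)
```

In this toy example the outlier P3 does not shift the index, because it affects neither the median of its own category nor the median across categories.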