Extended Data Fig. 2: Self-supervised embedding of multiple data modalities.

From: Multimodal cell maps as a foundation for structural and functional genomics

a) Architecture of the self-supervised multimodal embedding model. Columns of squares represent feature vectors, with the dimensionality written just below each column; regions enclosed by dotted lines represent neural networks, with their layers described. Protein coordinates in the joint multimodal embedding (z) are used to compute pairwise protein–protein similarities in the subsequent panels (cosine similarity function).

b) Distribution of similarities for protein pairs with a ‘high-confidence interaction’ denoted in the STRING database (green), in comparison to all other protein pairs (grey).

c) As in (b), but for protein pairs in the same CORUM complex.

d) As in (b), but for protein pairs that yield highly similar transcriptional profiles (top 1% of pairs) when genetically disrupted by CRISPR, drawn from a recent perturb-seq functional genomics study (ref. 80). **** denotes a significant difference, p < 0.0001 by one-sided Wilcoxon rank-sum test.

e) Different protein embedding approaches (coloured points, Methods) are evaluated by their degree of enrichment (x-axis) across orthogonal functional and physical interaction resources (y-axis; resources from panels b–d above). The supervised random forest was trained using the Gene Ontology (Methods). Enrichment is computed using Cliff’s Delta (1,000 samplings of 1,000 protein pairs with replacement, Methods), yielding values in the range [−1, 1], with positive values indicating enrichment above random expectation. Error bars denote standard deviations across the 1,000 bootstrap resamplings, centred on the mean. * denotes a significant difference in comparison with the self-supervised multimodal embedding results (two-tailed p < 0.05 across bootstrap resamplings).
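As a minimal sketch only (not the authors’ released code), the snippet below illustrates the computations the legend describes: cosine similarity over the joint embedding z (panel a), the one-sided Wilcoxon rank-sum comparison of similarity distributions (panels b–d), and the bootstrapped Cliff’s Delta enrichment (panel e). The input names (z, pos_sims, neg_sims) are hypothetical placeholders, and resampling precomputed similarity values rather than protein pairs is an assumption made here for brevity.

```python
# Minimal sketch of the legend's analyses; not the authors' code.
# Assumes `z` is an (n_proteins, d) array of joint multimodal embeddings;
# `pos_sims`/`neg_sims` are hypothetical placeholders for cosine similarities
# of annotated pairs (STRING/CORUM/perturb-seq) vs. all other pairs.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)


def cosine_similarity_matrix(z: np.ndarray) -> np.ndarray:
    """Pairwise protein-protein cosine similarities from embedding z (panel a)."""
    z_unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z_unit @ z_unit.T


def wilcoxon_one_sided(pos_sims: np.ndarray, neg_sims: np.ndarray) -> float:
    """One-sided Wilcoxon rank-sum p-value, as reported in panels b-d."""
    return ranksums(pos_sims, neg_sims, alternative="greater").pvalue


def cliffs_delta(x: np.ndarray, y: np.ndarray) -> float:
    """Cliff's Delta in [-1, 1]; positive values mean x tends to exceed y."""
    greater = np.sum(x[:, None] > y[None, :])
    less = np.sum(x[:, None] < y[None, :])
    return (greater - less) / (x.size * y.size)


def bootstrap_enrichment(pos_sims, neg_sims, n_boot=1000, n_pairs=1000):
    """Panel e: mean and s.d. of Cliff's Delta over 1,000 resamplings of
    1,000 similarity values each, drawn with replacement."""
    deltas = np.array([
        cliffs_delta(
            rng.choice(pos_sims, size=n_pairs, replace=True),
            rng.choice(neg_sims, size=n_pairs, replace=True),
        )
        for _ in range(n_boot)
    ])
    return deltas.mean(), deltas.std()
```

Note that scipy.stats.ranksums implements the Wilcoxon rank-sum test named in the legend; the exact sampling unit used by the authors (protein pairs versus precomputed similarities) is not specified here and is treated as an assumption.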
