Fig. 2: Contextualized protein embedding analysis and comparison with concepts in natural language modeling.
From: Genomic language model predicts protein co-regulation and function

A Upon contextualization, a word can be mapped to embedding space. For many words, the semantic meaning varies across types of literature, and therefore their contextualized embeddings cluster by source text type. The figure was created for qualitative visualization. B The input protein embedding (the output of ESM2, i.e. the context-free protein embedding) is the same across all occurrences of the protein in the database. Upon contextualization with gLM, contextualized protein embeddings of the same protein (last hidden layer of gLM at inference time) cluster by biome type, analogous to the source text type in natural language (A). Contextualization of 30 other multi-biome MGYPs can be found in Supplementary Fig. 3. C A word's meaning upon contextualization varies across a continuous spectrum and can remain ambiguous even with contextualization (e.g., double entendre). D Reaction 1, carried out by the MCR complex, can proceed either backward (methanotrophy) or forward (methanogenesis). E Principal component analysis (PCA) of context-free protein embeddings of McrA sequences in genomes (total explained variance = 0.56), colored by metabolic classification of the organism (ANME, methanogen) based on previous studies and labeled by class-level taxonomy. F PCA of contextualized McrA embeddings (total explained variance = 0.68), where gLM embeddings cluster by the direction of Reaction 1 that the MCR complex is likely to carry out. G Geometric relationship between contextualized protein embeddings based on the semantic closeness of words. H Input (context-free) protein embeddings of Cas1, Cas2, lipopolysaccharide synthases (LPS) and polyketide synthases (PKS), showing clustering based on structural and sequence similarity. I Clustering of contextualized protein embeddings, in which phage defense proteins (Cas1 and Cas2) cluster together and biosynthetic gene products (LPS and PKS) cluster together. Source data are provided as a Source Data file.