Fig. 1 | Scientific Data

Fig. 1

From: SARS-CoV-2 receptor-binding domain deep mutational AlphaFold2 structures

Sketch of protein representations and their projections. Sequence space 𝔽, structure space 𝕊, adjacency matrix space 𝔸, and phenotype space ℙ are the spaces of all possible proteins for a given representation. 𝕊 contains 𝔽 and 𝔸; formally, 𝔽, 𝔸 ⊂ 𝕊. Each protein has a FASTA one-hot-encoded representation F ∈ 𝔽, a PDB file S ∈ 𝕊, an adjacency projection of the PDB file A ∈ 𝔸, and some measured phenotypic properties (function) P ∈ ℙ. We compare the projections 𝔽 and 𝔸 with respect to how a model f learns from these representations to make predictions about ℙ. (a) Predicting structure with AlphaFold2 (AF2). (b) Learning to predict protein-protein binding affinities from FASTA sequences. In the limit of huge amounts of genomic and phenotype data, such a model may even build an internal representation of protein interaction dynamics rich enough that explicit structure modeling (the top path of the loop) is not required [41]. (c) Creation of adjacency matrices from PDB structures. Representations in 𝔸 carry no chemical information, so they can be used to test whether the AF2 projection to 𝕊 actually captured geometric signal that can be leveraged for phenotype prediction tasks. This representation has the added advantage of being rotation agnostic. (d) Learning to predict protein-protein binding affinities with adjacency matrices. (e) Representations in 𝕊 (PDB files) contain both chemical and geometrical information. An end goal could be to use this representation to build models that predict ℙ, in a similar fashion to previously proposed methods [42,43,44]. However, this pathway is only worth pursuing if we first validate that (d) is possible to some extent.
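
As an illustration of the two representations compared in the caption, the following minimal Python sketch builds a one-hot encoding of a sequence and a binary adjacency (contact) matrix from residue coordinates, then checks that the adjacency matrix is unchanged by a rigid rotation. The 8 Å cutoff, the use of one coordinate per residue (for example C-alpha atoms), the toy coordinates, and all function names are assumptions made for this example; the caption does not specify how the adjacency projection is computed.

```python
import math

# The 20 standard amino acids, ordered by one-letter code (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


def adjacency(coords, cutoff=8.0):
    """Binary contact map: 1 where two distinct residues lie within `cutoff` angstroms.

    The cutoff value is an assumption for illustration, not taken from the source.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def rotate_z(coords, angle):
    """Rotate every point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Hypothetical coordinates (angstroms) for a four-residue toy chain.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

F = one_hot("ACDE")                                        # sequence representation
A = adjacency(toy_coords)                                  # geometry-only representation
A_rotated = adjacency(rotate_z(toy_coords, math.pi / 3))   # same structure, rotated

print(A == A_rotated)
```

Because the adjacency matrix depends only on pairwise distances, rotating (or translating) the structure leaves it unchanged, which is the rotation-agnostic property noted for panel (c). It also discards residue identity, so any predictive signal a model extracts from it is purely geometric.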
