Fig. 3: ProtBFN sequences show broad coverage of the training distribution in embedding space of a protein language model. | Nature Communications

From: Protein sequence modelling with Bayesian flow networks

To visualise distributions of protein sequences, the mean ESM-2 embedding is calculated for 10 000 samples from each of ProtBFN, ProtGPT2 and EvoDiff and projected into two dimensions using the UMAP algorithm (ref. 87). The projection is calculated on the union of the UniProtCC and UniRef50 training distributions, and each method's samples are overlaid on its respective training distribution. Source data are provided as a Source Data file.
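
The workflow described in the caption can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes that "mean embedding" means averaging per-residue embeddings over sequence positions, it uses random arrays in place of real ESM-2 outputs, and it swaps in a small PCA class (`PCA2D`, a hypothetical helper) for UMAP so that the example needs only NumPy. With the umap-learn package installed, `umap.UMAP(n_components=2)` exposes the same `fit`/`transform` pattern and could take the place of `PCA2D`. All array names and sizes are placeholders.

```python
import numpy as np


def mean_pool(residue_embeddings, mask):
    """Average per-residue embeddings over real (non-padding) positions.

    residue_embeddings: array of shape (batch, length, dim)
    mask: array of shape (batch, length), 1 for residues and 0 for padding
    """
    mask = mask[..., None].astype(residue_embeddings.dtype)
    summed = (residue_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


class PCA2D:
    """Stand-in 2-D projection (PCA, not UMAP) with a fit/transform interface."""

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[:2]  # top two principal directions
        return self

    def transform(self, x):
        return (x - self.mean_) @ self.components_.T


rng = np.random.default_rng(0)
dim = 16  # placeholder embedding width, not the real ESM-2 size

# Synthetic stand-ins for per-sequence mean embeddings of the two training sets.
train_a = rng.normal(loc=0.0, size=(200, dim))
train_b = rng.normal(loc=1.0, size=(200, dim))

# Synthetic stand-ins for embeddings of sequences sampled from each model.
samples = {
    "model_a": rng.normal(loc=0.0, size=(50, dim)),
    "model_b": rng.normal(loc=1.0, size=(50, dim)),
}

# Fit the projection once on the union of the training sets, then reuse it,
# so that every method is drawn in the same two-dimensional coordinates.
projector = PCA2D().fit(np.concatenate([train_a, train_b], axis=0))
coords = {name: projector.transform(x) for name, x in samples.items()}
```

Fitting once on the union and only transforming the generated samples keeps the axes shared across panels, which is what makes the overlays comparable.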
