Fig. 3: ProtBFN sequences show broad coverage of the training distribution in embedding space of a protein language model.
From: Protein sequence modelling with Bayesian flow networks

To visualise distributions of protein sequences, the mean embedding of the ESM-2 model is calculated for 10 000 samples from each of ProtBFN, ProtGPT2 and EvoDiff and projected into two dimensions using the UMAP algorithm [87]. The projection is calculated using the union of the UniProtCC and UniRef50 training distributions, and each method is overlaid with its respective training distribution. Source data are provided as a Source Data file.
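
The pooling step in this caption can be illustrated with a minimal sketch. It assumes that "mean embedding" means averaging the per-residue ESM-2 vectors over sequence positions, which the caption does not state explicitly. The function names `mean_embedding` and `pool_all` are hypothetical, and the toy numbers stand in for real model outputs. The pooled vectors for the union of the two training sets would then be passed to a UMAP implementation to compute the projection, and the pooled sample vectors mapped with it; that step is left out here because it needs an external library.

```python
from typing import List, Sequence


def mean_embedding(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (length L, dimension D) into one D-vector.

    Assumption: the mean is taken over residue positions.
    """
    if not residue_embeddings:
        raise ValueError("sequence has no residues to average")
    dim = len(residue_embeddings[0])
    totals = [0.0] * dim
    for vec in residue_embeddings:
        if len(vec) != dim:
            raise ValueError("inconsistent embedding dimension")
        for i, value in enumerate(vec):
            totals[i] += value
    count = len(residue_embeddings)
    return [total / count for total in totals]


def pool_all(batch: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Pool every sequence in a batch to a single fixed-size vector."""
    return [mean_embedding(seq) for seq in batch]


# Toy stand-ins for per-residue embeddings (2 dimensions for readability).
uniprotcc = [[[1.0, 2.0], [3.0, 4.0]]]   # one sequence, two residues
uniref50 = [[[0.0, 0.0], [2.0, 2.0]]]    # one sequence, two residues
samples = [[[4.0, 4.0]]]                 # one generated sequence, one residue

# Reference set: union of both training distributions, as in the caption.
reference = pool_all(uniprotcc) + pool_all(uniref50)
# Query set: generated samples, to be mapped with the same projection.
queries = pool_all(samples)
```

Pooling to a fixed-size vector is what lets sequences of different lengths share one embedding space before dimensionality reduction.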