Fig. 1: The Protein Set Transformer (PST) architecture and training regime.

From: Protein Set Transformer: a protein-based genome language model to power high-diversity viromics

a General overview of the graph-based PST for learning genome representations from contextualized protein embeddings. Each protein is represented by an ESM2 protein embedding. PST internally represents each genome as a graph consisting of multiple subgraphs of fully connected, locally adjacent proteins; the size of each subgraph is a tuned hyperparameter. PST uses multi-head attention both to contextualize protein embeddings within each genome and to learn per-protein weights for a weighted average over each genome. See Supplementary Fig. 1 for a modeling-centric view of PST. Both protein and genome representations can be used for an appropriate downstream task. b Triplet mining workflow, which includes the data augmentation technique shown in c, PointSwap sampling. For each training genome, a positive genome is identified as the genome at minimum Chamfer distance in the ESM2 embedding space. A negative, less related, genome is then chosen from the PST embedding space as the next-farthest genome beyond the positive. We augment the training data by creating hybrid genomes that swap similar protein vectors between each genome and its positive genome. d Pictorial representation of the triplet loss objective function used to train PST on viral genomes. The operational objective of triplet loss is to embed each genome closer to its positive genome than to its negative genome, within a tunable distance margin.
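
As a concrete illustration of the weighted average described in a, the sketch below pools contextualized protein embeddings into a single genome embedding using softmax-normalized attention scores. This is a single-head simplification of PST's multi-head attention, written for clarity; the function and argument names are ours, not from the PST codebase.

```python
import numpy as np

def attention_pool(protein_embeddings: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Pool per-protein embeddings into one genome embedding.

    protein_embeddings: (n_proteins, dim) contextualized embeddings.
    scores: (n_proteins,) raw attention logits, one per protein.
    """
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ protein_embeddings      # (dim,) weighted average
```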
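
The positive-genome selection in b rests on the Chamfer distance between two genomes' sets of protein embeddings. A minimal sketch follows, assuming squared Euclidean distances and mean aggregation; the paper's exact normalization may differ.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Chamfer distance between two genomes' protein embedding sets.

    a: (n, d) protein embeddings of genome A.
    b: (m, d) protein embeddings of genome B.
    Each protein is matched to its nearest neighbor in the other genome,
    and the nearest-neighbor distances are averaged in both directions.
    """
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```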
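
PointSwap sampling (c) can be sketched as follows: each anchor protein is matched to its most similar protein in the positive genome, and a random subset of those pairs is swapped to form a hybrid genome. The `swap_rate` parameter and the random-subset scheme here are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def pointswap(anchor: np.ndarray, positive: np.ndarray,
              swap_rate: float = 0.5,
              rng: np.random.Generator | None = None) -> np.ndarray:
    """Build a hybrid genome by swapping similar proteins with the positive.

    anchor: (n, d) protein embeddings of the training genome.
    positive: (m, d) protein embeddings of its positive genome.
    swap_rate: assumed fraction of anchor proteins replaced by their match.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Nearest positive protein for every anchor protein.
    d2 = ((anchor[:, None, :] - positive[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    # Replace a random subset of anchor proteins with their best matches.
    mask = rng.random(anchor.shape[0]) < swap_rate
    hybrid = anchor.copy()
    hybrid[mask] = positive[nearest[mask]]
    return hybrid
```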
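
The objective in d is the triplet margin loss. In its standard form (symbols ours), with f the genome encoder, d a distance in the PST embedding space, m the tunable margin, and (a, p, n) an anchor/positive/negative triplet:

$$
\mathcal{L}(a, p, n) = \max\bigl(d\bigl(f(a), f(p)\bigr) - d\bigl(f(a), f(n)\bigr) + m,\ 0\bigr)
$$

The loss reaches zero once the anchor genome is closer to its positive than to its negative by at least the margin m, matching the caption's description of the training objective.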
