Fig. 4: PST expands functional annotation of hypothetical proteins. | Nature Communications

From: Protein Set Transformer: a protein-based genome language model to power high-diversity viromics

a, b Structural alignments with the HK97 major capsid protein (PDB: 2FS3, gray) for a protein annotated by VOG as unknown (a, “IMGVR_UViG_3300038749_000016 | 3300038749 | Ga0423190_00012 | 260998-304157_27”) and another undetected by VOG (b, “IMGVR_UViG_2687453601_000002 | 2687453601 | 2687454426 | 1161639-1205965_50”). The red cartoon diagrams are the query proteins from our dataset, chosen because they were the most similar to the HK97 capsid protein (2FS3) in each category. c, d Among proteins from the IMG/VR v4 (c) and MGnify (d) test datasets that are unannotated by VOG but cluster with annotated capsid proteins, the proportion that have detectable structural homology with known capsid folds. Structural homology was detected by searching the Protein Data Bank with foldseek. e–h The proportion of proteins unannotated by VOG whose nearest neighbors in embedding space are annotated. The colors indicate the protein/ORF embedding. Nearest neighbors were identified using angular similarity after L2-normalizing the protein embeddings. e, g An unannotated protein was counted as a hit if any of its nearest neighbors, up to the current number of nearest neighbors, was annotated. f, h Hits were counted as in (e, g), with the additional constraint that all of the current set of nearest neighbors must belong to the same VOG functional category. Unannotated proteins were not used to penalize the score. The rows indicate the test set used: IMG/VR v4 (e, f) and MGnify (g, h).
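The hit criteria in panels e–h can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy embeddings, and the category labels are hypothetical. It also assumes one reading of the caption, namely that unannotated neighbors are simply ignored when checking whether the neighbors share a functional category, and that angular similarity is defined as 1 − arccos(cosine)/π.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular_similarity(u, v):
    """Angular similarity of two unit vectors: 1 - arccos(cos) / pi (assumed definition)."""
    cos = sum(a * b for a, b in zip(u, v))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return 1.0 - math.acos(cos) / math.pi


def neighbor_hit_rates(embeddings, categories, max_k):
    """Return (any_rates, same_rates) for k = 1..max_k.

    categories[i] is None for an unannotated protein, else a category label.
    any_rates[k-1]:  fraction of unannotated proteins with at least one
                     annotated protein among their k nearest neighbors.
    same_rates[k-1]: as above, but the annotated neighbors must all share
                     one category; unannotated neighbors are ignored
                     (an assumed reading of "not used to penalize").
    """
    unit = [l2_normalize(e) for e in embeddings]
    queries = [i for i, c in enumerate(categories) if c is None]
    if not queries:
        raise ValueError("no unannotated proteins to score")

    any_hits = [0] * max_k
    same_hits = [0] * max_k
    for q in queries:
        others = [j for j in range(len(unit)) if j != q]
        # Rank all other proteins from most to least similar to the query.
        ranked = sorted(
            others,
            key=lambda j: angular_similarity(unit[q], unit[j]),
            reverse=True,
        )
        for k in range(1, max_k + 1):
            labels = {categories[j] for j in ranked[:k] if categories[j] is not None}
            if labels:
                any_hits[k - 1] += 1
            if len(labels) == 1:
                same_hits[k - 1] += 1

    n = len(queries)
    return [h / n for h in any_hits], [h / n for h in same_hits]


# Toy data: proteins 0 and 3 are unannotated; labels are illustrative only.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
categories = [None, "capsid", "tail", None, "capsid"]
any_rates, same_rates = neighbor_hit_rates(embeddings, categories, max_k=2)
print(any_rates, same_rates)
```

In this toy example both unannotated proteins have an annotated nearest neighbor at k = 1, but at k = 2 their two nearest neighbors span different categories, so the stricter criterion no longer counts them as hits.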
