Fig. 2: Summary of Deep Novel Mutation Search (DNMS).

DNMS starts with ① an input sequence being fed into ProtBERT ②. From ProtBERT, DNMS extracts ③ the attention matrix A, ④ a protein semantic embedding Z, and ⑤ the output posterior probability. DNMS calculates ③' Attention Change, ④' Semantic Change, and ⑤' Grammaticality for every single-point amino acid substitution in the input sequence. In this example, two mutations are visualized at position i = 4, where the input sequence has token L, i.e., \({x}_{i}\) = L. The two mutations are L4A and L4E, denoted by \({\tilde{x}}_{i}\). ⑤' Grammaticality, denoted \(p({\tilde{x}}_{i}| {{{{\bf{X}}}}}_{k})\), is calculated for each of the two mutations from the posterior probability output by ProtBERT ⑤; it is a measure of the statistical patterns learned by the fine-tuned ProtBERT model. For each mutation, we pass into ProtBERT the mutated sequence \({\tilde{{{{\bf{X}}}}}}_{k}[{\tilde{x}}_{i}]\), which is the input sequence with the mutation introduced at position i. ③' We obtain the attention matrix for the mutated sequence, \({{{\bf{A}}}}[{\tilde{x}}_{i}]\), and calculate Attention Change (change from A), ΔA, which is a measure of similarity. ④' We obtain a protein semantic embedding for the mutated sequence, \({{{\bf{Z}}}}[{\tilde{x}}_{i}]\), and calculate Semantic Change (change from Z), ΔZ, an additional measure of similarity. ⑦ DNMS combines the rankings of Semantic Change, Grammaticality, and Attention Change, prioritizing high Grammaticality together with low Semantic Change and low Attention Change. Future novel mutations are discovered using \({\mathtt{DNMS}}({\tilde{x}}_{i};{{{{\bf{X}}}}}_{k})\).
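The per-mutation quantities and the final rank combination described above can be sketched as follows. This is a minimal illustration, not the paper's exact definitions: the l1 distance for Semantic Change, the Frobenius norm for Attention Change, and the plain rank sum for combining the three rankings are all assumptions made for the sketch.

```python
import numpy as np

def semantic_change(Z, Z_mut):
    """ΔZ: distance between the wild-type embedding Z and the mutant
    embedding Z[x~i]. Assumption: l1 distance over the embedding."""
    return np.abs(np.asarray(Z) - np.asarray(Z_mut)).sum()

def attention_change(A, A_mut):
    """ΔA: distance between the wild-type attention matrix A and the
    mutant attention matrix A[x~i]. Assumption: Frobenius norm."""
    return np.linalg.norm(np.asarray(A) - np.asarray(A_mut))

def dnms_rank(grammaticality, delta_z, delta_a):
    """Combine the three per-mutation quantities into one score by
    summing ranks: high Grammaticality is ranked best (descending),
    low ΔZ and low ΔA are ranked best (ascending). Lower combined
    score means a more strongly prioritized mutation."""
    g = np.asarray(grammaticality, dtype=float)
    dz = np.asarray(delta_z, dtype=float)
    da = np.asarray(delta_a, dtype=float)
    # argsort of argsort yields 0-based ranks; negate g so that the
    # highest grammaticality receives rank 0.
    rank_g = np.argsort(np.argsort(-g))
    rank_dz = np.argsort(np.argsort(dz))
    rank_da = np.argsort(np.argsort(da))
    return rank_g + rank_dz + rank_da

# Example: mutation 0 has high grammaticality and small ΔZ, ΔA,
# so it receives the best (lowest) combined score.
scores = dnms_rank([0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.1, 0.9, 0.5])
```

In this sketch, the mutation with the lowest combined score is the one DNMS would surface as the most likely future novel mutation.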