Fig. 1: Overview and evaluation of the PepMLM model.
From: Target sequence-conditioned design of peptide binders using masked language modeling

a, The architecture of the PepMLM model. Fine-tuned from ESM-2, the model takes the target protein sequence together with a masked binder region during training. During generation, it accepts a target protein sequence and mask tokens to produce peptides of a specified length. b, Perplexity distribution comparison. Perplexity values were calculated for test and designed peptides across the target proteins in the test set. c, Density distribution of the log perplexity values for target–peptide pairs, covering test peptides, PepMLM-650M-designed peptides, ESM-2-650M-designed peptides and random peptides. d, In silico hit rate assessment of RFdiffusion (left) and PepMLM (right). Using AlphaFold-Multimer, ipTM scores were computed for both the designed and test peptides in complex with the target protein sequence. Entries are ordered by the ipTM scores of the test set peptides. A hit is a designed peptide whose ipTM score is ≥ that of the corresponding test peptide. e, Binding specificity analysis through permutation tests. The distribution of perplexity (PPL) scores for matched target–binder pairs (blue) is compared with that of randomly shuffled mismatched pairs (red). Each target’s binder was shuffled 100 times to generate the mismatched distribution. Statistical significance was determined using a t-test (P < 0.001). f, Structural comparison of computationally designed and experimental peptide binders in complex with their target proteins. Target proteins (gray) are shown in complex with PepMLM-designed binders (red) and experimental test binders (blue), with contact residues highlighted in corresponding colors. Top, mouse H-2Kb MHC complex (PDB ID: 2OI9) with designed peptide PSLGSVPYV (ipTM: 0.9) and test peptide QLSPFPFDL (ipTM: 0.9).
Bottom, human tyrosine kinase complex (PDB ID: 1LCK) with designed peptide PPAEEIPP (ipTM: 0.82) and test peptide EGQQPQPA (ipTM: 0.68). g, Frequency distribution of individual amino acids among peptide binders (n = 203), comparing the test set (blue), PepMLM-designed sequences (red) and ESM-2-650M-designed sequences (green). h, Amino-acid-specific generation distribution at contact positions (8-Å threshold). The heatmap shows the percentage of designed amino acids (y axis) given each amino acid in test binders (x axis).
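
The hit-rate criterion in panel d and the binder shuffling in panel e can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `hit_rate` and `mismatched_pairs`, the fixed random seed and the choice to drop shuffled pairs that happen to recreate an original pairing are all assumptions made for illustration. Computing the ipTM and PPL scores themselves requires AlphaFold-Multimer and the language model, and is outside the scope of the sketch.

```python
import random
from typing import List, Sequence, Tuple


def hit_rate(designed_iptm: Sequence[float], test_iptm: Sequence[float]) -> float:
    """Fraction of targets whose designed peptide has ipTM >= the test peptide's ipTM."""
    if len(designed_iptm) != len(test_iptm):
        raise ValueError("expected one designed and one test score per target")
    if not designed_iptm:
        raise ValueError("no targets given")
    hits = sum(d >= t for d, t in zip(designed_iptm, test_iptm))
    return hits / len(designed_iptm)


def mismatched_pairs(
    targets: Sequence[str],
    binders: Sequence[str],
    n_shuffles: int = 100,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Shuffle binders across targets n_shuffles times and collect mismatched pairs.

    Assumption: a pair is dropped when the shuffle returns a binder identical
    to the one originally matched to that target.
    """
    if len(targets) != len(binders):
        raise ValueError("expected one binder per target")
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    shuffled = list(binders)
    pairs: List[Tuple[str, str]] = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for target, original, candidate in zip(targets, binders, shuffled):
            if candidate != original:
                pairs.append((target, candidate))
    return pairs


# The two examples from panel f: 2OI9 (0.9 vs 0.9) and 1LCK (0.82 vs 0.68).
example_rate = hit_rate([0.9, 0.82], [0.9, 0.68])
```

Under this criterion both panel f examples count as hits, because a tie (0.9 versus 0.9) satisfies the ≥ condition. Each mismatched pair would then be scored with the same PPL function as the matched pairs before the two distributions are compared.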