Fig. 2: Comparison of lasso peptide substrate tolerance prediction using different embeddings.
From: LassoESM: a tailored language model for enhanced lasso peptide property prediction

A Core peptide sequence of fusilassin overlaid with substrate tolerance information. The n value is the number of sequences tested at the indicated position(s) by cell-free biosynthesis (CFB), and the percentage is the share of those n sequences tolerated by the biosynthetic enzymes. Bold indicates the macrolactam-forming residues. B AUROC score and balanced accuracy of the SVM model for the fusilassin variant dataset trained on embeddings from one-hot encoding (blue), VanillaESM (pink), PeptideESM (yellow), and LassoESM (green). Data are presented as mean ± SD. n = 5 independent repeats of 10-fold cross-validation. VanillaESM vs. PeptideESM: AUROC, p = 0.0377; balanced accuracy, p = 0.0012. VanillaESM vs. LassoESM: AUROC, p = 0.0006; balanced accuracy, p = 0.0119. p-values were calculated using a two-sided t-test. *p < 0.05, **p < 0.01, ***p < 0.001. C Core peptide sequence of MccJ25 overlaid with substrate tolerance information. Each position was site-saturated with all other proteinogenic amino acids (n = 20) prior to heterologous expression (ref. 7). The percentage of sequences that were cyclized and exported is given. Bold indicates the macrolactam-forming residues. D Same as (B), except for MccJ25. Data are presented as mean ± SD. n = 5 independent repeats of 10-fold cross-validation. VanillaESM vs. PeptideESM: AUROC, p = 0.8759; balanced accuracy, p = 0.7751. VanillaESM vs. LassoESM: AUROC, p = 0.0006; balanced accuracy, p = 0.0013. p-values were calculated using a two-sided t-test. E AUROC score and balanced accuracy of the SVM model trained on a fraction of the fusilassin variant training data using LassoESM embeddings and evaluated on the remaining data. Data are presented as mean ± SD. n = 10 random seeds, with a fraction of the dataset randomly selected for training in each seed. F Model accuracy at different predicted probabilities of the SVM model trained on the fusilassin variant dataset using LassoESM embeddings. Data are presented as mean ± SD. n = 10 random seeds. For each random seed, 80% of the fusilassin variant dataset was used for training and the remaining 20% for testing.
The source data are available in the Source data file.
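The two metrics reported in panels B, D, and E, and the "5 independent repeats of 10-fold cross-validation" scheme, can be illustrated with a minimal standard-library Python sketch. This is not the authors' code: the function names (`balanced_accuracy`, `auroc`, `repeated_kfold`, `mean_sd`) are hypothetical, the splits are plain shuffled folds (whether the original analysis stratified by class is not stated in the caption), and the use of the sample standard deviation for "SD" is an assumption.

```python
import random
import statistics


def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity (recall on label 1) and specificity (recall on label 0)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return (sensitivity + specificity) / 2


def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (repeat, train_idx, test_idx); each repeat reshuffles, then cuts k folds."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for f in range(k):
            test = order[f::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield r, train, test


def mean_sd(values):
    """Summarize per-repeat scores as (mean, sample SD). Sample SD is an assumption."""
    return statistics.mean(values), statistics.stdev(values)


# Toy labels (1 = tolerated, 0 = not tolerated) and made-up classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(balanced_accuracy(y_true, y_pred), auroc(y_true, scores))
```

In a full pipeline, each `(train_idx, test_idx)` pair would be used to fit and score a classifier, the fold scores would be averaged within each repeat, and `mean_sd` would then be applied to the five per-repeat averages.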