Fig. 2: General features of generated AMP candidate sequences.

The inverted pyramid diagrams illustrate how the initial candidate peptide sequences were refined into AMP candidates for the a seq-based, b MSA-based, and c MSA-conditional groups. ‘Clean sequence’ denotes the initial pool of sequences after those containing ambiguous amino acids were removed; each subsequent step shows the percentage of sequences remaining after that screening condition. ‘Physical properties’ denotes sequences that passed the physicochemical filter (net charge greater than 0 at pH 7 and a hydrophilic amino acid ratio between 40% and 70%). ‘XGBoost’ denotes sequences identified as AMPs by the XGBoost-based discriminator. The accompanying Venn diagrams show AMP candidates with predicted activity against Escherichia coli and Staphylococcus aureus, defined as a predicted MIC of less than 5 μM from the LSTM-based scorer. d The proportion of each amino acid in the AMP candidates that passed the discriminator. Distributions of physicochemical properties for sequences classified as AMPs by the discriminator: e sequence length, f net charge, g isoelectric point, h hydrophobic moment, i Boman index, and j instability index. The MSA-conditional model consistently generates more sequences, spanning a wider range of physicochemical properties, than the seq-based and MSA-based models, indicating enhanced variability and potential functional diversity in the MSA-conditional generated peptides. k, l Violin plots of the predicted minimum inhibitory concentration (MIC) values (μM, log10-transformed) against k E. coli and l S. aureus. The number of sequences evaluated for each model is indicated at the bottom of each violin plot. The MSA-conditional model generated sequences with significantly lower predicted MIC values than the seq-based and MSA-based models, indicating higher predicted antibacterial activity against both E. coli and S. aureus.
Statistical significance is indicated by asterisks (***), representing P < 0.001 (Kruskal-Wallis multiple comparison, with P-values adjusted using the Benjamini-Hochberg method).
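The first two screening steps in panels a to c ('Clean sequence' and 'Physical properties') can be sketched as below. This is a minimal illustration, not the authors' implementation. The caption gives only the thresholds (net charge greater than 0 at pH 7, hydrophilic ratio between 40% and 70%), so the pKa table, the set of residues counted as hydrophilic, the function names, and the choice to report percentages relative to the clean pool are all assumptions. The XGBoost discriminator and LSTM scorer steps are not reproduced here.

```python
# Illustrative sketch of the 'Clean sequence' and 'Physical properties' steps.
# ASSUMPTIONS (not given in the caption): the pKa values, the hydrophilic
# residue set, and all function names below are hypothetical choices.

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
HYDROPHILIC = set("RKDENQHST")  # assumed definition of "hydrophilic"

# Assumed pKa values for ionisable groups (illustrative table).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}


def is_clean(seq: str) -> bool:
    """True if the sequence is non-empty and uses only the 20 standard residues."""
    return len(seq) > 0 and set(seq) <= STANDARD


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Henderson-Hasselbalch estimate of net charge at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))  # free C-terminus
    for aa in seq:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge


def hydrophilic_ratio(seq: str) -> float:
    """Fraction of residues in the assumed hydrophilic set."""
    return sum(aa in HYDROPHILIC for aa in seq) / len(seq)


def passes_physical_filter(seq: str) -> bool:
    """Thresholds from the caption: charge > 0 at pH 7, ratio in [0.40, 0.70]."""
    return net_charge(seq, 7.0) > 0 and 0.40 <= hydrophilic_ratio(seq) <= 0.70


def screen(seqs):
    """Return the surviving sequences and the percentage of the clean pool kept."""
    clean = [s for s in seqs if is_clean(s)]
    kept = [s for s in clean if passes_physical_filter(s)]
    pct = 100.0 * len(kept) / len(clean) if clean else 0.0
    return kept, pct


if __name__ == "__main__":
    demo = ["KKLLKKLLKK", "DDEEDDEE", "LLLLLLLLKK", "KKXLL"]
    kept, pct = screen(demo)
    print(kept, f"{pct:.1f}% of clean pool")
```

In this toy input, the sequence containing `X` is dropped at the cleaning step, the acidic peptide fails the charge threshold, and the leucine-rich peptide fails the hydrophilic-ratio threshold. A different pKa table or hydrophilic residue set would shift borderline cases.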