AMPGen: an evolutionary information-reserved and diffusion-driven generative model for de novo design of antimicrobial peptides

Jin, Shuwen; Zeng, Zihan; Xiong, Xiyan; Huang, Baicheng; Tang, Li; Wang, Hongsheng; Ma, Xiao; Tang, Xiaochun; Shao, Guoqing; Huang, Xingxu; Lin, Feng

doi:10.1038/s42003-025-08282-7

Article
Open access
Published: 30 May 2025

Shuwen Jin^1,na1,
Zihan Zeng^1,2,na1,
Xiyan Xiong^1,na1 (ORCID: orcid.org/0000-0002-0240-5971),
Baicheng Huang^1,
Li Tang^3,
Hongsheng Wang^1,
Xiao Ma^1,
Xiaochun Tang^4,
Guoqing Shao^4,
Xingxu Huang^5 (ORCID: orcid.org/0000-0001-8934-1247) &
Feng Lin^4,6 (ORCID: orcid.org/0000-0002-1199-5870)

Communications Biology volume 8, Article number: 839 (2025)

Abstract

The rapid advancement of artificial intelligence (AI) has enabled de novo design of functional proteins, circumventing the reliance on natural templates or sequencing databases. However, current protein design models are ineffective in generating proteins without stable structures, such as antimicrobial peptides (AMPs), which are short and structurally flexible yet play critical biological roles. To address this challenge, we present AMPGen, an evolutionary information-reserved and diffusion-driven generative model for de novo design of target-specific AMPs. AMPGen integrates several AI tools, including a generator, a discriminator, and a scorer, along with biochemical knowledge-based screening programs. The generator employs a pre-trained, order-agnostic autoregressive diffusion model, which performs axial attention to capture protein evolutionary information from multiple sequence alignments (MSAs). The AMP-MSA conditional input raises the success rate of generated AMPs, which are subsequently filtered based on physicochemical properties and assessed by an XGBoost-based discriminator. The final target-specific scoring is performed with an LSTM-based scorer, resulting in high-quality AMP candidates. Of the 40 de novo designed AMP candidates selected for verification, 38 were successfully synthesized, and 31 of these (81.58%) demonstrated antibacterial activity. These AMPs designed by AMPGen are absent from existing AMP databases, and exhibit high antibacterial capacity, sequence diversity, and broad-spectrum activity.

Introduction

AI-driven protein design enables the generation of protein sequences that are not found in nature by leveraging deep learning, generative models, and evolutionary principles. This progress in protein design models is poised to revolutionize biomedical technologies, significantly accelerating advancements in the medical healthcare system. Currently, protein generation is predominantly achieved through two types of large protein models, which are trained on an entire protein library. The first type is sequence-based models, which aim to capture the biochemical constraints that characterize the proteins within the training set derived from extensive sequence data, with ProGen being a notable example^1,2,3,4,5. The second is structure-based models, which aim to align the sequence-structure-function relationship of proteins, as represented by models like RoseTTAFold^6,7,8,9. These protein design models have already demonstrated remarkable success in various applications, including enzyme design, optimization, and antibody engineering¹⁰.

Despite these successes with proteins that have stable secondary and tertiary structures, few have been reported for the significant subset of proteins that lack such stable structure, especially peptides and intrinsically disordered proteins. Indeed, modeling short and flexible peptides challenges the conventional sequence-structure-function paradigm¹¹, though the distinction between peptides and proteins is neither strict nor universally accepted. According to the International Union of Pure and Applied Chemistry (IUPAC), oligopeptides typically contain fewer than 10–20 amino acids, while polypeptides consist of more than 20 residues. Proteins, on the other hand, are generally defined as polypeptides with more than approximately 50 amino acids¹². Some researchers, however, define short peptides as those containing no more than 45 amino acids¹³.

Peptides, including signaling peptides, peptide hormones, neuropeptides, therapeutic peptides, and antimicrobial peptides (AMPs), are prevalent across all life forms and perform critical biological roles despite their lack of stable structure. However, the current generation of protein design models, which often rely on structural information or focus on generating protein backbones, are limited in their ability to effectively address the unique characteristics of structurally unstable peptides. To address this challenge, we developed a model specifically for generating functional protein sequences without defined tertiary structures, using AMPs as a case study.

AMPs are a class of peptides, typically composed of 12–50 amino acids, that can kill bacteria, viruses, and fungi by disrupting biofilms or forming transmembrane channels. Unlike structured proteins, AMPs are inherently disordered and lack a defined tertiary structure, exhibiting a high degree of plasticity¹⁴. AMPs show high biocompatibility and a broad antimicrobial spectrum and do not induce drug resistance, making them promising candidates for clinical translation as new therapeutic agents¹⁵. While traditional bioinformatics approaches have shown promise in identifying AMPs from genomic databases^16,17,18,19,20, these methods are constrained by the limitations of existing databases. Although several studies on protein generation models have shown potential in creating uncharacterized functional AMPs, these models often fall short in effectiveness due to their lack of adaptation to the structurally flexible nature of peptides²¹.

In this work, we propose a generative model, AMPGen, for the de novo design of target-specific AMP sequences. It comprises a generator, a discriminator, and a scorer, augmented by necessary biochemical knowledge-based screening. The generator leverages an order-agnostic autoregressive diffusion model that is pre-trained on the OpenFold database (https://registry.opendata.aws/openfold/), incorporating an axial attention mechanism to capture protein evolutionary information in multiple sequence alignment (MSA) format within the latent space. An AMP-MSA dataset is constructed and used as input to enhance the model’s success rate. Compared to baseline models, incorporating evolutionary information enhances the model’s learning capability. Considering both the synthesis cost and potential applications, we define the length of the generated sequences to range from 15 to 35 amino acids. The generated sequences are then sequentially filtered based on their physicochemical properties and evaluated with an XGBoost-based discriminator. Finally, target-specific scoring is conducted using an LSTM-based scorer, ultimately yielding the final AMP candidates. Experimental validation demonstrates that AMPGen possesses a distinctive and highly efficient capability for AMP generation, achieving an 81.58% positive rate, and producing AMPs that, to the best of our knowledge, have not been previously reported in existing protein databases.

Results

The architecture for de novo AMP design

To address the unique challenges posed by the short length, high diversity and inherent flexibility of short peptide sequences²², the AMPGen architecture possesses several key innovations (Fig. 1). Central to AMPGen is a cascade model consisting of a generator, a discriminator and a scorer, each contributing distinct dimensions of AMP-specific information to enhance the learning process and provide a comprehensive understanding of AMP characteristics. The generator is initially trained on a large, universal protein database to learn the fundamental patterns of protein sequences. To refine the model’s focus on AMP, we employed a dataset enriched with AMP evolutionary information as input and incorporated an axial attention mechanism to enhance the model’s learning capabilities. The generated sequences are subsequently filtered using a discriminator based on a binary XGBoost classifier, followed by target-specific scoring via an LSTM regression model. The discriminator and scorer employ masking techniques for feature extraction and a language model embedding approach, respectively. Their combined application enhances the performance of AMPGen. Furthermore, to accommodate the structurally flexible nature of AMPs, the generator is specifically designed to rely solely on one-dimensional sequence data without incorporating structural data. The order-independent diffusion model is employed as the generator due to its ability to produce a wider diversity of results.

**Fig. 1: Overview of AMPGen for de novo AMP sequence design.**

First, an AMP-MSA dataset containing evolutionary information is constructed as the input of the model (MSA-conditional generation). For each sequence in the AMP dataset, we generated an MSA by searching the UniClust30 database with HHblits²³. Aside from this, the only required input for sequence generation by the model is the specified length range of the desired sequence. Considering the known properties of AMPs and the cost of synthesis, we define the length of the generated sequences to be between 15 and 35 aa²⁴. To evaluate the effectiveness of our approach, we employed two other generation methods as baselines for comparison: a generation method based solely on protein sequences (seq-based generation), and a method based on an evolutionary-scale dataset of MSAs without the AMP-MSA conditional input (MSA-based generation). AMPGen employs an order-agnostic autoregressive diffusion model, pre-trained on the entire protein sequence database, for the generation of short peptide sequences²⁵, while the seq-based generation model adopts a ByteNet-style CNN architecture²⁶. Diffusion models are a generative technique that has demonstrated success in both image and text generation. These models are capable of producing highly diverse outputs and can be conditioned on input data, making them well-suited for the generative modeling of peptides²⁷. By comparing our model with the baselines, we can directly assess the impact of incorporating sequence evolutionary information and conditional datasets on enhancing the model’s ability to design sequences.

After generating millions of initial candidate sequences, we filtered out those containing ambiguous amino acids (indicated by U, O, B, Z, J, or X), resulting in a set of clean sequences. To further refine the dataset, we retained only sequences with a net positive charge (net charge >0 at pH 7) and a hydrophobic amino acid proportion between 40% and 70%²⁴. These physicochemical criteria are characteristic of AMPs and are crucial for their activity. An XGBoost-based discriminator is then built to determine whether each candidate sequence is an AMP. The discriminator employs an embedding approach utilizing various feature extraction methods^28,29,30. For sequences classified as AMPs, we trained an LSTM regression model to predict their minimum inhibitory concentration (MIC) values against target species. Specifically, Gram-negative Escherichia coli and Gram-positive Staphylococcus aureus were selected as target species for subsequent wet-lab validation. The LSTM regression model utilizes a protein language model-based embedding technique, specifically ESM2-t36-3B⁶. ESM-2 is a transformer-based model designed to capture complex relationships across protein sequences. The embeddings generated by ESM-2 represent features for each residue, enabling the LSTM to learn dependencies and patterns over the sequence. Regarding model performance, the XGBoost discriminator achieved an F1 score of 0.96, an accuracy of 0.96, and a recall of 0.95. Model performance was assessed using ten-fold cross-validation and ROC analysis, yielding an average area under the curve (AUC) of 0.99 (Fig. S1). The LSTM model for predicting MIC values achieved an R-squared value of 0.89 on the validation set for E. coli and 0.86 for S. aureus (Fig. S2).
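The screening rules described above (no ambiguous residues, net positive charge, 40–70% hydrophobic residues, length 15–35 aa) can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the hydrophobic residue set and the charge approximation (counting K and R as +1 and D and E as −1, ignoring histidine and the termini) are assumptions, since the text does not specify how either was computed.

```python
# Illustrative physicochemical pre-filter; not the authors' implementation.
AMBIGUOUS = set("UOBZJX")        # ambiguous/non-standard codes named in the text
HYDROPHOBIC = set("AVILMFWC")    # ASSUMPTION: the text does not list the residue set


def is_clean(seq: str) -> bool:
    """True if the sequence is non-empty and has no ambiguous residue codes."""
    return len(seq) > 0 and not (set(seq.upper()) & AMBIGUOUS)


def approx_net_charge(seq: str) -> int:
    """ASSUMPTION: crude charge near pH 7 (K, R = +1; D, E = -1; His and termini ignored)."""
    s = seq.upper()
    return s.count("K") + s.count("R") - s.count("D") - s.count("E")


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues that belong to the assumed hydrophobic set."""
    s = seq.upper()
    return sum(1 for aa in s if aa in HYDROPHOBIC) / len(s)


def passes_filters(seq: str, min_len: int = 15, max_len: int = 35) -> bool:
    """Apply the length, cleanliness, charge and hydrophobicity rules in turn."""
    if not is_clean(seq):
        return False
    if not (min_len <= len(seq) <= max_len):
        return False
    if approx_net_charge(seq) <= 0:
        return False
    return 0.40 <= hydrophobic_fraction(seq) <= 0.70


if __name__ == "__main__":
    for candidate in ["KLLKKLLKKLLKKLLKK", "KLLKXLLKKLLKKLLKK", "DDDDEEEELLLLLLLLL"]:
        print(candidate, passes_filters(candidate))
```

A pH-dependent charge model based on residue pKa values would be a natural refinement. The simple count is used here only to keep the rule visible.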

High-throughput generation of candidate AMPs

In the de novo design of AMPs, we initially generated a total of 70,000 raw sequences using the MSA-conditional generation method. For baseline comparisons, we also generated 70,000 sequences using the MSA-based generation and 50,000 sequences using the seq-based generation. After filtering out sequences containing ambiguous amino acids, we obtained 59,944 clean sequences from the MSA-conditional generation method, 47,511 from the MSA-based generation and 49,999 from the seq-based generation method (Fig. 2a–c, Table S1). It is important to note that the number of generated sequences was determined by the experimental setup. All subsequent comparisons between the generation strategies and baseline models are therefore based on relative ratios of sequences.

**Fig. 2: General features of generated AMP candidate sequences.**

To assess whether the generator produced short peptide sequences with AMP-like characteristics, we analyzed the physicochemical properties and amino acid composition of the clean sequences. All three groups of generated sequences exhibited positive charges and high isoelectric points, similar to validated AMPs in public databases. However, these properties did not significantly distinguish them from non-AMP sequences (Fig. 2e–j, Data S1). Regarding amino acid composition, AMPs are reported to contain higher contents of positively charged amino acids such as lysine, arginine, and histidine, as well as hydrophobic amino acids, which form the structural basis for their biological activities and antibacterial effects³¹. Our results indicated that the AMP dataset had higher proportions of lysine (16.25% ± 13.4% in AMPs vs. 8.41% ± 6.9% in non-AMPs) and leucine (12.54% ± 12.4% in AMPs vs. 8.73% ± 6.2% in non-AMPs), with fold changes of 1.93 and 1.44, respectively (Fig. 2d, Data S1). The model successfully learned these characteristics from AMP sequences and their evolutionary information, resulting in generated sequences with a lysine proportion of 12.07% ± 8% and a leucine proportion of 9.65% ± 6.9% (Data S1).
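The composition comparison can be reproduced with a short calculation. The sketch below assumes that each reported percentage is the mean of per-sequence residue percentages and that the fold change is the simple ratio of the two means. The function names are hypothetical, and only the four mean values come from the text.

```python
# Illustrative composition statistics; function names are hypothetical.
from statistics import mean


def residue_percentage(seq: str, residue: str) -> float:
    """Percentage of positions in one sequence occupied by `residue`."""
    return 100.0 * seq.upper().count(residue.upper()) / len(seq)


def mean_residue_percentage(seqs, residue: str) -> float:
    """ASSUMPTION: dataset-level value is the mean of per-sequence percentages."""
    return mean(residue_percentage(s, residue) for s in seqs)


def fold_change(amp_mean: float, non_amp_mean: float) -> float:
    """Ratio of the AMP mean to the non-AMP mean."""
    return amp_mean / non_amp_mean


if __name__ == "__main__":
    # Means reported in the text for lysine and leucine.
    print(round(fold_change(16.25, 8.41), 2))  # lysine  -> 1.93
    print(round(fold_change(12.54, 8.73), 2))  # leucine -> 1.44
```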

Evolutionary information and conditional input reserved AMPs

To assess whether incorporating evolutionary information and conditional input in the generator enhanced its ability to generate functional peptides, we compared the MSA-conditional generation with two baselines. Sequences that passed the discriminator were considered AMP candidates. From the MSA-conditional generation, we obtained a total of 28,439 AMP candidates, while the MSA-based generation yielded 7,608 AMP candidates, and the seq-based generation produced 3,396 AMP candidates. These figures correspond to 47.44%, 16.01%, and 6.79% of the clean sequences in each group, respectively (Fig. 2a). The MSA-conditional generation approach demonstrated a higher success rate based on model predictions compared to the baselines that rely solely on sequence databases or lack an AMP-MSA database as a condition. This result indicates that the generative model has effectively learned and incorporated the evolutionary information encoded within the AMP-MSA dataset, improving its capability to design functional AMPs. Note that this conclusion rests on computed physicochemical screening and XGBoost classifier predictions rather than direct experimental validation, so it offers supporting rather than conclusive evidence.
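Because the three methods started from different numbers of clean sequences, the comparison uses ratios rather than raw counts. As a quick check, the percentages follow directly from the candidate and clean-sequence counts reported in the text (the dictionary layout and function name are illustrative only):

```python
# Candidate counts and clean-sequence counts as reported in the text.
COUNTS = {
    "MSA-conditional": (28_439, 59_944),
    "MSA-based": (7_608, 47_511),
    "seq-based": (3_396, 49_999),
}


def pass_rate(candidates: int, clean: int) -> float:
    """Share of clean sequences that the discriminator accepted, in percent."""
    return 100.0 * candidates / clean


if __name__ == "__main__":
    for method, (candidates, clean) in COUNTS.items():
        print(f"{method}: {pass_rate(candidates, clean):.2f}%")
```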

AMPGen is designed to generate AMP candidates against specific antimicrobial targets. The commonly used Gram-negative target Escherichia coli and Gram-positive target Staphylococcus aureus were selected as representative species. The LSTM-based scorer was employed to rank all sequences previously identified as AMPs based on their MIC values. Using a threshold MIC value of less than 5 μM, the scorer determined that 3.88% of the sequences generated by the MSA-conditional method were potent against E. coli and 2.15% were potent against S. aureus (Fig. 2a). This generation success rate was higher than the MSA-based generation baseline, which yielded 0.32% anti-E. coli AMPs and 0.2% anti-S. aureus AMPs. The seq-based generation method produced even lower pass rates, with only 0.04% anti-E. coli AMPs and 0.01% anti-S. aureus AMPs. An analysis of the physical-chemical properties of the AMP candidates (Fig. S3–S5) revealed that the hydrophobicity of sequences generated by the MSA-conditional method was generally higher (mostly distributed at values greater than 0) compared to those generated by the seq-based and MSA-based methods (mostly distributed at values less than 0). Interestingly, although the raw sequences across all generation methods were uniformly distributed in length between 15 and 35 amino acids, the AMP candidates generated by the seq-based and MSA-based methods predominantly clustered below 20 amino acids. In contrast, the AMP candidates generated by the MSA-conditional method tended to be longer, with the majority exceeding 20 amino acids (Fig. 3b, Data S1).

**Fig. 3: Validation of AMP candidates.**

To validate the antibacterial activity of the AMPs designed by AMPGen, 20 sequences were randomly selected from the top 100 candidates targeting E. coli and S. aureus, respectively (Data S2). These sequences were chemically synthesized and subsequently subjected to antibacterial performance assays.

A wet-lab validation protocol

To confirm the antibacterial activity of the AMP sequences generated by AMPGen, wet-lab antibacterial assays were conducted for experimental validation. Out of the 40 selected AMP candidates, 38 were successfully chemically synthesized (18 targeting S. aureus and 20 targeting E. coli), resulting in a 95% synthesis success rate (Data S3 and Table S2). We determined the MIC values of the synthetic AMPs against the common pathogens E. coli (K88) and S. aureus (ATCC 29213), using ampicillin and polymyxin B as positive controls. Of the 38 synthesized candidates, 31 exhibited antibacterial effects (MIC ≤ 75 µM against S. aureus or E. coli), achieving an 81.58% positive design rate. Specifically, 23 candidates showed MICs of ≤25 µM against E. coli, and 11 showed MICs of ≤25 µM against S. aureus (Fig. 3a and Data S3). In detail, 19 out of 20 anti-E. coli candidates and 8 out of 18 anti-S. aureus candidates demonstrated antibacterial activity (Fig. 3a and Data S3). Overall, the success rate for designing target-specific AMPs was 95% for Gram-negative bacteria (E. coli) and 44.4% for Gram-positive bacteria (S. aureus). Among the validated AMP candidates, 9 exhibited inhibitory effects against both Gram-negative and Gram-positive bacteria (Fig. 3a, c). Notably, AMP-15 showed the most potent inhibitory activity, with MIC values of 0.71 µM against E. coli and 1.41 µM against S. aureus.
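For reference, the reported rates follow from the counts given above. The labels are descriptive only, and the positive design rate counts activity against either species, as the text defines it:

```latex
\begin{align*}
\text{synthesis success rate} &= \frac{38}{40} = 95\% \\
\text{positive design rate} &= \frac{31}{38} \approx 81.58\% \\
\text{target-specific rate, \textit{E. coli}} &= \frac{19}{20} = 95\% \\
\text{target-specific rate, \textit{S. aureus}} &= \frac{8}{18} \approx 44.4\%
\end{align*}
```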

To further assess the potential of AMP candidates as antibiotic alternatives, we selected the nine top-performing peptides for additional antibacterial and hemolytic assays. Notably, the sequences generated by AMPGen demonstrated strong efficacy (Fig. 3c, Data S4, and Fig. S6). Specifically, these nine most effective AMP candidates, each demonstrating activity against both Gram-negative and Gram-positive bacteria with MIC < 5 μM, were subjected to further antibacterial analysis against additional strains of S. aureus (ATCC 25923) and E. coli (ATCC 25922), as well as other pathogens including P. aeruginosa (ATCC 27853) and E. faecalis (ATCC 29212). The results (Fig. 3c and Data S4) revealed that all selected candidates displayed strong inhibitory effects against all four tested bacteria, except for AMP-17 and AMP-20, which exhibited MIC values above 25 μM against P. aeruginosa. AMPGen was used here to design AMP sequences for specific targets (E. coli and S. aureus); its modules captured functional information from the AMP dataset within the model’s latent space, enabling target-specific design. The results indicated that AMPGen successfully generated potent AMPs targeted at specific pathogens, while some of the designed peptides also exhibited broad-spectrum antimicrobial properties.

This antibacterial mode of action was further confirmed by propidium iodide staining and microscopy (Fig. 4), which clearly indicated membrane disruption upon AMP treatment. For AMPs to serve as effective antibiotic alternatives, they would ideally exhibit strong antibacterial activity with minimal hemolytic effects, indicating selectivity for targeting bacteria over human cells. Comparative analyses of cytotoxic concentration (CC50), hemolytic concentration (HC50), and minimum inhibitory concentrations (MICs) against E. coli (K88) and S. aureus (ATCC 29213) indicated favorable selectivity profiles between antibacterial activity and hemolytic effects (Figs. 4a, b, S6 and S7).

**Fig. 4: Bioactivity characterization and action of functional AMPs on *E. coli*.**

Effectively assimilated knowledge of the AMPs

To further characterize the conformations and uniqueness of the validated AMP sequences, we analyzed their predicted structures and conducted sequence similarity searches in relevant databases. Based on the conformational characteristics, AMPs can be broadly categorized into α-helical peptides, β-sheet-containing peptides, structured linear peptides, and other mixed-structure peptides^14,22. In this study, we employed AlphaFold3 to predict the conformations of the 38 experimentally verified AMP candidates designed by AMPGen. Based on the AlphaFold3 results, the majority of these sequences were identified as α-helical AMPs, followed by β-sheet-containing AMPs, with some also classified as structured linear AMPs, including AMP11, AMP14, and AMP33 (Fig. 3a). Additionally, we used PepFold4, a tool specifically designed for predicting the conformation of short peptides (typically fewer than 36 amino acids), for comparison³². PepFold4 and AlphaFold3 generated largely comparable predictions (Fig. S8). It is important to note that these predictions represent the calculated preferred structure; the conformations may change due to the inherent flexibility of the peptide, such as when the AMP interacts with the biofilm. This diversity in conformational structures among the designed AMPs indicates that AMPGen has effectively assimilated comprehensive evolutionary information from the OpenFold database and functional information from the AMP-MSA conditional dataset.

Furthermore, the AMPs designed by AMPGen have not been previously reported in any existing databases. A BLAST comparison of the validated sequences against the non-redundant (nr) protein sequence database, which includes entries from GenPept, SwissProt, PIR, PRF, PDB, and NCBI RefSeq, revealed no significantly matching sequences (Data S5). Among the 40 sequences analyzed, 18 showed no hits, while the remaining 22 exhibited a percent identity of 72.45% ± 12.0% and a query cover of 83% ± 20.0% (Table S2 and Data S6). Additionally, the AMP candidates showed evolutionary diversity (Data S7). These findings suggest that AMPGen is capable of successfully designing active AMP sequences that are not currently identifiable through existing data mining approaches.

Although the modules involved in AMPGen function as black box models, the compelling verification results clearly demonstrate its ability to learn within the latent space and uncover hidden patterns and principles in protein evolutionary data.

Discussion

Recent advancements in generative models have enabled the development of protein generation systems capable of autonomously creating proteins from scratch. Several works have used deep generative models to generate AMPs^17,21,33,34,35,36,37,38,39,40. For instance, PepGAN, a GAN-based AMP generation model, produced 6 top-ranked peptides, of which only one exhibited a notably potent effect with an MIC of 3.1 μg/mL²¹. Another study explored the extensive virtual peptide space by enumerating a vast number of sequences composed of 6–9 amino acids. This approach successfully identified several active AMPs, including three hexapeptide AMPs. However, as the length of peptide sequences increases, the data volume and computational demands grow exponentially, posing significant challenges to scalability and feasibility³⁹. In a different study⁴⁰, a variational autoencoder (VAE) model pre-trained on approximately 1.5 million peptide sequences from the UniProt database was employed. This model, when fine-tuned through transfer learning on a smaller dataset of around 5,000 experimentally verified AMPs, enabled the generation of peptide sequences. A CNN/RNN model was subsequently employed to predict the MIC of these candidates, facilitating their ranking as AMP candidates. This approach successfully identified 500 potential AMPs, of which 30 were experimentally confirmed to exhibit antibacterial properties. The above models were not pre-trained on a complete protein dataset. It has also been shown that integrating pretrained protein language models, such as ProtT5 and ESM-2, with diffusion models is an effective strategy to generate peptides^34,41. In addition, some studies have used generative models to design AMP sequences; however, they did not perform functional verification through physical synthesis and experimental evaluation^42,43. Experimental validation is essential for accurately assessing the functional potential of generated sequences. In this study, we synthesized the sequences generated by our model and determined their MIC values following the Clinical and Laboratory Standards Institute guidelines (CLSI M100, 30th ed.). To strengthen the evaluation, we also synthesized representative sequences from previous studies and measured their MIC values under the same conditions for parallel validation and comparison (Tables S3 and S4, Data S8).

AMPGen stands at the forefront of AMP design by generating a wider variety of AMP sequences with strong antibacterial activity (Tables S3 and S4, Data S8, Fig. S9). As variations in target strains and synthesis methods can significantly affect the antimicrobial efficacy of AMPs, direct comparisons across studies are challenging. To address this issue, we selected six of the top-performing sequences from various studies (Data S8), with lengths less than 35 amino acids (where available) and the lowest reported MICs against E. coli (Table S4). These sequences were synthesized and tested for MIC using the same protocol as our AMPs. The results demonstrated that AMPGen-generated sequences performed comparably to the top-ranked sequences (Data S4 and Table S4). We then calculated the pairwise identity of the sequences designed by AMPGen and by previously reported works, a common method for quantifying sequence similarity that helps assess how novel or diverse the generated sequences are. Sequence alignment was performed with the BLOSUM62 matrix as the scoring scheme. The results indicated that AMPGen generated the most diverse sequences, exhibiting the lowest average pairwise identity (Fig. S9). This suggested that AMPGen was capable of exploring a broader sequence space compared to existing models, which may be attributed to its integration of sequence evolutionary information into the diffusion-based generation process. The resulting low redundancy among AMPGen-generated sequences underscores its strong potential for innovation in AMP design, making it a promising tool for expanding the repertoire of antimicrobial therapeutics. While the incorporation of MSAs introduces some additional computational cost, this overhead remains relatively modest and does not substantially impact the model’s overall efficiency.
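To make the diversity metric concrete, the sketch below computes an average pairwise identity from global alignments. It is a simplified stand-in, not the authors' pipeline: it uses a plain match/mismatch/gap score rather than the BLOSUM62 matrix named in the text, and the identity denominator (alignment length including gaps) is an assumption, since the text does not state which convention was used.

```python
# Illustrative average pairwise identity via Needleman-Wunsch global alignment.
# ASSUMPTION: simple +1/-1/-1 scoring replaces BLOSUM62 to stay dependency-free.
from itertools import combinations


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return one optimal global alignment of a and b as two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned residues divided by alignment length (gaps included)."""
    x, y = global_align(a, b)
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


def mean_pairwise_identity(seqs) -> float:
    """Average percent identity over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)
```

A lower mean value indicates a more diverse set. Swapping in a substitution matrix would change which alignment is chosen, and therefore possibly the identity values, but not the overall procedure.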

Diffusion models are particularly well-suited for generating peptides due to their capacity to produce diverse outputs and their underlying mechanisms, which can mimic the natural processes of protein evolution⁴⁴. In living organisms, gene mutations occur as changes in the DNA sequence, which can alter codons—the triplet sequences of nucleotides in DNA or mRNA that specify particular amino acids. When mutations occur, they can result in different codons being formed during transcription and translation, potentially altering the amino acid sequence of the resulting protein⁴⁵. These changes can impact the protein’s structure and function, leading to various biological consequences. Interestingly, the evolutionary process of proteins in living organisms mirrors the point-by-point addition and reduction of noise within an order-independent diffusion model²⁵. In this analogy, the modulation of noise at each site corresponds to amino acid “mutations” within the latent space, with the generator trained to recognize permissible computational “mutations” within the vast universal protein library. The generator, based on an order-agnostic autoregressive diffusion model, captures the evolutionary information inherent in amino acid sequences, with the goal of generating sequences that are both biologically plausible and evolutionarily sound. During protein evolution, mutations occur randomly and are finally represented as matrixed MSA data for input into the model. As a result, the order-agnostic model is well-suited to learning protein evolution. Considering the conformation heterogeneity of peptides, we implemented the generation module based on protein evolution, which does not rely on PDB structural data.
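The order-agnostic idea can be illustrated with a toy decoder: start from a fully masked sequence, visit the positions in a random order, and fill each one conditioned on what has been decoded so far. Everything in this sketch is hypothetical. In particular, `toy_predictor` is a random placeholder for the trained network, which in AMPGen would also condition on the AMP-MSA input, so the output has no biological meaning.

```python
# Toy illustration of order-agnostic autoregressive decoding; not the AMPGen generator.
import random

MASK = "#"
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_predictor(partial, position, rng):
    """HYPOTHETICAL stand-in for the trained model: ignores context, samples uniformly."""
    return rng.choice(ALPHABET)


def order_agnostic_sample(length, rng, predictor=toy_predictor):
    """Fill a fully masked sequence one position at a time in a random order."""
    seq = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)  # the decoding order is random rather than left-to-right
    for pos in order:
        seq[pos] = predictor(seq, pos, rng)  # a real model would use the partial context
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    length = rng.randint(15, 35)  # length range used in the text
    print(order_agnostic_sample(length, rng))
```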

Generating proteins with specific functions represents one of the most promising yet challenging frontiers in the application of large-scale models. A key challenge in this field is establishing a reliable association between protein sequence and function, particularly given the scarcity of functional protein data. To overcome these challenges, we employ a modular cascade model that enhances the accuracy of peptide generation. Cascade models improve accuracy by incrementally refining decisions or predictions⁴⁶. Initial stages can quickly eliminate obvious cases, allowing subsequent stages to focus on more complex or nuanced data. In AI cascade models, the interdependence of modules can enhance overall model performance. However, errors or inaccuracies in any single module can propagate and magnify throughout the processing chain, ultimately leading to a substantial degradation or even complete system failure—a phenomenon known as cascading failure. In AMPGen, the discriminator plays a crucial role in mitigating cascading failures. This is important because all scorers within the model are trained on antibacterial data from empirical experiments. Given that the antibacterial dataset predominantly consists of AMPs, the scorers exhibit low confidence when evaluating non-AMP sequences. To address this limitation, we introduce a discriminator trained on both AMP and non-AMP datasets, which we specifically curated. We deliberately selected different models for the discriminator and scorer modules to avoid redundancy and allow task specialization. This enhancement mitigates scoring bias, leading to more reliable performance across a broader range of sequences.
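The cascade logic described here can be sketched as a chain of filters in which each stage sees only what the previous stage let through, so the scorer is never asked about sequences the discriminator rejected. The stage functions below are hypothetical placeholders for the physicochemical screen, the XGBoost discriminator, and the LSTM scorer. Only the MIC threshold of 5 µM comes from the text.

```python
# Illustrative cascade: filter -> discriminator -> scorer. Stage functions are placeholders.
from typing import Callable, Iterable, List, Tuple


def run_cascade(
    candidates: Iterable[str],
    property_filter: Callable[[str], bool],   # stands in for physicochemical screening
    is_amp: Callable[[str], bool],            # stands in for the XGBoost discriminator
    predict_mic: Callable[[str], float],      # stands in for the LSTM scorer (µM)
    mic_threshold: float = 5.0,               # threshold from the text: MIC < 5 µM
) -> List[Tuple[str, float]]:
    """Return (sequence, predicted MIC) pairs that survive all stages, best first."""
    survivors = [s for s in candidates if property_filter(s)]
    amp_like = [s for s in survivors if is_amp(s)]
    # The scorer only sees AMP-like sequences, the regime it was trained on.
    scored = [(s, predict_mic(s)) for s in amp_like]
    hits = [(s, mic) for s, mic in scored if mic < mic_threshold]
    return sorted(hits, key=lambda pair: pair[1])


if __name__ == "__main__":
    fake_mic = {"KKK": 2.0, "LLL": 9.0}  # made-up scores for demonstration only
    print(run_cascade(
        ["AAA", "KKK", "LLL"],
        property_filter=lambda s: s != "AAA",
        is_amp=lambda s: True,
        predict_mic=lambda s: fake_mic[s],
    ))
```

Ordering the stages from cheapest to most expensive also means the costly embedding-based scorer runs on the smallest set of sequences.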

We designed an XGBoost-based classifier as the discriminator and an LSTM as the scorer to leverage their complementary strengths. One of the key benefits of XGBoost is its robust performance in handling tabular data and its ability to effectively capture complex, non-linear relationships between features, particularly in scenarios where feature importance can guide decision-making. XGBoost performed well in leveraging features from peptide sequences⁴⁷. On the other hand, LSTM networks are specifically designed to handle sequential data and can capture long-range dependencies between amino acids in a peptide, which is crucial for understanding its function. LSTMs, through their non-linear nature, are also well-suited for regression tasks where the relationship between the input features (here, peptide sequences embedded by ESM-2) and the output (MIC values) is non-linear³⁹. Given the complexity of biological data, linear relationships are unlikely to be sufficient, necessitating the use of machine learning methods. Both XGBoost and LSTM are capable of capturing non-linear relationships: XGBoost does so by combining multiple decision trees to model complex decision boundaries, while LSTM achieves this through deep neural network layers in which each neuron applies a non-linear activation function to its input. It is worth noting that the design and structure of AMPGen can also be applied to the development of other short peptide families, such as therapeutic peptides capable of crossing membranes or targeting intracellular sites.

It is not unexpected that designed peptides generally retain broad-spectrum properties because the design strategy focused on selecting sequences with high scores for a specific target, while not excluding potential applicability to other targets. Moreover, AMPs often exhibit broad-spectrum activity due to their conserved mechanisms of bacterial membrane disruption; achieving strict species selectivity remains a challenge. Additionally, the training dataset of the scorer is based on the species level, while in reality, the same AMP has different antibacterial effects on different strains of the same species, which introduces noise to the target-specific scorer. Therefore, developing models for target-specific AMPs remains a challenge and may require the use of more precise negative datasets.

In conclusion, the incorporation of diffusion models and evolutionary information within a cascade architecture provides a promising approach for the design of functional peptides. By leveraging the diversity-generating capabilities of diffusion models and the informative power of evolutionary data, AMPGen addresses key challenges in de novo AMP design. The resulting AMPs exhibit broad-spectrum activity and improved efficacy, offering a robust solution to the growing problem of antimicrobial resistance. This innovative methodology not only enhances the efficiency of AMP generation but also paves the way for the development of functional peptides. At the same time, this study has certain limitations. Although we have generated candidate AMPs, there remains a long path toward clinical application. Further experimental validation at advanced stages will provide stronger driving forces for future model development. Overall, while generative large models hold great promise for short peptide design, challenges such as the limitations of biological data, unknown functional landscapes, and model interpretability still remain. Foundational models that are better aligned with the complexity of protein language are still a vision.

Methods

Dataset preparation

Both the AMP and non-AMP datasets were used to train the AMP classification model and the MIC regression model, and the AMP dataset was also used as the conditioning input during AMP generation (Fig. S10).

The AMP dataset was compiled from six public AMP databases: APD⁴⁸, DADP⁴⁹, DBAASP⁵⁰, DRAMP⁵¹, YADAMP⁵², and dbAMP⁵³. Entries from these databases were merged, deduplicated, and filtered to remove incomplete or meaningless records. The final dataset consisted of 10,249 unique sequences, of which 9854 have bacterial targets.

The negative dataset (non-AMP sequences) was sourced from the UniProt database⁵⁴. All UniProt sequences with lengths of 5 to 65 amino acids were filtered to exclude those associated with any of 17 keywords: antimicrobial, antibiotic, antibacterial, antiviral, antifungal, antimalarial, antiparasitic, anti-protist, anticancer, defense, defensin, cathelicidin, histatin, bacteriocin, microbicidal, fungicide, and toxin. As with the positive dataset, sequences containing ambiguous amino acids (indicated by U, O, B, Z, J, or X) were excluded. This resulted in a total of 11,989 peptide sequences labeled as non-AMP.
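The filtering rules for the negative dataset can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `is_non_amp_candidate` is hypothetical, and matching keywords as case-insensitive substrings of a free-text annotation is an assumption, since the text does not state how the keywords were matched against UniProt records.

```python
# Ambiguous residue codes excluded from both datasets (from the text).
AMBIGUOUS = set("UOBZJX")

# The 17 exclusion keywords listed in the text.
KEYWORDS = (
    "antimicrobial", "antibiotic", "antibacterial", "antiviral",
    "antifungal", "antimalarial", "antiparasitic", "anti-protist",
    "anticancer", "defense", "defensin", "cathelicidin", "histatin",
    "bacteriocin", "microbicidal", "fungicide", "toxin",
)


def is_non_amp_candidate(sequence, annotation):
    """Return True if a sequence passes the non-AMP filters.

    `annotation` is assumed to be the free-text description or keyword
    field of the record (a hypothetical input format).
    """
    if not 5 <= len(sequence) <= 65:
        return False
    if AMBIGUOUS & set(sequence.upper()):
        return False
    text = annotation.lower()
    return not any(keyword in text for keyword in KEYWORDS)
```

For example, a 9-residue peptide annotated as a ribosomal protein passes, whereas the same peptide annotated as "Defensin-like protein" is rejected.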

De novo AMP generation

Two pre-trained order-agnostic autoregressive diffusion models (OADMs) were deployed for de novo AMP sequence generation: one trained on amino acid sequence data and the other on evolutionary multiple sequence alignment (MSA) data⁵⁵. In both models, the length of the generated sequences was set to 15–35 amino acids, a range that is common for AMPs and amenable to industrial-scale chemical synthesis. Sequences containing ambiguous amino acids (indicated by U, O, B, Z, J, or X) were excluded from the dataset.

The diffusion model used in this study generalizes traditional left-to-right autoregressive models by allowing sequence generation in any arbitrary order, not just a fixed left-to-right progression. This flexibility is particularly advantageous for generating short peptide sequences like AMPs, where the order of amino acids does not necessarily follow a natural or obvious progression.

Mathematically, the model operates by first sampling a random decoding order ${{{\rm{\sigma }}}}$ from all possible orders ${S}_{L}$, where L is the sequence length. The log-likelihood of generating a sequence x is then expressed as an expectation over all possible decoding orders:

$$\log p\left(x\right)\approx {{\mathbb{E}}}_{{{{\rm{\sigma }}}}\sim U\left({S}_{L}\right)}\left[{\sum }_{t=1}^{L}\log p\left({x}_{{{{\rm{\sigma }}}}\left(t\right)}|{x}_{{{{\rm{\sigma }}}}\left( < t\right)}\right)\right]$$

Here, ${x}_{{{{\rm{\sigma }}}}\left(t\right)}$ denotes the amino acid at the position decoded at step t of the order ${{{\rm{\sigma }}}}$, and ${x}_{{{{\rm{\sigma }}}}\left( < t\right)}$ represents all amino acids decoded before it in this order.

The model is trained by minimizing the loss function derived from this log-likelihood, which involves predicting the probability distribution of each amino acid in the sequence conditioned on its preceding amino acids as determined by the randomly sampled order ${{{\rm{\sigma }}}}$. This training process allows the model to learn from predictions of all masked positions at each timestep, thus generalizing the autoregressive framework to consider all possible decoding orders.
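The decoding procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the EvoDiff implementation: `uniform_predictor` is a hypothetical stand-in for the trained network and simply returns a uniform distribution over the 20 standard amino acids, and `sample_order_agnostic` is a hypothetical name.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_predictor(partial, position):
    """Hypothetical stand-in for the network: p(x_position | decoded residues)."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def sample_order_agnostic(length, predictor=uniform_predictor, rng=None):
    """Generate one sequence by unmasking positions in a random order."""
    rng = rng or random.Random()
    order = list(range(length))
    rng.shuffle(order)              # sigma drawn uniformly from S_L
    partial = [None] * length       # None marks a still-masked position
    for position in order:          # step t decodes position sigma(t)
        probs = predictor(partial, position)
        residues, weights = zip(*probs.items())
        partial[position] = rng.choices(residues, weights=weights, k=1)[0]
    return "".join(partial)
```

With a trained predictor in place of the uniform stand-in, each step conditions on the residues already decoded, regardless of where they sit in the sequence.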

The sequence-based model Evodiff-OA_DM_640M was pre-trained on the UniRef50 dataset, which contains 42 million protein sequences and is provided by the UniProt Reference Clusters database⁵⁶. The model adopts ByteNet, a CNN architecture that runs in time linear in the sequence length and sidesteps the need for excessive memorization⁵⁷. We used the sequence-based model to unconditionally generate peptide sequences of 15–35 amino acids. Generating 50,000 sequences required approximately 5 days on a single NVIDIA A6000 GPU.

The MSA-based model Evodiff-MSA_OA_DM_MAXSUB was trained on the OpenFold dataset, containing 401,381 MSAs for approximately 132,000 unique Protein Data Bank sequences and approximately 15 million UniClust30 clusters⁵⁸. This model adopted a 100 M parameter MSA Transformer architecture, which processes the 2D MSA input by interleaving row and column attention across the MSA matrix⁵⁹.

Mathematically, the MSA Transformer represents the input MSA as a matrix $x\in {R}^{M\times L}$, where M is the number of sequences and L is the sequence length. The model applies axial attention, alternating between attention over rows and columns of the matrix, thereby reducing the computational complexity to ${{{\mathscr{O}}}}\left(M{L}^{2}\right)$ for row attention and ${{{\mathscr{O}}}}\left(L{M}^{2}\right)$ for column attention. Additionally, tied row attention is employed to share attention maps across sequences, leveraging the shared structure among the aligned sequences, and using square-root normalization to maintain consistent attention weights across the MSA. The model is pre-trained using a masked language modeling (MLM) objective, where the loss function is given by:

$${L}_{{MLM}}\left(x;{{{\rm{\theta }}}}\right)=-{\sum}_{\left(m,i\right)\in {\mbox{mask}}}\log p\left({x}_{{mi}}|\widetilde{x};{{{\rm{\theta }}}}\right).$$

Here, $\left(m,i\right)$ refers to the masked positions, and $p\left({x}_{{mi}}|\widetilde{x};{{{\rm{\theta }}}}\right)$ represents the probability of correctly predicting the masked amino acid at position $\left(m,i\right)$, given the masked MSA $\widetilde{x}$.
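The complexity figures quoted for axial attention can be made concrete by counting pairwise attention scores. The sketch below is only arithmetic on the stated orders of growth; the function name is hypothetical, and the example values (M = 34, close to the reported average MSA depth, and L = 35, the longest generated length) are chosen for illustration.

```python
def attention_pair_counts(num_sequences, seq_length):
    """Count pairwise attention scores for an M x L MSA.

    row:    each of the M rows attends over L positions   -> M * L^2
    column: each of the L columns attends over M rows     -> L * M^2
    full:   unfactorized attention over all M * L tokens  -> (M * L)^2
    """
    m, l = num_sequences, seq_length
    return {"row": m * l * l, "column": l * m * m, "full": (m * l) ** 2}


counts = attention_pair_counts(34, 35)
# row = 41650, column = 40460, full = 1416100
```

For these illustrative sizes, row and column attention together need about 82,000 scores, compared with about 1.4 million for attention over the flattened MSA.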

Using the MSA-based model, we generated amino acid sequences in two ways: unconditional generation of peptide sequences of 15–35 amino acids (MSA-based), and conditional generation from masked MSAs in which known AMP sequences served as the representative sequences (MSA-conditional). For MSA-conditional generation, we first constructed an AMP MSA dataset as the diffusion model input. MSAs were generated for each AMP sequence in the AMP dataset described above by searching UniClust30⁶⁰ with HHblits²³. Sequences for which no MSA could be obtained were removed. The average depth of the MSAs was 34.5, and the MSAs were stored in A3M format. For MSA-based generation, uniformly sampled inputs were used instead. In both modes, the generated sequence length was set to 15–35 amino acids. Generating 70,000 sequences on a single NVIDIA A6000 GPU required approximately 6 days for MSA-conditional generation and approximately 72 days for MSA-based generation.

XGBoost-based AMP classification

Dataset preparation. Sequences in the AMP dataset described above were filtered by length, retaining those of 5 to 65 amino acids, which resulted in a total of 9964 AMP-labeled peptide sequences as the positive dataset.

Prior to training the XGBoost model, the peptide sequences in the dataset were converted into numerical features. We adopted the feature set reported in the literature for machine learning prediction of MIC values⁶¹. These representative features for encoding AMP sequences were selected from an initial set of 4481 features⁶¹ in a two-step process. First, a Random Forest (RF) model with 400 trees was used to rapidly assess the regression performance of each feature type. The DBAASP⁴⁰ dataset (3929 items in total) was randomly split into training, validation, and test sets: the test set comprised 10% of the data, and the validation set was created by holding out 10% of the remaining data. To ensure reliable error estimation, this splitting process was repeated three times, with each experiment run in triplicate. The average Pearson correlation coefficient (PCC) across the three validation sets was calculated for each feature type to identify the top performers, as follows:

$${PCC}={{{\rm{\rho }}}}=\frac{{\sum }_{i=1}^{N}\left({y}_{i}-{{{{\rm{\mu }}}}}_{y}\right)\left(\hat{{y}_{i}}-{{{{\rm{\mu }}}}}_{\hat{y}}\right)}{\sqrt{{\sum }_{i=1}^{N}{\left({y}_{i}-{{{{\rm{\mu }}}}}_{y}\right)}^{2}{\sum }_{i=1}^{N}{\left(\hat{{y}_{i}}-{{{{\rm{\mu }}}}}_{\hat{y}}\right)}^{2}}},$$

where ${y}_{i}$ represents the true target value, $\hat{{y}_{i}}$ the predicted value, ${{{{\rm{\mu }}}}}_{y}$ and ${{{{\rm{\mu }}}}}_{\hat{y}}$ are the means of the true and predicted values respectively, and N is the total number of samples. This step allowed us to rank the feature types based on their predictive power.
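The PCC defined above can be computed directly from its definition. The following is a minimal sketch using only the standard library; the function name is hypothetical.

```python
import math


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mu_y = sum(y_true) / n
    mu_hat = sum(y_pred) / n
    covariance = sum((y - mu_y) * (p - mu_hat) for y, p in zip(y_true, y_pred))
    var_y = sum((y - mu_y) ** 2 for y in y_true)
    var_hat = sum((p - mu_hat) ** 2 for p in y_pred)
    return covariance / math.sqrt(var_y * var_hat)
```

Predictions that are a positive linear function of the true values give a PCC of 1, and predictions that reverse the ranking linearly give a PCC of -1.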

Subsequently, the top-performing feature types were evaluated using a Multi-Branch-CNN-Attention model. A variety of combinations of these features were tested to find the optimal combination. The models were trained using five-fold cross-validation, and their performances were compared. The final feature combination was selected based on average PCC scores as the one yielding the best predictive accuracy for MIC values against E. coli. This combination was then validated in machine learning models including the TML17 model (which automates the evaluation of 17 different machine learning models), the RF model, and the Support Vector Machine (SVM) model.

We applied the above optimal feature combination to the XGBoost binary classification model. The selected features were primarily derived from the PseKRAAC encoding method²⁸, which includes various clustering types encoding parameters, as well as the QSOrder²⁹ encoding method. The final selection comprised 14 categories encompassing 1311 features.

In model training, following the feature engineering process, the data was used to train an XGBoost model⁶². In this training process, AMP sequences were labeled as 1, and non-AMP sequences were labeled as 0. The XGBoost model optimizes a regularized objective function, which is designed to balance the model’s accuracy and complexity, thereby preventing overfitting. The objective function is defined as:

$$L\left(\phi \right)={\sum }_{i=1}^{n}l\left(\hat{{y}_{i}},{y}_{i}\right)+{\sum }_{k=1}^{K}\Omega \left({f}_{k}\right),$$

where $\hat{{y}_{i}}={{{\rm{\phi }}}}\left({x}_{i}\right)={\sum }_{k=1}^{K}{f}_{k}\left({x}_{i}\right)$ is the prediction for instance ${x}_{i}$, and $\Omega \left(f\right)$ is the regularization term:

$$\Omega \left(f\right)={{{\rm{\gamma }}}}T+\frac{1}{2}{{{\rm{\lambda }}}}{\sum }_{j=1}^{T}{w}_{j}^{2},$$

with T representing the number of leaves in the tree, ${w}_{j}$ the weight of the $j$th leaf, ${{{\rm{\gamma }}}}$ controlling the number of leaves, and ${{{\rm{\lambda }}}}$ controlling the ${L}_{2}$ norm of the leaf weights.

During training, the model is built in an additive manner, optimizing the following objective at each iteration t:

$${L}^{\left(t\right)}={\sum }_{i=1}^{n}l\left({y}_{i},\widehat{{y}_{i}^{\left(t-1\right)}}+{f}_{t}\left({x}_{i}\right)\right)+\Omega \left({f}_{t}\right),$$

which is approximated using a second-order Taylor expansion:

$${L}^{\left(t\right)}\approx {\sum }_{i=1}^{n}\left[{g}_{i}{f}_{t}\left({x}_{i}\right)+\frac{1}{2}{h}_{i}{f}_{t}{\left({x}_{i}\right)}^{2}\right]+\Omega \left({f}_{t}\right),$$

where ${g}_{i}$ and ${h}_{i}$ are the first and second-order gradients, respectively.
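As a concrete instance of the gradient statistics, the sketch below evaluates them for the binary logistic loss. The choice of loss is an assumption made for illustration (it is the usual choice for binary classification, but the text does not state which loss was used), and the function name is hypothetical.

```python
import math


def logistic_grad_hess(label, raw_score):
    """First- and second-order gradients of the logistic loss.

    `raw_score` is the current ensemble output before the sigmoid,
    and `label` is 1 for AMP and 0 for non-AMP.
    """
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    g = prob - label            # first-order gradient
    h = prob * (1.0 - prob)     # second-order gradient
    return g, h
```

For an AMP (label 1) with a raw score of 0, the predicted probability is 0.5, giving g = -0.5 and h = 0.25.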

Model tuning was conducted based on the F1 score and AUC index using 10-fold cross-validation (k-fold 10) to prevent overfitting. The F1 score is a metric that considers both precision and recall to compute a balanced measure of a model’s accuracy, particularly when dealing with imbalanced classes. It is defined as the harmonic mean of precision and recall:

$${{{\rm{F1}}}}\;{{{\rm{Score}}}}=2\times \frac{{{{\rm{Precision}}}}\times {{{\rm{Recall}}}}}{{{{\rm{Precision}}}}+{{{\rm{Recall}}}}},$$

where recall (also known as sensitivity or true positive rate) is the proportion of actual positive cases correctly identified by the model, calculated as:

$${{{\rm{Recall}}}}=\frac{{TP}}{{TP}+{FN}},$$

with TP representing true positives and FN representing false negatives.
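These metrics can be computed from confusion-matrix counts as sketched below. Precision is not defined in the text; the standard definition TP / (TP + FP), with FP the number of false positives, is assumed here, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with 80 true positives, 20 false positives, and 10 false negatives, precision is 0.8, recall is about 0.889, and the F1 score is about 0.842.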

The optimal split at each node during training was determined by maximizing the gain:

$${{{\rm{Gain}}}}=\frac{1}{2}\left[\frac{{\left({\sum}_{i\in {I}_{L}}{g}_{i}\right)}^{2}}{{\sum}_{i\in {I}_{L}}{h}_{i}+{{{\rm{\lambda }}}}}+\frac{{\left({\sum}_{i\in {I}_{R}}{g}_{i}\right)}^{2}}{{\sum}_{i\in {I}_{R}}{h}_{i}+{{{\rm{\lambda }}}}}-\frac{{\left({\sum}_{i\in I}{g}_{i}\right)}^{2}}{{\sum}_{i\in I}{h}_{i}+{{{\rm{\lambda }}}}}\right]-{{{\rm{\gamma }}}},$$

where ${I}_{L}$ and ${I}_{R}$ represent the left and right child nodes after the split.
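The gain formula can be evaluated directly from per-sample gradient statistics. The sketch below is a plain transcription of the expression above, not the XGBoost library's internal code; the function name and the example values are illustrative.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children.

    g_* and h_* are lists of first- and second-order gradients of the
    samples falling into each child; lam and gamma are the
    regularization parameters lambda and gamma.
    """
    def score(grads, hessians):
        return sum(grads) ** 2 / (sum(hessians) + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Two positives on the left and two negatives on the right, all at raw score 0
# (illustrative values using logistic-loss gradients).
gain = split_gain([-0.5, -0.5], [0.25, 0.25], [0.5, 0.5], [0.25, 0.25])
```

In this example the parent node has zero summed gradient, so the whole gain (2/3 with lambda = 1 and gamma = 0) comes from separating the two classes.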

Shrinkage (learning rate ${{{\rm{\eta }}}}$) was applied to each tree’s predictions to scale them:

$$\widehat{{y}_{i}^{\left(t\right)}}=\widehat{{y}_{i}^{\left(t-1\right)}}+{{{\rm{\eta }}}}\;{f}_{t}\left({x}_{i}\right),$$

which helps to prevent overfitting by reducing the impact of each individual tree. The best-performing model, determined through cross-validation, was then utilized for subsequent analyses.

The hyperparameters to be tuned included learning rate (lr), number of estimators (ne), and maximum depth (md). Grid search was used to find the optimal combination of hyperparameters. Specifically, the lr was tested with values of [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], the ne was varied from 50 to 2000 in increments of 100, and the md was varied from 4 to 15. In each iteration, the model was trained and validated exclusively on the 10-fold cross-validation. The final chosen hyperparameters were those that achieved the best validation performance (lr = 0.1, md = 5, ne = 2000).
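The 10-fold cross-validation used to score each hyperparameter combination can be sketched as an index splitter. This is a generic illustration: the function name is hypothetical, and shuffling with a fixed seed and no stratification are assumptions, since the text does not describe how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each sample appears in exactly one validation fold, so averaging the F1 score or AUC over the k folds uses every sample once for validation.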

LSTM regression-based MIC scoring

Dataset preparation. All entries in the AMP dataset described above with MIC values were included. Sequences containing ambiguous amino acids, shorter than 5 amino acids, or longer than 65 amino acids were removed. MIC prediction was modeled for two bacteria, Escherichia coli (Gram-negative representative) and Staphylococcus aureus (Gram-positive representative). The AMP sequences targeting Escherichia coli totaled 7100, while those targeting Staphylococcus aureus totaled 6482. For sequences with multiple MIC values against the same target, the arithmetic mean of the MIC values was used. All values were converted to a uniform unit of μM and then log-transformed (log10). To keep the numbers of positive and negative samples balanced, we randomly selected 60% of the negative dataset described in the XGBoost section, resulting in 7193 sequences, all of which were labeled with a log MIC value of 4. A log MIC of 4 is interpreted as indicating that these sequences do not possess antimicrobial activity, as they would require very high concentrations to exhibit any inhibitory effect on bacterial growth. This labeling helps in training the regression model.
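The label construction can be sketched as follows. The helper names are hypothetical, and the conversion from μg/mL to μM via molecular weight is an assumption about how units were unified, since the text does not give the conversion procedure.

```python
import math
from statistics import mean

NON_AMP_LOG_MIC = 4.0  # label assigned to non-AMP sequences (10^4 uM)


def ug_per_ml_to_um(conc_ug_per_ml, mol_weight_g_per_mol):
    """Convert a concentration from ug/mL to uM (assumed conversion)."""
    return conc_ug_per_ml / mol_weight_g_per_mol * 1000.0


def log_mic_label(mic_values_um):
    """log10 of the arithmetic mean of all MIC values (in uM) for one target."""
    return math.log10(mean(mic_values_um))
```

For example, a peptide of molecular weight 2000 g/mol with an MIC of 4 μg/mL corresponds to 2 μM, and MIC values of 10 μM and 1000 μM against the same target average to 505 μM, giving a label of about 2.70.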

Two methods for feature extraction and embedding of the datasets were compared: using the same feature extraction as XGBoost and utilizing a pre-trained protein language model, ESM2-t36-3B, for embedding⁶. The ESM-2 model, a transformer-based architecture, employs a masked language modeling (MLM) objective, where the model predicts the identity of masked amino acids in a sequence based on their surrounding context. Mathematically, the probability of a contact between two amino acids i and j, denoted as $p\left({c}_{{ij}}\right)$, is calculated using the formula:

$$p\left({c}_{{ij}}\right)={\left(1+\exp \left(-{{{{\rm{\beta }}}}}_{0}-{\sum }_{l=1}^{L}{\sum }_{k=1}^{K}{{{{\rm{\beta }}}}}_{{kl}}{a}_{{ij}}^{{kl}}\right)\right)}^{-1},$$

where ${{{{\rm{\beta }}}}}_{0}$ is a bias term, L is the number of layers, K is the number of attention heads, and ${a}_{{ij}}^{{kl}}$ represents the symmetrized and APC-corrected attention map values for the $k$th attention head in the $l$th layer. The model minimizes the perplexity, defined as:

$${{{\rm{Perplexity}}}}(x)=\exp \left(-\frac{1}{L}{\sum }_{i=1}^{L}\log p\left({x}_{i}|{x}_{\ne i}\right)\right)$$

where L is the length of the sequence, and $p\left({x}_{i}|{x}_{\ne i}\right)$ represents the conditional probability of the i-th amino acid given the rest of the sequence. This formulation enables the ESM-2 model to capture intricate structural dependencies within protein sequences, which are then embedded in a high-dimensional space where structurally and functionally similar proteins are closer together. The embedding method using the pre-trained protein language model ESM-2 demonstrated superior performance according to the Mean Squared Error (MSE) evaluations.
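The perplexity formula can be evaluated from the per-position probabilities assigned to the true residues. The sketch below takes those probabilities as given; obtaining them from ESM-2 is outside its scope, and the function name is hypothetical.

```python
import math


def perplexity(true_residue_probs):
    """Exponentiated mean negative log-probability of the true residues.

    `true_residue_probs[i]` is p(x_i | x_{!=i}), the probability the model
    assigns to the actual residue at position i.
    """
    length = len(true_residue_probs)
    mean_nll = -sum(math.log(p) for p in true_residue_probs) / length
    return math.exp(mean_nll)
```

A model that assigns a uniform probability of 1/20 to every residue has a perplexity of 20, and a model that predicts every residue with certainty has a perplexity of 1.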

In model training, separate regression models were trained on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) neural networks⁶³. Each model consisted of two LSTM layers with a hidden size of 128, one dropout layer with a dropout rate of 0.7, and a fully connected layer, and was trained with a batch size of 64 for 200 epochs. The datasets were divided into training, validation, and test sets with a split of 72:18:10, respectively. The LSTM architecture uses gates to manage information flow through the network. Specifically, at each time step t, the input gate ${i}_{t}$, forget gate ${f}_{t}$, and output gate ${o}_{t}$ were computed as follows:

$${i}_{t}={{{\rm{\sigma }}}}\left({W}_{i}{x}_{t}+{R}_{i}{y}_{t-1}+{p}_{i}\odot {c}_{t-1}+{b}_{i}\right),$$

$${f}_{t}={{{\rm{\sigma }}}}\left({W}_{f}{x}_{t}+{R}_{f}{y}_{t-1}+{p}_{f}\odot {c}_{t-1}+{b}_{f}\right),$$

$${o}_{t}={{{\rm{\sigma }}}}\left({W}_{o}{x}_{t}+{R}_{o}{y}_{t-1}+{p}_{o}\odot {c}_{t}+{b}_{o}\right).$$

The cell state ${c}_{t}$ was updated according to the equation:

$${c}_{t}={z}_{t}\odot {i}_{t}+{c}_{t-1}\odot {f}_{t},$$

where ${z}_{t}=\tanh \left({W}_{z}{x}_{t}+{R}_{z}{y}_{t-1}+{b}_{z}\right)$.

The final output ${y}_{t}$ was derived from:

$${y}_{t}={o}_{t}\odot \tanh \left({c}_{t}\right).$$
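The gate equations above can be traced in a single forward step. The sketch below uses a hidden size of 1, so that every weight is a scalar and the element-wise product reduces to ordinary multiplication; this simplification, the function names, and the parameter dictionary layout are assumptions made for readability, not the trained model.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def lstm_step(x_t, y_prev, c_prev, p):
    """One peephole LSTM step with scalar state (hidden size 1).

    `p` maps parameter names (W_*, R_*, p_*, b_*) to scalars.
    """
    z_t = math.tanh(p["W_z"] * x_t + p["R_z"] * y_prev + p["b_z"])
    i_t = sigmoid(p["W_i"] * x_t + p["R_i"] * y_prev + p["p_i"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_f"] * x_t + p["R_f"] * y_prev + p["p_f"] * c_prev + p["b_f"])
    c_t = z_t * i_t + c_prev * f_t
    # The output gate peeks at the updated cell state c_t, not c_{t-1}.
    o_t = sigmoid(p["W_o"] * x_t + p["R_o"] * y_prev + p["p_o"] * c_t + p["b_o"])
    y_t = o_t * math.tanh(c_t)
    return y_t, c_t
```

With all parameters set to zero, every gate equals 0.5 and the cell input is 0, so the cell state is simply halved at each step.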

During training, gradients were computed using Backpropagation Through Time (BPTT). The gradient for each gate and cell state was calculated to update the network weights. For example, the gradient of the loss with respect to the output gate ${{{{\rm{\delta }}}}}_{{o}_{t}}$ was computed as:

$${{{{\rm{\delta }}}}}_{{o}_{t}}={{{{\rm{\delta }}}}}_{{y}_{t}}\odot \tanh \left({c}_{t}\right)\odot {{{{\rm{\sigma }}}}}^{{\prime} }\left(\bar{{o}_{t}}\right),$$

where ${{{{\rm{\delta }}}}}_{{y}_{t}}$ is the gradient passed from the subsequent layer, and ${{{{\rm{\sigma }}}}}^{{\prime} }\left(\cdot \right)$ represents the derivative of the sigmoid function.

Similarly, the gradients for the cell state ${{{{\rm{\delta }}}}}_{{c}_{t}}$, forget gate ${{{{\rm{\delta }}}}}_{{f}_{t}}$, input gate ${{{{\rm{\delta }}}}}_{{i}_{t}}$, and cell input ${{{{\rm{\delta }}}}}_{{z}_{t}}$ were computed to propagate the error back through the network and update the weights accordingly. To prevent overfitting, a dropout layer with a dropout rate of 0.7 was incorporated after the LSTM layers. The output layer was a linear transformation, and the models were trained using L2 loss, optimizing with the Adam optimizer. This configuration, along with precise gradient calculations, effectively captured temporal dependencies in the sequence data, leading to accurate regression predictions.

Candidate AMP sequence selection for validation

Following the de novo sequence generation using the three approaches (sequence-based, MSA-based, and MSA-conditional), all the resulting peptide sequences were screened using the XGBoost-based discriminator to identify those predicted to be AMPs. The AMP-classified sequences were then evaluated using the scorers specific to E. coli and S. aureus, respectively. From the top 100 candidates for each target, 20 sequences were randomly selected, giving 40 sequences in total for wet-lab validation.
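The selection procedure for one target can be sketched as a filter, a ranking, and a random draw. The function and its arguments are hypothetical placeholders for the trained discriminator and scorer, and ranking by ascending predicted log MIC (lower meaning more potent) is an assumption about how "top" candidates were defined.

```python
import random


def select_candidates(sequences, is_amp, predicted_log_mic,
                      top_n=100, n_pick=20, seed=0):
    """Filter with the discriminator, rank with the scorer, sample at random.

    `is_amp` and `predicted_log_mic` are callables standing in for the
    XGBoost discriminator and the target-specific LSTM scorer.
    """
    amp_like = [seq for seq in sequences if is_amp(seq)]
    ranked = sorted(amp_like, key=predicted_log_mic)
    top = ranked[:top_n]
    return random.Random(seed).sample(top, min(n_pick, len(top)))
```

Running this once with the E. coli scorer and once with the S. aureus scorer would yield the 20 + 20 candidates described above.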

Measurement of minimum inhibitory concentration (MIC)

Chemical peptide synthesis. All peptides used in the experiments were purchased from GenScript and synthesized using their PepPower™ platform, which combines liquid-phase and solid-phase peptide synthesis techniques, and each peptide was confirmed to have a purity greater than 80% (Data S3). High-Performance Liquid Chromatography (HPLC) and Mass Spectrometry (MS) were used to determine the concentration of chemically synthesized AMPs. HPLC analysis was performed using an Inertsil ODS-SP (4.6 × 250 mm) column with a flow rate of 1 mL/min. The mobile phase consisted of (A) 0.065% trifluoroacetic acid (TFA) in 100% water and (B) 0.05% TFA in 100% acetonitrile. A gradient elution program was employed as follows: 0–25 min (5–65% B), 25–27 min (65–95% B), 27–35 min (5% B). The detection wavelength was set at 220 nm. The MS analysis was further performed using an electrospray ionization (ESI) interface on an Agilent 6200 time-of-flight (TOF) LC/MS system equipped with a UV detector. Finally, 38 of the 40 candidate sequences were successfully chemically synthesized.

The MIC of the designed AMPs was determined using the broth microdilution method according to established protocols⁶⁴. The bacterial strains used in this study included E. coli K88, S. aureus (ATCC 29213), E. coli (ATCC 25922), E. faecalis (ATCC 29212), S. aureus (ATCC 25923), and P. aeruginosa (ATCC 27853). The synthesized peptides were dissolved in 1% DMSO at a concentration of 4.5 mg/mL for storage (10×). For the MIC assay, the peptide solutions were subjected to two-fold serial dilutions across columns 1-10 of sterile 96-well plates, with columns 11 and 12 serving as controls containing Mueller-Hinton Broth (MHB). Each dilution (50 µL) was pipetted into the wells before the addition of bacterial suspension. The plates were kept covered when not in use to prevent contamination.

Bacterial cultures were grown in MHB with shaking at 37 °C overnight, then diluted to 1.0 × 10⁶ CFU/mL with MHB. A 50 µL aliquot of each bacterial suspension was added to columns 1–11 of the 96-well plates, which contained the peptide solutions, while column 12 contained only MHB. The plates were incubated at 37 °C for 16–20 h. The MIC assays define the AMP concentration that precludes growth after incubation. All assays were performed in triplicate to ensure statistical reliability.

Assessment of cytotoxicity

A cytotoxicity assay was conducted on IEC-6 intestinal epithelial cells (ATCC, #CRL-1592™) using the CellTiter 96® AQueous One Solution Cell Proliferation Assay (Promega, #G3580). Cells were cultured in media containing DMEM high glucose (BBI, #E600003-0500), 10% fetal bovine serum (FBS, Pricella, #164210-50), and 10% antibiotic mix for cell culture (Solarbio, #P1400). Once the cells reached a surface coverage density of 50-80%, they were harvested, diluted to a density of 5×10³ cells per 100 µL, and seeded into 96-well plates for 24 h.

Following this incubation, the cells were exposed to a series of peptide concentrations, prepared by two-fold serial dilutions starting from 1125 μg/mL. A 4 µL aliquot of each dilution was added to the wells, and the plates were incubated at 37 °C with 5% CO₂ for another 24 h. Subsequently, 20 µL of CellTiter 96® AQueous One Solution was added to each well, and after a 90-min incubation at 37 °C, absorbance was measured at 490 nm, with corrections made using wells containing only medium. CC50 values, representing the concentration of each AMP that kills 50% of the cells, were determined by fitting a 4-parameter logistic model (4PL) using Python’s scipy library.

Assessment of hemolytic activity

Rat erythrocytes (Sbjbio, #SBJ-RBC-RAT004) were washed with Phosphate Buffer (PBS, Pricella, #PB180327) and resuspended to approximately 2% in PBS. Antimicrobial peptides (AMPs) were initially prepared at 1125 μg/mL and then serially diluted in 1.5 mL Eppendorf tubes. Each 20 µL AMP dilution was combined with 180 µL of the 2% erythrocyte suspension. After sealing, the mixtures were incubated at 37 °C for 1 h. Following centrifugation at 1000×g for 5 min at room temperature, 50 µL of the supernatant was transferred to a 96-well plate, where absorbance was measured at 540 nm. Triton X-100-treated erythrocytes served as a positive control, while PBS-treated erythrocytes acted as a negative control. The HC50 values, representing the concentration of AMP required to lyse 50% of the RBCs, were determined by fitting a 4-parameter logistic model (4PL) using Python’s scipy library.
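The 4PL model used for the CC50 and HC50 estimates can be written out explicitly. The parameterization below is one common form and is an assumption, since the text does not give the exact formula; the function names are hypothetical. In practice the four parameters would be estimated from the dose-response data, for example with `scipy.optimize.curve_fit`, which is not shown here to keep the sketch dependency-free.

```python
def four_pl(x, a, b, c, d):
    """4-parameter logistic curve.

    a: response as the concentration approaches zero
    d: response at very high concentration
    c: inflection point (concentration at the half-way response)
    b: Hill slope
    """
    return d + (a - d) / (1.0 + (x / c) ** b)


def concentration_at(y, a, b, c, d):
    """Invert the 4PL curve: concentration giving response y."""
    return c * ((a - d) / (y - d) - 1.0) ** (1.0 / b)
```

When the fitted curve runs from 100% to 0%, the 50% point coincides with the parameter c; otherwise `concentration_at(50, ...)` gives the concentration at which the fitted response crosses 50%.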

Statistics and reproducibility

The MIC determination, cytotoxicity assays, and hemolytic activity tests were performed in three independent biological replicates. The source data are provided in the Supplementary Information or Supplementary Data. Statistical analyses were conducted using GraphPad Prism 9, and significance was determined using one-way ANOVA or other methods as indicated. Reproducibility was confirmed across independent experiments, and representative results are shown.

Mechanism of action analysis using propidium iodide

E. coli K88 in the exponential phase was harvested, centrifuged, resuspended in PBS, and adjusted to an OD of 1. A 10 µL aliquot of the bacterial suspension was mixed with 10 µL of AMPs at a final concentration of 4 × MIC and incubated at 37 °C for 1 h. To both AMP-treated and untreated samples, 20 μM of propidium iodide (PI, Solarbio, #C0080) was added and incubated in the dark at 37 °C for 30 min. Fluorescence was recorded with an excitation at 535 nm and emission at 615 nm using a Tecan Infinite® Eplex plate reader. Additionally, 1 µL of the reaction mixture was applied to a slide and imaged using a Leica DM4B upright microscope equipped with a 100x semi-apochromatic objective.

Analysis of peptide sequence properties

The online BLASTp tool was used to search AMP sequences against the non-redundant protein sequence (nr) database (https://blast.ncbi.nlm.nih.gov). The following parameters were used: word size of 6, expect threshold of 10, PAM30 matrix, gap initiation penalty of 9 and gap extension penalty of 1, conditional compositional score matrix adjustment, and the low-complexity region filter. Hits were retained up to an E value of 10. Physicochemical properties of peptide sequences were inferred using the Peptides package in R (https://CRAN.R-project.org/package=Peptides, v2.4.6). The annotation of peptide sequences was performed using the Entrez module of the Biopython package (http://biopython.org). To assess the sequence similarity and diversity of the designed AMPs, we performed pairwise sequence alignment between the generated sequences and those reported in the literature. Sequence alignment was carried out using the pairwise2 module of the Biopython package, which implements global and local pairwise alignment algorithms, with alignment scoring based on the BLOSUM62 substitution matrix.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The datasets (AMP dataset, nonAMP dataset, and AMP MIC dataset) generated and/or analyzed during the current study are available in the Zenodo repository⁶⁵. Source data behind the graphs in the paper can be found in Supplementary Data 9–12. The supplementary data is provided in the form of tables to facilitate detailed analysis.

Code availability

The code used in this study is openly available in the Zenodo repository⁶⁵. This repository includes all scripts and software necessary to replicate the experiments, data analysis, and trained models used in this work. Detailed instructions for installation, usage, and reproduction of the results are also provided in the repository’s README file.

References

Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Alley, E. C. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Brandes, N. et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110 (2022).
Shimizu, K. et al. De novo design of a nanopore for single-molecule detection that incorporates a β-hairpin peptide. Nat. Nanotechnol. 17, 67–75 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Ingraham, J. B. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).
Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).
Notin, P. et al. Machine learning for functional protein design. Nat. Biotechnol. 42, 216–228 (2024).
Holehouse, A. S. & Kragelund, B. B. The molecular basis for cellular function of intrinsically disordered protein regions. Nat. Rev. Mol. Cell. Biol. 25, 187–211 (2024).
Cornishbowden, A. Nomenclature and symbolism for amino acids and peptides. Eur. J. Biochem. 138, 9–37 (1984).
Apostolopoulos, V. et al. A global review on short peptides: frontiers and perspectives. Molecules 26, 430 (2021).
Mookherjee, N., Anderson, M. A., Haagsman, H. P. & Davidson, D. J. Antimicrobial host defence peptides: functions and clinical potential. Nat. Rev. Drug Discov. 19, 311–332 (2020).
Lazzaro, B. P., Zasloff, M. & Rolff, J. Antimicrobial peptides: Application informed by evolution. Science 368, eaau5480 (2020).
Mishra, B., Reiling, S., Zarena, D. & Wang, G. Host defense antimicrobial peptides as antibiotics: design and application strategies. Curr. Opin. Chem. Biol. 38, 87–96 (2017).
Ma, Y. et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921–931 (2022).
Jan, A. et al. Target-AMP: Computational prediction of antimicrobial peptides by coupling sequential information with evolutionary profile. Comput. Biol. Med. 151, 106311 (2022).
Veltri, D., Kamath, U. & Shehu, A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740–2747 (2018).
Li, C. et al. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genom. 23, 7 (2022).
Tucs, A. et al. Generating ampicillin-level antimicrobial peptides with activity-aware generative adversarial networks. ACS Omega 5, 22847–22851 (2020).
Zhang, L. J. & Gallo, R. L. Antimicrobial peptides. Curr. Biol. 26, R14–R19 (2016).
Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinforma. 20, 473 (2019).
Tan, P., Fu, H. & Ma, X. Design, optimization, and nanotechnology of antimicrobial peptides: from exploration to applications. Nano Today 39, 101229 (2021).
Hoogeboom, E. et al. Autoregressive diffusion models. Preprint at https://arxiv.org/abs/2110.02037 (2022).
Yang, K. K., Fusi, N. & Lu, A. X. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst. 15, 286–294.e2 (2024).
Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).
Zuo, Y. et al. PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics 33, 122–124 (2017).
Chou, K. C. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 278, 477–483 (2000).
Chen, Z. et al. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PLoS One 6, e22930 (2011).
Hancock, R. E. & Diamond, G. The role of cationic antimicrobial peptides in innate host defences. Trends Microbiol. 8, 402–410 (2000).
Rey, J., Murail, S., de Vries, S., Derreumaux, P. & Tuffery, P. PEP-FOLD4: a pH-dependent force field for peptide structure prediction in aqueous solution. Nucleic Acids Res. 51, W432–W437 (2023).
Li, T. et al. A foundation model identifies broad-spectrum antimicrobial peptides against drug-resistant bacterial infection. Nat. Commun. 15, 7538 (2024).
Wang, X. et al. ProT-Diff: A modularized and efficient strategy for de novo generation of antimicrobial peptide sequences by integrating protein language and diffusion models. Adv. Sci. 11, 2406305 (2024).
Szymczak, P. et al. Discovering highly potent antimicrobial peptides with deep generative model HydrAMP. Nat. Commun. 14, 1453 (2023).
Van Oort, C. M., Ferrell, J. B., Remington, J. M., Wshah, S. & Li, J. AMPGAN v2: Machine learning-guided design of antimicrobial peptides. J. Chem. Inf. Model. 61, 2198–2207 (2021).
Dean, S. N., Alvarez, J. A. E., Zabetakis, D., Walper, S. A. & Malanoski, A. P. PepVAE: Variational autoencoder framework for antimicrobial peptide generation and activity prediction. Front. Microbiol. 12, 725727 (2021).
Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623 (2021).
Huang, J. et al. Identification of potent antimicrobial peptides via a machine-learning pipeline that mines the entire space of peptide sequences. Nat. Biomed. Eng. 7, 797–810 (2023).
Pandi, A. et al. Cell-free biosynthesis combined with deep learning accelerates de novo-development of antimicrobial peptides. Nat. Commun. 14, 7197 (2023).
Zhao, L. et al. Protein A-like peptide design based on diffusion and ESM2 Models. Molecules 29, 4965 (2024).
Wang, R. et al. Diff-AMP: tailored designed antimicrobial peptide framework with all-in-one generation, identification, prediction and optimization. Brief. Bioinform. 25, bbae078 (2024).
Mao, J. et al. Application of a deep generative model produces novel and diverse functional peptides against microbial resistance. Comput. Struct. Biotechnol. J. 21, 463–471 (2022).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2006.11239 (2020).
Crick, F. Central dogma of molecular biology. Nature 227, 561–563 (1970).
Viola, P. & Jones, M. Rapid object detection using a boosted cascade of simple features. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2001).
Sinitcyn, P. et al. MaxDIA enables library-based and library-free data-independent acquisition proteomics. Nat. Biotechnol. 39, 1563–1573 (2021).
Wang, G., Li, X. & Wang, Z. APD2: the updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res. 37, 933–937 (2009).
Novković, M. et al. DADP: the database of anuran defense peptides. Bioinformatics 28, 1406–1407 (2012).
Pirtskhalava, M. et al. DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 49, 288–297 (2021).
Shi, G. et al. DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Res. 50, 488–496 (2022).
Piotto, S. P., Sessa, L., Concilio, S. & Iannelli, P. YADAMP: yet another database of antimicrobial peptides. Int. J. Antimicrob. Agents 39, 346–351 (2012).
Jhong, J. H. et al. dbAMP 2.0: updated resource for antimicrobial peptides with an enhanced scanning method for genomic and proteomic data. Nucleic Acids Res. 50, 460–470 (2022).
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, 480–489 (2021).
Alamdari, S. et al. Protein generation with evolutionary diffusion: sequence is all you need. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1 (2023).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Kalchbrenner, N. et al. Neural machine translation in linear time. Preprint at https://arxiv.org/abs/1610.10099 (2017).
Ahdritz, G. et al. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods 21, 1514–1524 (2024).
Rao, R. et al. MSA transformer. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3 (2021).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, 170–176 (2017).
Yan, J. et al. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 8, e0034523 (2023).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).
Greff, K. et al. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28, 2222–2232 (2017).
Wiegand, I., Hilpert, K. & Hancock, R. E. Agar and broth dilution methods to determine the minimal inhibitory concentration (MIC) of antimicrobial substances. Nat. Protoc. 3, 163–175 (2008).
Zeng, Z. & Xiong, X. AMPGen v1.0: First Stable Release of the AMP Generation Pipeline (v1.0.0) https://doi.org/10.5281/zenodo.15454482.7433980 (2025).

Acknowledgements

This work was supported by the National Science and Technology Resources Service Platform project (PT-2024-01), the National Key R&D Program of China (2021YFA0804702), and Zhejiang Lab & Shanghai Artificial Intelligence Laboratory (K2023KA1BB01, K2022KA1BB01 & 2022KA0PI01).

Author information

These authors contributed equally: Shuwen Jin, Zihan Zeng, Xiyan Xiong.

Authors and Affiliations

Zhejiang Lab, Hangzhou, 311121, China
Shuwen Jin, Zihan Zeng, Xiyan Xiong, Baicheng Huang, Hongsheng Wang & Xiao Ma
Polytechnic Institute, Zhejiang University, Hangzhou, 310015, China
Zihan Zeng
Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China
Li Tang
Xianghu Laboratory, Zhejiang Academy of Agricultural Sciences, Hangzhou, 310021, China
Xiaochun Tang, Guoqing Shao & Feng Lin
Laboratory of Pancreatic Disease, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, China
Xingxu Huang
Fuyao University of Science and Technology, Fuzhou, 350100, China
Feng Lin

Contributions

S.W.J., Z.H.Z. and X.Y.X. drafted the manuscript, conducted data analysis, and developed the model pipeline and screening. B.C.H. assisted with writing and data analysis. Z.H.Z. led the wet-lab experiments, and L.T. assisted with them. H.S.W. and X.M. provided technical support. G.Q.S. and X.C.T. assisted with the literature review and provided feedback on the manuscript. F.L. and X.X.H. conceived the study and supervised its execution. All authors reviewed and revised the manuscript.

Corresponding authors

Correspondence to Xingxu Huang or Feng Lin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Tobias Goris.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (Fig. S1-10 & Table S1-4)

Supplementary Data 1-12

Description of Additional Supplementary Files

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Jin, S., Zeng, Z., Xiong, X. et al. AMPGen: an evolutionary information-reserved and diffusion-driven generative model for de novo design of antimicrobial peptides. Commun Biol 8, 839 (2025). https://doi.org/10.1038/s42003-025-08282-7

Received: 04 November 2024
Accepted: 23 May 2025
Published: 30 May 2025
DOI: https://doi.org/10.1038/s42003-025-08282-7