Introduction

AI-driven protein design enables the generation of protein sequences that are not found in nature by leveraging deep learning, generative models, and evolutionary principles. This progress in protein design models is poised to revolutionize biomedical technologies, significantly accelerating advancements in the medical healthcare system. Currently, protein generation is predominantly achieved through two types of large protein models, which are trained on an entire protein library. The first type is sequence-based models, which aim to capture the biochemical constraints that characterize the proteins within the training set derived from extensive sequence data, with ProGen being a notable example1,2,3,4,5. The second is structure-based models, which aim to align the sequence-structure-function relationship of proteins, as represented by models like RoseTTAFold6,7,8,9. These protein design models have already demonstrated remarkable success in various applications, including enzyme design, optimization, and antibody engineering10.

Despite these successes with proteins that have stable secondary and tertiary structures, few have been reported for the significant subset of proteins that lack such stable structures, especially peptides and intrinsically disordered proteins. Indeed, modeling short and flexible peptides challenges the conventional sequence-structure-function paradigm11, though the distinction between peptides and proteins is neither strict nor universally accepted. According to the International Union of Pure and Applied Chemistry (IUPAC), oligopeptides typically contain fewer than 10–20 amino acids, while polypeptides consist of more than 20 residues. Proteins, on the other hand, are generally defined as polypeptides with more than approximately 50 amino acids12. Some researchers, however, define short peptides as those containing no more than 45 amino acids13.

Peptides, including signaling peptides, peptide hormones, neuropeptides, therapeutic peptides, and antimicrobial peptides (AMPs), are prevalent across all life forms and perform critical biological roles despite their lack of stable structure. However, the current generation of protein design models, which often rely on structural information or focus on generating protein backbones, are limited in their ability to effectively address the unique characteristics of structurally unstable peptides. To address this challenge, we developed a model specifically for generating functional protein sequences without defined tertiary structures, using AMPs as a case study.

AMPs are a class of peptides, typically composed of 12–50 amino acids, that can kill bacteria, viruses, and fungi by disrupting biofilms or forming transmembrane channels. Unlike structured proteins, AMPs are inherently disordered and lack a defined tertiary structure, exhibiting a high degree of plasticity14. AMPs have demonstrated high biocompatibility, a broad antimicrobial spectrum, and do not induce drug resistance, making them promising candidates for clinical translation as new therapeutic agents15. While traditional bioinformatics approaches have shown promise in identifying AMPs from genomic databases16,17,18,19,20, these methods are constrained by the limitations of existing databases. Although several studies on protein generation models have shown potential in creating uncharacterized functional AMPs, these models often fall short in effectiveness because they are not adapted to the structurally flexible nature of peptides21.

In this work, we propose a generative model, AMPGen, for the de novo design of target-specific AMP sequences. It comprises a generator, a discriminator, and a scorer, augmented by necessary biochemical knowledge-based screening. The generator leverages an order-agnostic autoregressive diffusion model that is pre-trained on the OpenFold database (https://registry.opendata.aws/openfold/), incorporating an axial attention mechanism to capture protein evolutionary information in multiple sequence alignment (MSA) format within the latent space. An AMP-MSA dataset is constructed and used as input to enhance the model’s success rate. Compared to baseline models, incorporating evolutionary information enhances the model’s learning capability. Considering both the synthesis cost and potential applications, we define the length of the generated sequences to range from 15 to 35 amino acids. The generated sequences are then sequentially filtered based on their physicochemical properties and evaluated with an XGBoost-based discriminator. Finally, target-specific scoring is conducted using an LSTM-based scorer, ultimately yielding the final AMP candidates. Experimental validation demonstrates that AMPGen possesses a distinctive and highly efficient capability for AMP generation, achieving an 81.58% positive rate, and producing AMPs that, to the best of our knowledge, have not been previously reported in existing protein databases.

Results

The architecture for de novo AMP design

To address the unique challenges posed by the short length, high diversity and inherent flexibility of short peptide sequences22, the AMPGen architecture possesses several key innovations (Fig. 1). Central to AMPGen is a cascade model consisting of a generator, a discriminator and a scorer, each contributing distinct dimensions of AMP-specific information to enhance the learning process and provide a comprehensive understanding of AMP characteristics. The generator is initially trained on a large, universal protein database to learn the fundamental patterns of protein sequences. To refine the model’s focus on AMP, we employed a dataset enriched with AMP evolutionary information as input and incorporated an axial attention mechanism to enhance the model’s learning capabilities. The generated sequences are subsequently filtered using a discriminator based on a binary XGBoost classifier, followed by target-specific scoring via an LSTM regression model. The discriminator and scorer employ masking techniques for feature extraction and a language model embedding approach, respectively. Their combined application enhances the performance of AMPGen. Furthermore, to accommodate the structurally flexible nature of AMPs, the generator is specifically designed to rely solely on one-dimensional sequence data without incorporating structural data. The order-independent diffusion model is employed as the generator due to its ability to produce a wider diversity of results.

Fig. 1: Overview of AMPGen for de novo AMP sequence design.

a Candidate peptide sequences are initially generated using a diffusion model pre-trained on the OpenFold database, with the AMP-multiple sequence alignment (MSA) dataset as a condition, referred to as conditional generation (MSA-conditional, indicated by red arrows). The model adopted a 100 M parameter MSA Transformer architecture. Baseline comparisons were made with sequences generated without the condition, referred to as unconditional generation (MSA-based, indicated by green arrows), using the same architecture, as well as with a separate model trained on Uniref50 adopting a ByteNet-style CNN architecture (Seq-based, indicated by blue arrows). The generated sequences are constrained to 15–35 amino acids in length to ensure appropriate AMP size and to manage synthesis costs. b After cleaning and filtering the initial generated sequences, a binary XGBoost-based discriminator is developed to determine whether they qualify as AMPs. This discriminator is trained on an AMP dataset and a negative dataset of non-AMP sequences, using a combination of feature extraction methods (PseKRAAC and QSOrder) as the embedding. c Sequences identified as AMPs are then subjected to target-specific scoring using a long short-term memory (LSTM) network. Deploying an ESM-2 embedding strategy, this scorer is trained on an AMP dataset with minimum inhibitory concentration (MIC) values.

First, an AMP-MSA dataset containing evolutionary information is constructed as the input of the model (MSA-conditional generation). For each sequence in the AMP dataset, we generated an MSA by searching the UniClust30 database with HHblits23. Aside from this, the only required input for sequence generation by the model is the specified length range of the desired sequence. Considering the known properties of AMPs and the cost of synthesis, we define the length of the generated sequences to be between 15 and 35 aa24. To evaluate the effectiveness of our approach, we employed two other generation methods as baselines for comparison: a generation method based solely on protein sequences (seq-based generation), and a method based on an evolutionary-scale dataset of MSAs without model input (MSA-based generation). AMPGen employs an order-agnostic autoregressive diffusion model, pre-trained on the entire protein sequence database, for the generation of short peptide sequences25, while the seq-based generation model adopts a ByteNet-style CNN architecture26. The diffusion model is a generative simulation technique that has demonstrated success in both image and text generation. These models are capable of producing highly diverse outputs and can be conditioned on input data, making them well-suited for the generative modeling of peptides27. By comparing our model with the baselines, we can directly assess the impact of incorporating sequence evolutionary information and conditional datasets on enhancing the model’s ability to design sequences.
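
The order-agnostic decoding idea described above can be illustrated with a toy sampler. This is only a sketch under stated assumptions: `toy_predictor` is a hypothetical stand-in that samples residues uniformly and ignores context, whereas the actual generator is a trained network conditioned on an MSA. The names `MASK`, `toy_predictor`, and `sample_order_agnostic` are illustrative and do not come from the AMPGen code.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # marks a position that has not been decoded yet


def toy_predictor(partial, position, rng):
    """Hypothetical stand-in for the trained model.

    A real model would condition on the already-decoded residues in
    `partial` (and on the MSA); this toy version samples uniformly.
    """
    return rng.choice(AMINO_ACIDS)


def sample_order_agnostic(min_len=15, max_len=35, predictor=toy_predictor, seed=None):
    """Decode a sequence one position at a time in a random order."""
    rng = random.Random(seed)
    length = rng.randint(min_len, max_len)  # only the length range is specified
    sequence = [MASK] * length              # start from a fully masked sequence
    order = list(range(length))
    rng.shuffle(order)                      # decoding order is not left-to-right
    for position in order:
        sequence[position] = predictor(sequence, position, rng)
    return "".join(sequence)


if __name__ == "__main__":
    print(sample_order_agnostic(seed=0))
```

Swapping `toy_predictor` for a context-aware model is the only change needed to turn the loop into a meaningful sampler; the random decoding order is what distinguishes this scheme from left-to-right autoregression.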

After generating millions of initial candidate sequences, we filtered out those containing ambiguous amino acids (indicated by U, O, B, Z, J, or X), resulting in a set of clean sequences. To further refine the dataset, we retained only sequences with a net positive charge (net charge >0 at pH 7) and a hydrophobic amino acid proportion between 40% and 70%24. These physicochemical criteria are characteristic of AMPs and are crucial for their activity. An XGBoost-based discriminator is then built to determine whether each candidate sequence is an AMP. The discriminator employs an embedding approach utilizing various feature extraction methods28,29,30. For sequences classified as AMPs, we trained an LSTM regression model to predict their minimum inhibitory concentration (MIC) values against target species. Specifically, Gram-negative Escherichia coli and Gram-positive Staphylococcus aureus were selected as target species for subsequent wet-lab validation. The LSTM regression model utilizes a protein language model-based embedding technique, specifically ESM2-t36-3B6. ESM-2 is a transformer-based model designed to capture complex relationships across protein sequences. The embeddings generated by ESM-2 represent features for each residue, enabling the LSTM to learn dependencies and patterns over the sequence. Regarding model performance, the XGBoost discriminator achieved an F1 score of 0.96, an accuracy of 0.96, and a recall of 0.95. Model performance was assessed using ten-fold cross-validation and ROC analysis, yielding an average area under the curve (AUC) of 0.99 (Fig. S1). The LSTM model for predicting MIC values achieved an R-squared value of 0.89 on the validation set for E. coli and 0.86 for S. aureus (Fig. S2).
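
The cleaning and physicochemical screening steps described above can be sketched as follows. Two points are assumptions rather than details given in the text: the net charge is approximated by counting K/R as +1 and D/E as -1 (histidine and the termini are ignored), and the hydrophobic residue set `HYDROPHOBIC` is one common choice. The function names are illustrative.

```python
AMBIGUOUS = set("UOBZJX")        # ambiguous or non-standard codes named in the text
HYDROPHOBIC = set("AVILMFWC")    # assumption: the source does not list the set used


def is_clean(seq):
    """True if the sequence contains no ambiguous amino acid codes."""
    return not (set(seq) & AMBIGUOUS)


def approx_net_charge(seq):
    """Crude net charge near pH 7: K/R count +1, D/E count -1.

    Assumption: histidine and terminal groups are ignored, so this is
    only an integer approximation of a proper pKa-based calculation.
    """
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def hydrophobic_fraction(seq):
    """Fraction of residues that belong to the hydrophobic set."""
    return sum(1 for aa in seq if aa in HYDROPHOBIC) / len(seq)


def passes_filters(seq):
    """Apply the cleaning step, then the charge and hydrophobicity criteria."""
    if not seq or not is_clean(seq):
        return False
    return approx_net_charge(seq) > 0 and 0.40 <= hydrophobic_fraction(seq) <= 0.70


if __name__ == "__main__":
    for candidate in ["KKLLKLLKKLLKLLK", "KKLLXLLKK", "DDEELLLLLL"]:
        print(candidate, passes_filters(candidate))
```

A production filter would more likely compute the charge from residue pKa values; the integer approximation is used here only to keep the sketch dependency-free.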

High-throughput generation of candidate AMPs

In the de novo design of AMPs, we initially generated a total of 70,000 raw sequences using the MSA-conditional generation method. For baseline comparisons, we also generated 70,000 sequences using the MSA-based generation and 50,000 sequences using the seq-based generation. After filtering out sequences containing ambiguous amino acids, we obtained 59,944 clean sequences from the MSA-conditional generation method, 47,511 from the MSA-based generation and 49,999 from the seq-based generation method (Fig. 2a–c, Table S1). It is important to note that the number of generated sequences was determined by the experimental setup. All subsequent comparisons between the generation strategies and baseline models are therefore based on relative ratios of sequences.

Fig. 2: General features of generated AMP candidate sequences.

The inverted pyramid diagram illustrates the process of refining the initial candidate peptide sequences into AMP candidates for the a seq-based, b MSA-based, and c MSA-conditional groups. The ‘Clean sequence’ denotes the initial pool of sequences after ambiguous amino acids have been removed, with subsequent steps showing the percentage of sequences remaining after each conditional screening. The ‘Physical properties’ refer to sequences filtered by physical properties (a net charge greater than 0 at pH 7 and a hydrophobic amino acid ratio between 40% and 70%). The ‘XGBoost’ refers to sequences identified as AMPs by the XGBoost-based discriminator. The accompanying Venn diagram shows AMP candidates with predicted activity against Escherichia coli and Staphylococcus aureus, defined by a predicted MIC value of less than 5 μM using the LSTM-based scorer. d The proportion of each amino acid in the AMP candidates that passed the discriminator. The distribution of physicochemical properties for candidate AMP sequences includes e sequence length, f net charge, g isoelectric point, h hydrophobic moment, i Boman index, and j instability index for sequences classified as AMPs by the discriminator. The MSA-conditional model consistently generates a higher number of sequences with diverse physicochemical properties compared to the Seq-based and MSA-based models, indicating enhanced variability and potential functional diversity in the MSA-conditional generated peptides. Panels show violin plots depicting the distribution of predicted minimum inhibitory concentration (MIC) values (μM, log10-transformed) against E. coli (k) and S. aureus (l). The number of sequences evaluated for each model is indicated at the bottom of each violin plot. The MSA-conditional model generated sequences with significantly lower MIC values, indicating higher antibacterial activity against both E. coli and S. aureus compared to the Seq-based and MSA-based models. Statistical significance is indicated by asterisks (***), representing P < 0.001 (Kruskal-Wallis multiple comparison, with P-values adjusted using the Benjamini-Hochberg method).

To assess whether the generator produced short peptide sequences with AMP-like characteristics, we analyzed the physicochemical properties and amino acid composition of the clean sequences. All three groups of generated sequences exhibited positive charges and high isoelectric points, similar to validated AMPs in public databases. However, these properties did not significantly distinguish them from non-AMP sequences (Fig. 2e–j, Data S1). Regarding amino acid composition, AMPs are reported to contain higher proportions of positively charged amino acids such as lysine, arginine, and histidine, as well as hydrophobic amino acids, which form the structural basis for their biological activities and antibacterial effects31. Our results indicated that the AMP dataset had higher proportions of lysine (16.25% ± 13.4% in AMPs vs. 8.41% ± 6.9% in non-AMPs) and leucine (12.54% ± 12.4% in AMPs vs. 8.73% ± 6.2% in non-AMPs), with fold changes of 1.93 and 1.44, respectively (Fig. 2d, Data S1). The model successfully learned these characteristics from AMP sequences and their evolutionary information, resulting in generated sequences with a lysine proportion of 12.07% ± 8% and a leucine proportion of 9.65% ± 6.9% (Data S1).

Evolutionary information and conditional input reserved AMPs

To assess whether incorporating evolutionary information and conditional input in the generator enhanced its ability to generate functional peptides, we compared the MSA-conditional generation with two baselines. Sequences that passed the discriminator were considered AMP candidates. From the MSA-conditional generation, we obtained a total of 28,439 AMP candidates, while the MSA-based generation yielded 7,608 AMP candidates, and the seq-based generation produced 3,396 AMP candidates. These figures correspond to 47.44%, 16.01%, and 6.79% of the clean sequences in each group, respectively (Fig. 2a). The MSA-conditional generation approach demonstrated a higher predicted success rate than the baselines that rely solely on sequence databases or lack an AMP-MSA database as a condition. This result indicates that the generative model has effectively learned and incorporated the evolutionary information encoded within the AMP-MSA dataset, improving its capability to design functional AMPs. Note that this conclusion is derived from computed physical property screening and predictions of the XGBoost classifier rather than from direct experimental validation; it nonetheless provides a degree of explanatory power.
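
The reported pass rates can be checked directly from the counts given in the text (clean sequences and AMP candidates per generation method). The short calculation below uses only those published numbers; the dictionary layout and the helper name `pass_rate_percent` are illustrative.

```python
# Counts taken from the text: clean sequences and sequences passing the discriminator.
counts = {
    "MSA-conditional": {"clean": 59_944, "amp_candidates": 28_439},
    "MSA-based": {"clean": 47_511, "amp_candidates": 7_608},
    "seq-based": {"clean": 49_999, "amp_candidates": 3_396},
}


def pass_rate_percent(passed, total):
    """Percentage of `total` sequences that passed, rounded to two decimals."""
    return round(100 * passed / total, 2)


rates = {
    name: pass_rate_percent(c["amp_candidates"], c["clean"])
    for name, c in counts.items()
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate}%")
```

Because the three methods started from different numbers of clean sequences, comparing these ratios rather than the raw counts is what makes the comparison fair.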

AMPGen is designed to generate AMP candidates against specific antimicrobial targets. The commonly used Gram-negative target Escherichia coli and Gram-positive target Staphylococcus aureus were selected as representative species. The LSTM-based scorer was employed to rank all sequences previously identified as AMPs based on their MIC values. Using a threshold MIC value of less than 5 μM, the scorer determined that 3.88% of the sequences generated by the MSA-conditional method were potent against E. coli and 2.15% were potent against S. aureus (Fig. 2a). This generation success rate was higher than the MSA-based generation baseline, which yielded 0.32% anti-E. coli AMPs and 0.2% anti-S. aureus AMPs. The seq-based generation method produced even lower pass rates, with only 0.04% anti-E. coli AMPs and 0.01% anti-S. aureus AMPs. An analysis of the physicochemical properties of the AMP candidates (Figs. S3–S5) revealed that the hydrophobicity of sequences generated by the MSA-conditional method was generally higher (mostly distributed at values greater than 0) compared to those generated by the seq-based and MSA-based methods (mostly distributed at values less than 0). Interestingly, although the raw sequences across all generation methods were uniformly distributed in length between 15 and 35 amino acids, the AMP candidates generated by the seq-based and MSA-based methods predominantly clustered below 20 amino acids. In contrast, the AMP candidates generated by the MSA-conditional method tended to be longer, with the majority exceeding 20 amino acids (Fig. 3b, Data S1).

Fig. 3: Validation of AMP candidates.

a The AlphaFold 3 predicted structure and net charge (at pH 7) of the 38 validated AMPs. AMP-1 to AMP-20 were randomly selected from the top 100 candidates targeting S. aureus, and AMP-21 to AMP-40 were randomly selected from the top 100 candidates targeting E. coli. Among them, the chemical synthesis of AMP-5 and AMP-18 failed. The color of the ellipse in the lower right corner of each 3D structure diagram represents the minimum inhibitory concentration (MIC) value against E. coli K88 (left) and S. aureus ATCC 29213 (right). b Comparison of the antimicrobial activity of AMP candidates against Gram-positive and Gram-negative bacteria, shown as log10-transformed MIC values (μM) against Gram-positive S. aureus ATCC 29213 (y-axis) and Gram-negative E. coli K88 (x-axis). Each dot represents a different AMP, color-coded by its designed target: yellow-green for AMPs with a top 100 score against Gram-positive bacteria (G+) and green for those with a top 100 score against Gram-negative bacteria (G-). The control antibiotics Polymyxin B and Ampicillin are highlighted in red. AMPs clustered near the origin show broad-spectrum efficacy, while those positioned at the extremes indicate selective efficacy against either Gram-positive or Gram-negative bacteria. c Nine AMPs with MIC values less than 5 μM for both targets in a were selected for further determination of MIC against E. coli (ATCC 25922), P. aeruginosa (ATCC 27853), S. aureus (ATCC 25923), and E. faecalis (ATCC 29212). The figure shows the MIC values of the nine AMPs against each target. Ampicillin and Polymyxin B were used as positive controls. The MIC was determined by averaging the results from triplicate assays across three independent experiments (n = 3).

To validate the antibacterial activity of the AMPs designed by AMPGen, 20 sequences were randomly selected from the top 100 candidates targeting E. coli and S. aureus, respectively (Data S2). These sequences were chemically synthesized and subsequently subjected to antibacterial performance assays.

A Wet-lab validation protocol

To confirm the antibacterial activity of the AMP sequences generated by AMPGen, wet-lab antibacterial assays were conducted for experimental validation. Out of the 40 selected AMP candidates, 38 were successfully chemically synthesized (18 targeting S. aureus and 20 targeting E. coli), resulting in a 95% synthesis success rate (Data S3 and Table S2). We determined the MIC values of the synthetic AMPs against the common pathogens E. coli (K88) and S. aureus (ATCC 29213), using Ampicillin and Polymyxin B as positive controls. Of the 38 synthesized candidates, 31 exhibited antibacterial effects (MIC ≤ 75 µM against S. aureus or E. coli), achieving an 81.58% positive design rate. Specifically, 23 candidates showed MICs of ≤25 µM against E. coli, and 11 showed MICs of ≤25 µM against S. aureus (Fig. 3a and Data S3). In detail, 19 out of 20 anti-E. coli candidates and 8 out of 18 anti-S. aureus candidates demonstrated antibacterial activity (Fig. 3a and Data S3). Overall, the success rate for designing target-specific AMPs was 95% for Gram-negative bacteria (E. coli) and 44.4% for Gram-positive bacteria (S. aureus). Among the validated AMP candidates, 9 exhibited inhibitory effects against both Gram-negative and Gram-positive bacteria (Fig. 3a, c). Notably, AMP-15 showed the most potent inhibitory activity, with MIC values of 0.71 µM against E. coli and 1.41 µM against S. aureus.
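
The activity criterion used above (MIC ≤ 75 µM against S. aureus or E. coli) and the resulting positive design rate can be expressed as a small helper. The peptide names and MIC values in `toy_results` are hypothetical and serve only to exercise the functions; the final line recomputes the reported rate from the counts stated in the text (31 of 38).

```python
ACTIVITY_THRESHOLD_UM = 75.0  # activity cutoff from the text, in µM


def is_active(mic_by_strain, threshold=ACTIVITY_THRESHOLD_UM):
    """A peptide counts as active if its MIC is at or below the threshold
    against at least one tested strain. `None` means no inhibition measured."""
    return any(mic is not None and mic <= threshold for mic in mic_by_strain.values())


def positive_rate_percent(results):
    """Percentage of peptides in `results` that are active."""
    active = sum(1 for mics in results.values() if is_active(mics))
    return 100 * active / len(results)


# Hypothetical measurements, not data from the study.
toy_results = {
    "pep-A": {"E. coli": 1.0, "S. aureus": 2.0},
    "pep-B": {"E. coli": 150.0, "S. aureus": 12.5},
    "pep-C": {"E. coli": 150.0, "S. aureus": None},
    "pep-D": {"E. coli": 75.0, "S. aureus": 300.0},
}

# Reported counts from the text: 31 active peptides out of 38 synthesized.
reported_rate = round(100 * 31 / 38, 2)

if __name__ == "__main__":
    print(positive_rate_percent(toy_results))
    print(reported_rate)
```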

To further assess the potential of AMP candidates as antibiotic alternatives, we selected the nine top-performing peptides for additional antibacterial and hemolytic assays. Notably, the sequences generated by AMPGen demonstrated strong efficacy (Fig. 3c, Data S4, and Fig. S6). Specifically, these nine most effective AMP candidates, each demonstrating activity against both Gram-negative and Gram-positive bacteria with MIC < 5 μM, were subjected to further antibacterial analysis against additional strains of S. aureus (ATCC 25923) and E. coli (ATCC 25922), as well as other pathogens including P. aeruginosa (ATCC 27853) and E. faecalis (ATCC 29212). The results (Fig. 3c and Data S4) revealed that all selected candidates displayed strong inhibitory effects against all four tested bacteria, except for AMP-17 and AMP-20, which exhibited MIC values above 25 μM against P. aeruginosa. Although we initially designed AMP sequences for specific targets (E. coli and S. aureus), AMPGen’s modules effectively captured functional information from the AMP dataset within the model’s latent space, enabling the target-specific design of AMP sequences. The results indicated that AMPGen successfully generated potent AMPs targeted at specific pathogens, while some of the designed peptides also exhibited broad-spectrum antimicrobial properties.

This antibacterial mode of action was further confirmed by propidium iodide staining and microscopy (Fig. 4), which clearly indicated membrane disruption upon AMP treatment. For AMPs to serve as effective antibiotic alternatives, they would ideally exhibit strong antibacterial activity with minimal hemolytic effects, indicating selectivity for targeting bacteria over human cells. Comparative analyses of cytotoxic concentration (CC50), hemolytic concentration (HC50), and minimum inhibitory concentrations (MICs) against E. coli (K88) and S. aureus (ATCC 29213) indicated favorable selectivity profiles between antibacterial activity and hemolytic effects (Figs. 4a, b, S6 and S7).

Fig. 4: Bioactivity characterization and action of functional AMPs on E. coli.

a Quantification of membrane damage in E. coli K88 treated with different AMPs using propidium iodide fluorescence. The fluorescence intensity, indicative of membrane permeabilization, is shown as mean ± standard deviation (n = 3). The negative control (NC) represents untreated E. coli K88, while other bars represent various AMPs. Polymyxin B is included as a reference antibiotic. Statistical significance is indicated by asterisks (***), P < 0.001 (one-way ANOVA). b Half-maximal cytotoxic concentration (CC50) and half-maximal hemolytic concentration (HC50) values of AMP candidates, along with minimum inhibitory concentrations (MIC) against E. coli K88 and S. aureus (ATCC 29213). All experiments were conducted in triplicate (n = 3 independent experiments) and the results were averaged. Concentration values are expressed in log10 μg/mL. c Fluorescence microscopy images of E. coli K88 cells untreated (control) or treated with Polymyxin B (positive control) and AMPs. Red fluorescence indicates propidium iodide staining, which binds to DNA upon cell membrane disruption and thus marks bacterial cells with compromised membranes. Untreated E. coli K88 serves as the control, displaying minimal fluorescence. Scale bar = 5 μm. All experiments were performed in triplicate, yielding reproducible outcomes. A representative figure is presented to illustrate the findings.

Effectively assimilated knowledge of the AMPs

To further characterize the conformations and uniqueness of the validated AMP sequences, we analyzed their predicted structures and conducted sequence similarity searches in relevant databases. Based on the conformational characteristics, AMPs can be broadly categorized into α-helical peptides, β-sheet-containing peptides, structured linear peptides, and other mixed-structure peptides14,22. In this study, we employed AlphaFold 3 to predict the conformations of the 38 experimentally verified AMP candidates designed by AMPGen. Based on the AlphaFold 3 results, the majority of these sequences were identified as α-helical AMPs, followed by β-sheet-containing AMPs, with some also classified as structured linear AMPs, including AMP-11, AMP-14, and AMP-33 (Fig. 3a). Additionally, we used PepFold4, a tool specifically designed for predicting the conformation of short peptides (typically less than 36 amino acids), for comparison32. PepFold4 and AlphaFold 3 generated largely comparable predictions (Fig. S8). It is important to note that these are computed preferred structures; the actual conformations may change due to the inherent flexibility of the peptide, such as when the AMP interacts with the biofilm. This diversity in conformational structures among the designed AMPs indicates that AMPGen has effectively assimilated comprehensive evolutionary information from the OpenFold database and functional information from the AMP-MSA conditional dataset.

Furthermore, the AMPs designed by AMPGen have not been previously reported in any existing databases. A comparative analysis of the validated sequences against the non-redundant (nr) protein sequence database (which includes entries from GenPept, SwissProt, PIR, PRF, PDB, and NCBI RefSeq) using BLAST revealed no significantly matching sequences (Data S5). Among the 40 sequences analyzed, 18 showed no hits, while the remaining 22 exhibited a percent identity of 72.45% ± 12.0% and a query cover of 83% ± 20.0% (Table S2 and Data S6). Additionally, the AMP candidates showed evolutionary diversity (Data S7). These findings suggest that AMPGen is capable of successfully designing active AMP sequences that are not currently identifiable through existing data mining approaches.

Although the modules involved in AMPGen function as black box models, the compelling verification results clearly demonstrate its ability to learn within the latent space and uncover hidden patterns and principles in protein evolutionary data.

Discussion

Recent advancements in generative models have enabled the development of protein generation systems capable of autonomously creating proteins from scratch. Several works have used deep generative models to generate AMPs17,21,33,34,35,36,37,38,39,40. For instance, PepGAN, a GAN-based AMP generation model, produced 6 top-ranked peptides, of which only one exhibited a notably potent effect with an MIC of 3.1 μg/mL21. Another study explored the extensive virtual peptide space by enumerating a vast number of sequences composed of 6–9 amino acids. This approach successfully identified several active AMPs, including three hexapeptide AMPs. However, as the length of peptide sequences increases, the data volume and computational demands grow exponentially, posing significant challenges to scalability and feasibility39. In a different study40, a variational autoencoder (VAE) model pre-trained on approximately 1.5 million peptide sequences from the UniProt database was employed. This model, when fine-tuned through transfer learning on a smaller dataset of around 5,000 experimentally verified AMPs, enabled the generation of peptide sequences. A CNN/RNN model was subsequently employed to predict the MIC of these candidates, facilitating their ranking as AMP candidates. This approach successfully identified 500 potential AMPs, of which 30 were experimentally confirmed to exhibit antibacterial properties. The above models were not pre-trained on the complete protein dataset. It has also been shown that integrating pretrained protein language models, such as ProtT5 and ESM-2, with diffusion models is an effective strategy to generate peptides34,41. In addition, some studies have used generative models to design AMP sequences but did not perform functional verification through physical synthesis and experimental evaluation42,43. Experimental validation is essential for accurately assessing the functional potential of generated sequences. In this study, we synthesized the sequences generated by our model and determined their MIC values following the Clinical and Laboratory Standards Institute (CLSI, M100, 30th ed.). To strengthen the evaluation, we also synthesized representative sequences from previous studies and measured their MIC values under the same conditions for parallel validation and comparison (Tables S3 and S4, Data S8).

AMPGen stands at the forefront of AMP design by generating a wider variety of AMP sequences with strong antibacterial activity (Tables S3 and S4, Data S8, Fig. S9). As variations in target strains and synthesis methods can significantly affect the antimicrobial efficacy of AMPs, direct comparisons across studies are challenging. To address this issue, we selected six of the top-performing sequences from various studies (Data S8), with lengths less than 35 amino acids (where available) and the lowest reported MICs against E. coli (Table S4). These sequences were synthesized and tested for MIC using the same protocol as our AMPs. The results demonstrated that AMPGen-generated sequences performed comparably to the top-ranked sequences (Data S4 and Table S4). We then calculated the pairwise identity between the sequences designed by AMPGen and those from reported works, a common method for quantifying sequence similarity that helps assess how novel or diverse the generated sequences are. Sequence alignment was performed with the BLOSUM62 matrix as the scoring scheme. The results indicated that AMPGen generated the most diverse sequences, exhibiting the lowest average pairwise identity (Fig. S9). This suggested that AMPGen was capable of exploring a broader sequence space compared to existing models, which may be attributed to its integration of sequence evolutionary information into the diffusion-based generation process. The resulting low redundancy among AMPGen-generated sequences underscores its strong potential for innovation in AMP design, making it a promising tool for expanding the repertoire of antimicrobial therapeutics. While the incorporation of MSAs introduces some additional computational cost, this overhead remains relatively modest and does not substantially impact the model’s overall efficiency.
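
The pairwise identity measure mentioned above can be sketched with a small global aligner. To stay dependency-free, this sketch swaps in a simple match/mismatch/gap scheme instead of the BLOSUM62 matrix used in the study, and it defines identity as identical aligned positions divided by alignment length; both choices are assumptions, so the numbers it produces are not comparable to Fig. S9.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a simple scoring scheme
    (a stand-in for BLOSUM62). Returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of alignment length."""
    aligned_a, aligned_b = global_align(a, b)
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100 * matches / len(aligned_a)


def mean_pairwise_identity(sequences):
    """Average percent identity over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(mean_pairwise_identity(["KLLK", "KLK", "AAAA"]))
```

A lower mean pairwise identity within a set indicates less redundancy among its members, which is the sense in which the text describes the generated sequences as diverse.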

Diffusion models are particularly well-suited for generating peptides due to their capacity to produce diverse outputs and their underlying mechanisms, which can mimic the natural processes of protein evolution44. In living organisms, gene mutations occur as changes in the DNA sequence, which can alter codons—the triplet sequences of nucleotides in DNA or mRNA that specify particular amino acids. When mutations occur, they can result in different codons being formed during transcription and translation, potentially altering the amino acid sequence of the resulting protein45. These changes can impact the protein’s structure and function, leading to various biological consequences. Interestingly, the evolutionary process of proteins in living organisms mirrors the point-by-point addition and reduction of noise within an order-independent diffusion model25. In this analogy, the modulation of noise at each site corresponds to amino acid “mutations” within the latent space, with the generator trained to recognize permissible computational “mutations” within the vast universal protein library. The generator, based on an order-agnostic autoregressive diffusion model, captures the evolutionary information inherent in amino acid sequences, with the goal of generating sequences that are both biologically plausible and evolutionarily sound. During protein evolution, mutations occur randomly and are finally represented as matrixed MSA data for input into the model. As a result, the order-agnostic model is well-suited to learning protein evolution. Considering the conformation heterogeneity of peptides, we implemented the generation module based on protein evolution, which does not rely on PDB structural data.

Generating proteins with specific functions represents one of the most promising yet challenging frontiers in the application of large-scale models. A key challenge in this field is establishing a reliable association between protein sequence and function, particularly given the scarcity of functional protein data. To overcome these challenges, we employ a modular cascade model that enhances the accuracy of peptide generation. Cascade models improve accuracy by incrementally refining decisions or predictions46. Initial stages can quickly eliminate obvious cases, allowing subsequent stages to focus on more complex or nuanced data. In AI cascade models, the interdependence of modules can enhance overall model performance. However, errors or inaccuracies in any single module can propagate and magnify throughout the processing chain, ultimately leading to a substantial degradation or even complete system failure—a phenomenon known as cascading failure. In AMPGen, the discriminator plays a crucial role in mitigating cascading failures. This is important because all scorers within the model are trained on antibacterial data from empirical experiments. Given that the antibacterial dataset predominantly consists of AMPs, the scorers exhibit low confidence when evaluating non-AMP sequences. To address this limitation, we introduce a discriminator trained on both AMP and non-AMP datasets, which we specifically curated. We deliberately selected different models for the discriminator and scorer modules to avoid redundancy and allow task specialization. This enhancement mitigates scoring bias, leading to more reliable performance across a broader range of sequences.

We designed an XGBoost-based classifier as the discriminator and an LSTM as the scorer to leverage their complementary strengths. One of the key benefits of XGBoost is its robust performance in handling tabular data and its ability to effectively capture complex, non-linear relationships between features, particularly in scenarios where feature importance can guide decision-making. XGBoost has performed well in leveraging features derived from peptide sequences47. LSTM networks, on the other hand, are specifically designed to handle sequential data and can capture long-range dependencies between amino acids in a peptide, which is crucial for understanding its function. Through their non-linear nature, LSTMs are also well-suited for regression tasks in which the relationship between the input features (here, peptide sequences embedded by ESM-2) and the output (MIC values) is non-linear39. Given the complexity of biological data, linear relationships are unlikely to be sufficient, necessitating the use of non-linear machine learning methods. Both XGBoost and LSTM are capable of capturing non-linear relationships: XGBoost does so by combining multiple decision trees to model complex decision boundaries, while LSTM achieves this through deep neural network layers in which each neuron applies a non-linear activation function to its input. It is worth noting that the design and structure of AMPGen can also be applied to the development of other short peptide families, such as therapeutic peptides capable of crossing membranes or targeting intracellular sites.

It is not unexpected that designed peptides generally retain broad-spectrum properties because the design strategy focused on selecting sequences with high scores for a specific target, while not excluding potential applicability to other targets. Moreover, AMPs often exhibit broad-spectrum activity due to their conserved mechanisms of bacterial membrane disruption; achieving strict species selectivity remains a challenge. Additionally, the training dataset of the scorer is based on the species level, while in reality, the same AMP has different antibacterial effects on different strains of the same species, which introduces noise to the target-specific scorer. Therefore, developing models for target-specific AMPs remains a challenge and may require the use of more precise negative datasets.

In conclusion, the incorporation of diffusion models and evolutionary information within a cascade architecture provides a promising approach for the design of functional peptides. By leveraging the diversity-generating capabilities of diffusion models and the informative power of evolutionary data, AMPGen addresses key challenges in de novo AMP design. The resulting AMPs exhibit broad-spectrum activity and improved efficacy, offering a robust solution to the growing problem of antimicrobial resistance. This innovative methodology not only enhances the efficiency of AMP generation but also paves the way for the development of functional peptides. At the same time, this study has certain limitations. Although we have generated candidate AMPs, there remains a long path toward clinical application. Further experimental validation at advanced stages will provide stronger driving forces for future model development. Overall, while generative large models hold great promise for short peptide design, challenges such as the limitations of biological data, unknown functional landscapes, and model interpretability still remain. Foundational models that are better aligned with the complexity of protein language are still a vision.

Methods

Dataset preparation

Both the AMP and non-AMP datasets were used in the AMP classification model and the MIC regression prediction model, and the AMP dataset was also used as a prompt (conditioning input) when generating AMPs (Fig. S10).

The AMP dataset was compiled from six public AMP databases: APD48, DADP49, DBAASP50, DRAMP51, YADAMP52, and dbAMP53. AMP data from these databases were merged, deduplicated, and filtered to remove incomplete or meaningless entries. The final database consisted of 10,249 unique sequences, of which 9854 are annotated with bacterial targets.

The negative dataset (non-AMP sequences) was sourced from the UniProt database54. All sequences from the UniProt database with lengths ranging from 5 to 65 amino acids were filtered to exclude those associated with 17 specific keywords: antimicrobial, antibiotic, antibacterial, antiviral, antifungal, antimalarial, antiparasitic, anti-protist, anticancer, defense, defensin, cathelicidin, histatin, bacteriocin, microbicidal, fungicide, and toxin. As with the positive dataset, sequences containing ambiguous amino acids (indicated by U, O, B, Z, J, or X) were excluded. This resulted in a total of 11,989 peptide sequences labeled as non-AMP.
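The filtering rules for the negative dataset can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name `keep_as_non_amp`, the free-text `annotation` argument, and the case-insensitive substring matching are assumptions; only the length range, the 17 keywords, and the ambiguous residue codes are taken from the text.

```python
# Keywords and residue codes as listed in the text.
EXCLUDED_KEYWORDS = (
    "antimicrobial", "antibiotic", "antibacterial", "antiviral", "antifungal",
    "antimalarial", "antiparasitic", "anti-protist", "anticancer", "defense",
    "defensin", "cathelicidin", "histatin", "bacteriocin", "microbicidal",
    "fungicide", "toxin",
)
AMBIGUOUS_RESIDUES = set("UOBZJX")


def keep_as_non_amp(sequence, annotation, min_len=5, max_len=65):
    """Return True if a UniProt-like record passes the non-AMP filters.

    `annotation` is a hypothetical free-text field (keywords, description);
    how the authors matched keywords against UniProt records is not stated.
    """
    if not (min_len <= len(sequence) <= max_len):
        return False
    if AMBIGUOUS_RESIDUES & set(sequence.upper()):
        return False
    text = annotation.lower()
    return not any(keyword in text for keyword in EXCLUDED_KEYWORDS)
```

Substring matching is deliberately conservative here: a record annotated as "Defensin-like" would be dropped because it contains "defensin".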

De novo AMP generation

Two pre-trained order-agnostic autoregressive diffusion models (OADMs) were deployed for de novo AMP sequence generation: one was trained on amino acid sequence data, and the other on evolutionary multiple sequence alignment (MSA) data55. In both models, the length of the generated sequences was set to 15–35 amino acids, a common length range for AMPs that is also amenable to industrial-scale chemical synthesis. Sequences containing ambiguous amino acids (indicated by U, O, B, Z, J, or X) were excluded from the dataset.

The diffusion model used in this study generalizes traditional left-to-right autoregressive models by allowing sequence generation in any arbitrary order, not just a fixed left-to-right progression. This flexibility is particularly advantageous for generating short peptide sequences like AMPs, where the order of amino acids does not necessarily follow a natural or obvious progression.

Mathematically, the model operates by first sampling a random decoding order \({{{\rm{\sigma }}}}\) from all possible orders \({S}_{L}\), where L is the sequence length. The log-likelihood of generating a sequence x is then expressed as an expectation over all possible decoding orders:

$$\log p\left(x\right)\approx {{\mathbb{E}}}_{{{{\rm{\sigma }}}}\sim U\left({S}_{L}\right)}\left[{\sum }_{t=1}^{L}\log p\left({x}_{{{{\rm{\sigma }}}}\left(t\right)}|{x}_{{{{\rm{\sigma }}}}\left( < t\right)}\right)\right]$$

Here, \({x}_{{{{\rm{\sigma }}}}\left(t\right)}\) denotes the amino acid decoded at step t under the order \({{{\rm{\sigma }}}}\), and \({x}_{{{{\rm{\sigma }}}}\left( < t\right)}\) represents all amino acids decoded before step t in this order.

The model is trained by minimizing the loss function derived from this log-likelihood, which involves predicting the probability distribution of each amino acid in the sequence conditioned on its preceding amino acids as determined by the randomly sampled order \({{{\rm{\sigma }}}}\). This training process allows the model to learn from predictions of all masked positions at each timestep, thus generalizing the autoregressive framework to consider all possible decoding orders.
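As an illustration of order-agnostic decoding, the sketch below samples a decoding order uniformly at random, fills in one position per step, and accumulates the log-probability of the chosen residues, which is a single-order estimate of the expectation above. The `uniform_predictor` function is a hypothetical stand-in for the trained network and simply returns a uniform distribution over the 20 standard amino acids; the mask symbol and function names are likewise assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # placeholder for positions not yet decoded


def uniform_predictor(partial_sequence, position):
    """Hypothetical stand-in for the network: p(residue | context) is uniform."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


def sample_order_agnostic(length, predictor, rng):
    """Decode a sequence in a random order and return (sequence, log-probability)."""
    order = list(range(length))
    rng.shuffle(order)  # sigma drawn uniformly from all permutations S_L
    sequence = [MASK] * length
    log_prob = 0.0
    for position in order:
        probs = predictor(sequence, position)
        residues = list(probs)
        weights = [probs[r] for r in residues]
        choice = rng.choices(residues, weights=weights, k=1)[0]
        log_prob += math.log(probs[choice])  # log p(x_sigma(t) | x_sigma(<t))
        sequence[position] = choice
    return "".join(sequence), log_prob


peptide, log_p = sample_order_agnostic(20, uniform_predictor, random.Random(0))
```

With a real model, `predictor` would condition on the residues already placed, so different decoding orders would generally yield different log-probabilities for the same sequence.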

The sequence-based model Evodiff-OA_DM_640M was pre-trained on the UniRef50 dataset, which contains 42 million protein sequences and is provided by the UniProt Reference Clusters databases56. The model adopts ByteNet, a CNN architecture that runs in time linear in the sequence length and sidesteps the need for excessive memorization57. We used the sequence-based model to unconditionally generate peptide sequences of 15–35 amino acids. Generating 50,000 sequences required approximately 5 days using a single NVIDIA A6000 GPU.

The MSA-based model Evodiff-MSA_OA_DM_MAXSUB was trained on the OpenFold dataset, containing 401,381 MSAs for approximately 132,000 unique Protein Data Bank sequences and approximately 15 million UniClust30 clusters58. This model adopted a 100 M parameter MSA Transformer architecture, which processes the 2D MSA input by interleaving row and column attention across the MSA matrix59.

Mathematically, the MSA Transformer represents the input MSA as a matrix \(x\in {R}^{M\times L}\), where M is the number of sequences and L is the sequence length. The model applies axial attention, alternating between attention over rows and columns of the matrix, thereby reducing the computational complexity to \({{{\mathscr{O}}}}\left(M{L}^{2}\right)\) for row attention and \({{{\mathscr{O}}}}\left(L{M}^{2}\right)\) for column attention. Additionally, tied row attention is employed to share attention maps across sequences, leveraging the shared structure among the aligned sequences, and using square-root normalization to maintain consistent attention weights across the MSA. The model is pre-trained using a masked language modeling (MLM) objective, where the loss function is given by:

$${L}_{{MLM}}\left(x;{{{\rm{\theta }}}}\right)=-{\sum}_{\left(m,i\right)\in {\mbox{mask}}}\log p\left({x}_{{mi}}|\widetilde{x};{{{\rm{\theta }}}}\right).$$

Here, \(\left(m,i\right)\) refers to the masked positions, and \(p\left({x}_{{mi}}|\widetilde{x};{{{\rm{\theta }}}}\right)\) represents the probability of correctly predicting the masked amino acid at position \(\left(m,i\right)\), given the masked MSA \(\widetilde{x}\).
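A minimal sketch of how such a masked-position objective can be evaluated is given below. It is written as a negative log-likelihood so that lower values are better, which is the usual convention for a loss. The data layout (an MSA as a list of equal-length strings and predictions as a dictionary keyed by (m, i)) and the function name are assumptions made for illustration.

```python
import math


def masked_lm_loss(pred_probs, msa, masked_positions):
    """Sum of -log p(x_mi | masked MSA) over the masked positions (m, i).

    pred_probs[(m, i)] is a dict mapping residues to predicted probabilities
    at row m, column i; msa[m][i] is the true residue at that position.
    """
    loss = 0.0
    for m, i in masked_positions:
        true_residue = msa[m][i]
        loss += -math.log(pred_probs[(m, i)][true_residue])
    return loss


# Toy example: a two-row MSA with two masked positions.
toy_msa = ["ACD", "AGD"]
toy_pred = {
    (0, 1): {"C": 0.5, "G": 0.5},
    (1, 2): {"D": 0.25, "E": 0.75},
}
toy_loss = masked_lm_loss(toy_pred, toy_msa, [(0, 1), (1, 2)])
```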

Using the MSA-based model, we generated amino acid sequences in two ways: unconditional generation of peptide sequences of 15–35 amino acids (MSA-based), and conditional generation in which MSAs with known AMP sequences as representative sequences were masked (MSA-conditional). For MSA-conditional generation, we first constructed an AMP MSA dataset as the diffusion model input. MSAs were generated for each AMP sequence in the AMP dataset described above by searching UniClust3060 with HHblits23. Sequences for which no MSA could be obtained were removed. The average depth of the MSAs was 34.5, and the generated MSAs were in A3M format. In MSA-based generation, uniformly sampled inputs were used instead. In both MSA-conditional generation (70,000 sequences required approximately 6 days using a single NVIDIA A6000 GPU) and MSA-based generation (70,000 sequences required approximately 72 days using a single NVIDIA A6000 GPU), the generated sequence length was set to 15–35 amino acids.

XGBoost-based AMP classification

Dataset preparation. Sequences in the AMP dataset described above were filtered based on length, retaining those within the range of 5 to 65 amino acids, which resulted in a total of 9964 AMP-labeled peptide sequences as the positive dataset.

Prior to training the XGBoost model, the peptide sequences in the dataset were converted into numerical features. We adopted the feature set reported in the literature for machine learning prediction of MIC values61. These representative features for encoding AMP sequences were selected from an initial set of 4481 features61 in a two-step process. First, a Random Forest (RF) model with 400 trees was employed to rapidly assess the regression performance of each feature type examined. The DBAASP40 dataset (3929 items in total) was randomly split into training, validation, and test sets, with the test set comprising 10% of the data and the validation set comprising 10% of the remaining data. To ensure reliable error estimation, this splitting process was repeated three times, with each experiment run in triplicate. The average Pearson correlation coefficient (PCC) scores on the three validation sets were then calculated for each feature type to identify the top performers as follows:

$${PCC}={{{\rm{\rho }}}}=\frac{{\sum }_{i=1}^{N}\left({y}_{i}-{{{{\rm{\mu }}}}}_{y}\right)\left(\hat{{y}_{i}}-{{{{\rm{\mu }}}}}_{\hat{y}}\right)}{\sqrt{{\sum }_{i=1}^{N}{\left({y}_{i}-{{{{\rm{\mu }}}}}_{y}\right)}^{2}{\sum }_{i=1}^{N}{\left(\hat{{y}_{i}}-{{{{\rm{\mu }}}}}_{\hat{y}}\right)}^{2}}},$$

where \({y}_{i}\) represents the true target value, \(\hat{{y}_{i}}\) the predicted value, \({{{{\rm{\mu }}}}}_{y}\) and \({{{{\rm{\mu }}}}}_{\hat{y}}\) are the means of the true and predicted values respectively, and N is the total number of samples. This step allowed us to rank the feature types based on their predictive power.
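The PCC defined above can be computed directly from its definition. The following is a small pure-Python sketch; the function name is an assumption, and in practice a library routine would normally be used instead.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mu_y = sum(y_true) / n
    mu_p = sum(y_pred) / n
    covariance = sum((a - mu_y) * (b - mu_p) for a, b in zip(y_true, y_pred))
    var_y = sum((a - mu_y) ** 2 for a in y_true)
    var_p = sum((b - mu_p) ** 2 for b in y_pred)
    return covariance / math.sqrt(var_y * var_p)
```

Because the PCC is invariant to linear rescaling of the predictions, it measures how well a feature type ranks peptides by MIC rather than how close the predicted values are in absolute terms.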

Subsequently, the identified top-performing feature types were evaluated using a Multi-Branch-CNN-Attention model. Throughout these evaluations, a variety of combinations of these features were tested to find the optimal combination. The models were trained using five-fold cross-validation, and their performances were compared. The optimal number of features was ultimately selected based on the average PCC scores, leading to the feature combination that yielded the best predictive accuracy for MIC values against E. coli. This feature combination was validated in machine learning models including the TML17 model (which automates the evaluation of 17 different machine learning models), the RF model, and the Support Vector Machine (SVM) model.

We applied the above optimal feature combination to the XGBoost binary classification model. The selected features were primarily derived from the PseKRAAC encoding method28, which includes various clustering types encoding parameters, as well as the QSOrder29 encoding method. The final selection comprised 14 categories encompassing 1311 features.

In model training, following the feature engineering process, the data was used to train an XGBoost model62. In this training process, AMP sequences were labeled as 1, and non-AMP sequences were labeled as 0. The XGBoost model optimizes a regularized objective function, which is designed to balance the model’s accuracy and complexity, thereby preventing overfitting. The objective function is defined as:

$$L\left(\phi \right)={\sum }_{i=1}^{n}l\left(\hat{{y}_{i}},{y}_{i}\right)+{\sum }_{k=1}^{K}\Omega \left({f}_{k}\right),$$

where \(\hat{{y}_{i}}={{{\rm{\phi }}}}\left({x}_{i}\right)={\sum }_{k=1}^{K}{f}_{k}\left({x}_{i}\right)\) is the prediction for instance \({x}_{i}\), and \(\Omega \left(f\right)\) is the regularization term:

$$\Omega \left(f\right)={{{\rm{\gamma }}}}T+\frac{1}{2}{{{\rm{\lambda }}}}{\sum }_{j=1}^{T}{w}_{j}^{2},$$

with T representing the number of leaves in the tree, \({w}_{j}\) the weight of the j-th leaf, \({{{\rm{\gamma }}}}\) controlling the number of leaves, and \({{{\rm{\lambda }}}}\) controlling the \({L}_{2}\) norm of the leaf weights.

During training, the model is built in an additive manner, optimizing the following objective at each iteration t:

$${L}^{\left(t\right)}={\sum }_{i=1}^{n}l\left({y}_{i},\widehat{{y}_{i}^{\left(t-1\right)}}+{f}_{t}\left({x}_{i}\right)\right)+\Omega \left({f}_{t}\right),$$

which is approximated using a second-order Taylor expansion:

$${L}^{\left(t\right)}\approx {\sum }_{i=1}^{n}\left[{g}_{i}{f}_{t}\left({x}_{i}\right)+\frac{1}{2}{h}_{i}{f}_{t}{\left({x}_{i}\right)}^{2}\right]+\Omega \left({f}_{t}\right),$$

where \({g}_{i}\) and \({h}_{i}\) are the first and second-order gradients, respectively.
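For concreteness, the sketch below evaluates \({g}_{i}\) and \({h}_{i}\) for one common choice of loss, the binary logistic loss on a raw (pre-sigmoid) score. The text does not state which loss function l was used, so the loss choice and the function name are assumptions.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score.

    With p = sigmoid(raw_score) and loss = -[y log p + (1 - y) log(1 - p)],
    the gradient is p - y and the Hessian is p (1 - p).
    """
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


# An AMP (label 1) currently scored at 0 has g = -0.5 and h = 0.25,
# so the next tree is pushed to increase its score.
g_example, h_example = logistic_grad_hess(1, 0.0)
```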

Model tuning was conducted based on the F1 score and the AUC using 10-fold cross-validation to prevent overfitting. The F1 score is a metric that considers both precision and recall to compute a balanced measure of a model’s accuracy, particularly when dealing with imbalanced classes. It is defined as the harmonic mean of precision and recall:

$${{{\rm{F1}}}}\;{{{\rm{Score}}}}=2\times \frac{{{{\rm{Precision}}}}\times {{{\rm{Recall}}}}}{{{{\rm{Precision}}}}+{{{\rm{Recall}}}}},$$

where recall (also known as sensitivity or true positive rate) is the proportion of actual positive cases correctly identified by the model, calculated as:

$${{{\rm{Recall}}}}=\frac{{TP}}{{TP}+{FN}},$$

with TP representing true positives and FN representing false negatives.
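These metrics follow directly from confusion-matrix counts. The short sketch below also computes precision, TP / (TP + FP), which the F1 score requires; the function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 6 true positives with 2 false positives and 6 false negatives give a precision of 0.75, a recall of 0.5, and an F1 score of 0.6.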

The optimal split at each node during training was determined by maximizing the gain:

$${{{\rm{Gain}}}}=\frac{1}{2}\left[\frac{{\left({\sum}_{i\in {I}_{L}}{g}_{i}\right)}^{2}}{{\sum}_{i\in {I}_{L}}{h}_{i}+{{{\rm{\lambda }}}}}+\frac{{\left({\sum}_{i\in {I}_{R}}{g}_{i}\right)}^{2}}{{\sum}_{i\in {I}_{R}}{h}_{i}+{{{\rm{\lambda }}}}}-\frac{{\left({\sum}_{i\in I}{g}_{i}\right)}^{2}}{{\sum}_{i\in I}{h}_{i}+{{{\rm{\lambda }}}}}\right]-{{{\rm{\gamma }}}},$$

where \({I}_{L}\) and \({I}_{R}\) represent the left and right child nodes after the split.
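The gain formula can be evaluated from the per-sample gradients and Hessians on each side of a candidate split. The sketch below is a direct transcription of the expression above; the function names and the default values of \({{{\rm{\lambda }}}}\) and \({{{\rm{\gamma }}}}\) are illustrative assumptions.

```python
def leaf_score(g_sum, h_sum, lam):
    """Structure score (sum g)^2 / (sum h + lambda) of a single leaf."""
    return g_sum ** 2 / (h_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children."""
    gl, hl = sum(g_left), sum(h_left)
    gr, hr = sum(g_right), sum(h_right)
    children = leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
    parent = leaf_score(gl + gr, hl + hr, lam)
    return 0.5 * (children - parent) - gamma


# Two positives (g = -0.5) go left and two negatives (g = +0.5) go right.
toy_gain = split_gain([-0.5, -0.5], [0.25, 0.25], [0.5, 0.5], [0.25, 0.25])
```

A split is only worthwhile when the gain is positive, so a larger \({{{\rm{\gamma }}}}\) prunes splits whose improvement in the objective is small.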

Shrinkage (learning rate \({{{\rm{\eta }}}}\)) was applied to each tree’s predictions to scale them:

$$\widehat{{y}_{i}^{\left(t\right)}}=\widehat{{y}_{i}^{\left(t-1\right)}}+{{{\rm{\eta }}}}\;{f}_{t}\left({x}_{i}\right).$$

Shrinkage helps to prevent overfitting by reducing the impact of each individual tree. The best-performing model, determined through cross-validation, was then utilized for subsequent analyses.

The hyperparameters to be tuned included the learning rate (lr), number of estimators (ne), and maximum depth (md). Grid search was used to find the optimal combination of hyperparameters. Specifically, the lr was tested with values of [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], the ne was varied from 50 to 2000 in increments of 100, and the md was varied from 4 to 15. Each combination was trained and validated using 10-fold cross-validation only. The final chosen hyperparameters were those that achieved the best validation performance (lr = 0.1, md = 5, ne = 2000).
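The structure of such a grid search can be sketched as an exhaustive loop over the Cartesian product of the candidate values. In the sketch below, `toy_cv_score` is a hypothetical stand-in for the mean 10-fold cross-validation score and is deliberately constructed to peak at the reported optimum so that the behavior of the loop is easy to check; it is not a measured result. The lr and md grids follow the text, whereas the ne values are a few illustrative points and do not reproduce the exact grid.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return the parameter combination with the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_score(params):
    """Hypothetical score surface; a real run would train and validate
    an XGBoost classifier on each of the 10 folds here."""
    return (-abs(params["lr"] - 0.1)
            - abs(params["md"] - 5) / 10
            - abs(params["ne"] - 2000) / 10000)


param_grid = {
    "lr": [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "md": list(range(4, 16)),
    "ne": [50, 500, 1000, 2000],  # illustrative subset, not the full grid
}
best_params, best_score = grid_search(param_grid, toy_cv_score)
```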

LSTM regression-based MIC scoring

Dataset preparation. All entries in the AMP dataset described above with MIC values were included. Sequences containing ambiguous amino acids, shorter than 5 amino acids, or longer than 65 amino acids were removed. MIC prediction was modeled mainly for two bacteria, Escherichia coli (Gram-negative representative) and Staphylococcus aureus (Gram-positive representative). The AMP sequences targeting Escherichia coli totaled 7100, while those targeting Staphylococcus aureus totaled 6482. For sequences with multiple MIC values against the same target, the arithmetic mean of the MIC values was calculated. All values were converted to a uniform unit of μM and then log-transformed (log10). To keep the number of positive and negative samples balanced, we randomly selected 60% of the negative dataset described in the XGBoost section above, resulting in 7193 sequences, all of which were labeled with a log MIC value of 4. A log MIC of 4 is interpreted as indicating that these sequences do not possess antimicrobial activity, as they would require very high concentrations to exhibit any inhibitory effect on bacterial growth. This labeling helps in training the regression model.
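The label preparation described above can be sketched as follows. The text does not state how units were converted, so the mass-to-molar conversion via molecular weight, the function names, and the toy records are assumptions; the averaging and log10 transformation follow the text. Note that a log MIC of 4 corresponds to 10^4 μM, that is, 10 mM.

```python
import math
from collections import defaultdict

NON_AMP_LOG_MIC = 4.0  # label assigned to non-AMP sequences in the text


def ug_per_ml_to_um(conc_ug_per_ml, mol_weight_g_per_mol):
    """Convert a concentration from ug/mL to uM using the molecular weight."""
    return conc_ug_per_ml / mol_weight_g_per_mol * 1000.0


def log_mic_labels(records):
    """Map each sequence to log10 of its arithmetic-mean MIC (in uM).

    `records` is an iterable of (sequence, MIC in uM) pairs for one target.
    """
    grouped = defaultdict(list)
    for sequence, mic_um in records:
        grouped[sequence].append(mic_um)
    return {seq: math.log10(sum(v) / len(v)) for seq, v in grouped.items()}


toy_labels = log_mic_labels([("KKLL", 5.0), ("KKLL", 15.0), ("GIGK", 100.0)])
```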

Two methods for feature extraction and embedding of the datasets were compared: using the same feature extraction as for XGBoost, and utilizing a pre-trained protein language model, ESM2-t36-3B, for embedding6. The ESM-2 model, a transformer-based architecture, employs a masked language modeling (MLM) objective, where the model predicts the identity of masked amino acids in a sequence based on their surrounding context. The attention maps of the model can additionally be used to predict residue-residue contacts: the probability of a contact between two amino acids i and j, denoted as \(p\left({c}_{{ij}}\right)\), is calculated using the formula:

$$p\left({c}_{{ij}}\right)={\left(1+\exp \left(-{{{{\rm{\beta }}}}}_{0}-{\sum }_{l=1}^{L}{\sum }_{k=1}^{K}{{{{\rm{\beta }}}}}_{{kl}}{a}_{{ij}}^{{kl}}\right)\right)}^{-1},$$

where \({{{{\rm{\beta }}}}}_{0}\) is a bias term, L is the number of layers, K is the number of attention heads, and \({a}_{{ij}}^{{kl}}\) represents the symmetrized and APC-corrected attention map values for the k-th attention head in the l-th layer. The model minimizes the perplexity, defined as:

$${{{\rm{Perplexity}}}}(x)=\exp \left(-\frac{1}{L}{\sum }_{i=1}^{L}\log p\left({x}_{i}|{x}_{\ne i}\right)\right)$$

where L is the length of the sequence, and \(p\left({x}_{i}|{x}_{\ne i}\right)\) represents the conditional probability of the i-th amino acid given the rest of the sequence. This formulation enables the ESM-2 model to capture intricate structural dependencies within protein sequences, which are then embedded in a high-dimensional space where structurally and functionally similar proteins are closer together. The embedding method using the pre-trained protein language model ESM-2 demonstrated superior performance according to the Mean Squared Error (MSE) evaluations.
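The perplexity expression can be evaluated from the per-position probabilities that a masked language model assigns to the true residues. The sketch below assumes these probabilities have already been obtained (for example, by masking one position at a time); the function name is an assumption.

```python
import math


def pseudo_perplexity(true_residue_probs):
    """exp of the negative mean log-probability of the true residues.

    true_residue_probs[i] is p(x_i | x_{!=i}) for position i of the sequence.
    """
    length = len(true_residue_probs)
    mean_log_prob = sum(math.log(p) for p in true_residue_probs) / length
    return math.exp(-mean_log_prob)
```

A model that guesses uniformly among the 20 standard amino acids assigns a probability of 0.05 to every residue and therefore has a perplexity of 20, while a perfect model has a perplexity of 1.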

In model training, separate regression models were trained on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) neural networks63. Each model consisted of two LSTM layers (hidden size of 128), one dropout layer with a dropout rate of 0.7, and a fully connected layer, and was trained with a batch size of 64 for 200 epochs. The datasets were divided into training, validation, and test sets with a split of 72:18:10, respectively. The LSTM architecture uses gates to manage information flow through the network. Specifically, at each time step t, the input gate \({i}_{t}\), forget gate \({f}_{t}\), and output gate \({o}_{t}\) were computed as follows:

$${i}_{t}={{{\rm{\sigma }}}}\left({W}_{i}{x}_{t}+{R}_{i}{y}_{t-1}+{p}_{i}\odot {c}_{t-1}+{b}_{i}\right),$$
$${f}_{t}={{{\rm{\sigma }}}}\left({W}_{f}{x}_{t}+{R}_{f}{y}_{t-1}+{p}_{f}\odot {c}_{t-1}+{b}_{f}\right),$$
$${o}_{t}={{{\rm{\sigma }}}}\left({W}_{o}{x}_{t}+{R}_{o}{y}_{t-1}+{p}_{o}\odot {c}_{t}+{b}_{o}\right).$$

The cell state \({c}_{t}\) was updated according to the equation:

$${c}_{t}={z}_{t}\odot {i}_{t}+{c}_{t-1}\odot {f}_{t},$$

where \({z}_{t}=\tanh \left({W}_{z}{x}_{t}+{R}_{z}{y}_{t-1}+{b}_{z}\right)\).

The final output \({y}_{t}\) was derived from:

$${y}_{t}={o}_{t}\odot \tanh \left({c}_{t}\right).$$
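To make the order of the gate computations explicit, the sketch below implements one time step of the equations above for the scalar case (hidden size 1), where the element-wise product reduces to ordinary multiplication. The parameter dictionary and its key names are assumptions chosen to mirror the symbols in the equations. Note that the output gate uses the updated cell state \({c}_{t}\), whereas the input and forget gates use \({c}_{t-1}\).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, y_prev, c_prev, p):
    """One scalar LSTM step with peephole connections.

    p holds input weights (W*), recurrent weights (R*), peephole
    weights (p*), and biases (b*) for z, i, f, and o.
    """
    z_t = math.tanh(p["Wz"] * x_t + p["Rz"] * y_prev + p["bz"])
    i_t = sigmoid(p["Wi"] * x_t + p["Ri"] * y_prev + p["pi"] * c_prev + p["bi"])
    f_t = sigmoid(p["Wf"] * x_t + p["Rf"] * y_prev + p["pf"] * c_prev + p["bf"])
    c_t = z_t * i_t + c_prev * f_t
    o_t = sigmoid(p["Wo"] * x_t + p["Ro"] * y_prev + p["po"] * c_t + p["bo"])
    y_t = o_t * math.tanh(c_t)
    return y_t, c_t


PARAM_NAMES = ("Wz", "Rz", "bz", "Wi", "Ri", "pi", "bi",
               "Wf", "Rf", "pf", "bf", "Wo", "Ro", "po", "bo")
zero_params = {name: 0.0 for name in PARAM_NAMES}
```

With all parameters set to zero, every gate equals 0.5 and the cell input is 0, so the cell state is simply halved at each step.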

During training, gradients were computed using Backpropagation Through Time (BPTT). The gradient for each gate and cell state was calculated to update the network weights. For example, the gradient of the loss with respect to the output gate \({{{{\rm{\delta }}}}}_{{o}_{t}}\) was computed as:

$${{{{\rm{\delta }}}}}_{{o}_{t}}={{{{\rm{\delta }}}}}_{{y}_{t}}\odot \tanh \left({c}_{t}\right)\odot {{{{\rm{\sigma }}}}}^{{\prime} }\left(\bar{{o}_{t}}\right),$$

where \({{{{\rm{\delta }}}}}_{{y}_{t}}\) is the gradient passed from the subsequent layer, \(\bar{{o}_{t}}\) denotes the pre-activation input of the output gate, and \({{{{\rm{\sigma }}}}}^{{\prime} }\left(\cdot \right)\) represents the derivative of the sigmoid function.

Similarly, the gradients for the cell state \({{{{\rm{\delta }}}}}_{{c}_{t}}\), forget gate \({{{{\rm{\delta }}}}}_{{f}_{t}}\), input gate \({{{{\rm{\delta }}}}}_{{i}_{t}}\), and cell input \({{{{\rm{\delta }}}}}_{{z}_{t}}\) were computed to propagate the error back through the network and update the weights accordingly. To prevent overfitting, a dropout layer with a dropout rate of 0.7 was incorporated after the LSTM layers. The output layer was a linear transformation, and the models were trained using L2 loss, optimizing with the Adam optimizer. This configuration, along with precise gradient calculations, effectively captured temporal dependencies in the sequence data, leading to accurate regression predictions.

Candidate AMP sequence selection for validation

Following the de novo sequence generation using the three approaches (sequence-based, MSA-based, and MSA-conditional), all the resulting peptide sequences were screened using the XGBoost-based discriminator to identify those predicted to be AMPs. The AMP-classified sequences were then evaluated using the scorers specific to E. coli and S. aureus, respectively. From the top 100 candidates for each target, 20 sequences were randomly selected, giving 40 sequences in total for wet-lab validation.
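The final selection step can be sketched as a rank-then-sample procedure. The sketch assumes that the input has already passed the discriminator and that candidates are ranked by predicted log MIC with lower values being better; the ranking direction, the function name, and the fixed random seed are assumptions rather than details given in the text.

```python
import random


def select_candidates(scored, top_k=100, n_pick=20, seed=0):
    """Randomly pick n_pick sequences from the top_k best-scoring ones.

    `scored` is a list of (sequence, predicted log MIC) pairs for sequences
    already classified as AMPs; lower predicted log MIC ranks higher.
    """
    ranked = sorted(scored, key=lambda item: item[1])
    top = ranked[:top_k]
    rng = random.Random(seed)
    return rng.sample(top, min(n_pick, len(top)))


# Toy input: 500 hypothetical sequences with scores 0.0 ... 499.0.
toy_scored = [("SEQ%d" % i, float(i)) for i in range(500)]
toy_picked = select_candidates(toy_scored)
```

Running the procedure once with the E. coli scorer and once with the S. aureus scorer would yield the two sets of 20 candidates.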

Measurement of minimum inhibitory concentration (MIC)

Chemical peptide synthesis. All peptides used in the experiments were purchased from GenScript and synthesized using their PepPower™ platform, which combines liquid-phase and solid-phase peptide synthesis techniques, and each peptide was confirmed to have a purity greater than 80% (Data S3). High-Performance Liquid Chromatography (HPLC) and Mass Spectrometry (MS) were used to determine the concentration of chemically synthesized AMPs. HPLC analysis was performed using an Inertsil ODS-SP (4.6 × 250 mm) column with a flow rate of 1 mL/min. The mobile phase consisted of (A) 0.065% trifluoroacetic acid (TFA) in 100% water and (B) 0.05% TFA in 100% acetonitrile. A gradient elution program was employed as follows: 0–25 min (5–65% B), 25–27 min (65–95% B), 27–35 min (5% B). The detection wavelength was set at 220 nm. The MS analysis was further performed using an electrospray ionization (ESI) interface on an Agilent 6200 time-of-flight (TOF) LC/MS system equipped with a UV detector. Finally, 38 of the 40 candidate sequences above were successfully chemically synthesized.

The MIC of the designed AMPs was determined using the broth microdilution method according to established protocols64. The bacterial strains used in this study included E. coli K88, S. aureus (ATCC 29213), E. coli (ATCC 25922), E. faecalis (ATCC 29212), S. aureus (ATCC 25923), and P. aeruginosa (ATCC 27853). The synthesized peptides were dissolved in 1% DMSO at a concentration of 4.5 mg/mL for storage (10×). For the MIC assay, the peptide solutions were subjected to two-fold serial dilutions across columns 1-10 of sterile 96-well plates, with columns 11 and 12 serving as controls containing Mueller-Hinton Broth (MHB). Each dilution (50 µL) was pipetted into the wells before the addition of bacterial suspension. The plates were kept covered when not in use to prevent contamination.

Bacterial cultures were grown in MHB with shaking at 37 °C overnight, then diluted to 1.0 × 10⁶ CFU/mL with MHB. A 50 µL aliquot of each bacterial suspension was added to columns 1–11 of the 96-well plates, which contained the peptide solutions, while column 12 contained only MHB. The plates were incubated at 37 °C for 16–20 h. The MIC was defined as the lowest AMP concentration that prevented visible growth after incubation. All assays were performed in triplicate to ensure statistical reliability.

Assessment of cytotoxicity

A cytotoxicity assay was conducted on IEC-6 intestinal epithelial cells (ATCC, #CRL-1592™) using the CellTiter 96® AQueous One Solution Cell Proliferation Assay (Promega, #G3580). Cells were cultured in media containing DMEM high glucose (BBI, #E600003-0500), 10% fetal bovine serum (FBS, Pricella, #164210-50), and 10% antibiotic mix for cell culture (Solarbio, #P1400). Once the cells reached a surface coverage density of 50-80%, they were harvested, diluted to a density of 5×10³ cells per 100 µL, and seeded into 96-well plates for 24 h.

Following this incubation, the cells were exposed to a series of peptide concentrations, prepared by two-fold serial dilutions starting from 1125 μg/mL. A 4 µL aliquot of each dilution was added to the wells, and the plates were incubated at 37 °C with 5% CO2 for another 24 h. Subsequently, 20 µL of CellTiter 96® AQueous One Solution was added to each well, and after a 90-min incubation at 37 °C, absorbance was measured at 490 nm, with corrections made using wells containing only medium. CC50 values, representing the concentration of each AMP that kills 50% of the cells, were determined by fitting a 4-parameter logistic model (4PL) using Python’s scipy library.
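A 4-parameter logistic curve and its inversion at a given response level can be sketched as below. The parameterization (a: response at zero concentration, d: response at saturating concentration, c: midpoint concentration, b: slope) is one common convention and is an assumption, as the exact form fitted is not stated. In practice, the four parameters would be estimated from the dose-response data, for instance with `scipy.optimize.curve_fit`; the sketch covers only the model itself.

```python
def four_pl(x, a, b, c, d):
    """4-parameter logistic response at concentration x."""
    return d + (a - d) / (1.0 + (x / c) ** b)


def conc_at_response(y, a, b, c, d):
    """Concentration at which the fitted curve reaches response y.

    Valid for y strictly between a and d.
    """
    return c * ((a - d) / (y - d) - 1.0) ** (1.0 / b)


# With viability falling from a = 100% to d = 0%, the curve midpoint c
# coincides with the concentration giving 50% viability.
midpoint_response = four_pl(50.0, 100.0, 2.0, 50.0, 0.0)
```

When the fitted lower plateau d is not zero, the midpoint c and the concentration giving an absolute 50% response differ, which is why the explicit inversion is useful.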

Assessment of hemolytic activity

Rat erythrocytes (Sbjbio, #SBJ-RBC-RAT004) were washed with phosphate-buffered saline (PBS, Pricella, #PB180327) and resuspended to approximately 2% in PBS. Antimicrobial peptides (AMPs) were initially prepared at 1125 μg/mL and then serially diluted in 1.5 mL Eppendorf tubes. Each 20 µL AMP dilution was combined with 180 µL of the 2% erythrocyte suspension. The tubes were sealed and incubated at 37 °C for 1 h. Following centrifugation at 1000×g for 5 min at room temperature, 50 µL of the supernatant was transferred to a 96-well plate, where absorbance was measured at 540 nm. Triton X-100-treated erythrocytes served as a positive control, while PBS-treated erythrocytes acted as a negative control. The HC50 values, representing the concentration of AMP required to lyse 50% of the RBCs, were determined by fitting a 4-parameter logistic model (4PL) using Python’s scipy library.

Statistics and reproducibility

The MIC determination, cytotoxicity assays, and hemolytic activity tests were performed in three independent biological replicates. The source data are provided in the Supplementary Information or Supplementary Data. Statistical analyses were conducted using GraphPad Prism 9, and significance was determined using one-way ANOVA or other methods as indicated. Reproducibility was confirmed across independent experiments, and representative results are shown.

Mechanism of action analysis using propidium iodide

E. coli K88 in the exponential phase was harvested, centrifuged, resuspended in PBS, and adjusted to an OD of 1. A 10 µL aliquot of the bacterial suspension was mixed with 10 µL of AMPs at a final concentration of 4 × MIC and incubated at 37 °C for 1 h. Propidium iodide (PI, Solarbio, #C0080) at 20 μM was then added to both AMP-treated and untreated samples, which were incubated in the dark at 37 °C for 30 min. Fluorescence was recorded with excitation at 535 nm and emission at 615 nm using a Tecan Infinite® Eplex plate reader. Additionally, 1 µL of the reaction mixture was applied to a slide and imaged using a Leica DM4B upright microscope equipped with a 100× semi-apochromatic objective.

Analysis of peptide sequence properties

The online BLASTp tool was used to search AMP sequences against the non-redundant protein sequence (nr) database (https://blast.ncbi.nlm.nih.gov). The following parameters were used: word size of 6, expect threshold of 10, PAM30 matrix, gap initiation penalty of 9 and gap extension penalty of 1, conditional compositional score matrix adjustment, and the low-complexity region filter. Hits were retained at an E value of at most 10. Physicochemical properties of peptide sequences were inferred using the Peptides package in R (https://CRAN.R-project.org/package=Peptides, v2.4.6). Peptide sequences were annotated using the Entrez module of the Biopython package (http://biopython.org). To assess the sequence similarity and diversity of the designed AMPs, we performed pairwise sequence alignment between the generated sequences and those reported in the literature. Sequence alignment was carried out using the pairwise2 module of the Biopython package, which implements global and local pairwise alignment algorithms. Alignments were scored with the BLOSUM62 substitution matrix.
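The pairwise alignment step can be sketched with Biopython as follows. This is a minimal illustration, not the original analysis script: the gap penalties (-10 to open, -0.5 to extend), the choice of global alignment, the percent-identity summary, and the example sequences are all assumptions. Recent Biopython releases mark pairwise2 as deprecated in favor of Bio.Align.PairwiseAligner, so importing it may emit a deprecation warning.

```python
from Bio import pairwise2  # may emit a deprecation warning in recent Biopython
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")


def align_pair(seq_a, seq_b, gap_open=-10.0, gap_extend=-0.5):
    """Globally align two peptides with BLOSUM62 and affine gap penalties.

    The gap penalties are assumed values; the text does not report them.
    Returns the alignment score and percent identity over aligned columns.
    """
    best = pairwise2.align.globalds(
        seq_a, seq_b, BLOSUM62, gap_open, gap_extend, one_alignment_only=True
    )[0]
    columns = list(zip(best.seqA, best.seqB))
    # Gap columns never count as matches because only one side holds "-".
    matches = sum(1 for a, b in columns if a == b)
    return {"score": best.score, "identity": 100.0 * matches / len(columns)}


# Hypothetical sequences used only to exercise the function.
generated = "KKLLKKLLKKLL"
reported = "KKLFKKLLKKL"
result = align_pair(generated, reported)
print(f"score = {result['score']:.1f}, identity = {result['identity']:.1f}%")
```

For local alignment, `pairwise2.align.localds` takes the same arguments; looping `align_pair` over every generated and reported sequence pair would give a similarity matrix from which diversity can be summarized.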

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.