Introduction

Cancer is one of the deadliest diseases and a significant public health challenge, affecting millions of individuals worldwide1,2. In 2023, the WHO reported 20 million new cancer cases and 10 million related deaths, emphasizing the need for effective prevention, diagnosis, and treatment3. Projections suggest that by 20504 there could be 35.3 million cancer diagnoses and 20 million fatalities, a 76.6% rise in incidence and nearly a doubling of mortality relative to 2022. This projected increase underscores the urgent need for comprehensive public health policies, strong early detection initiatives, and innovative therapeutic approaches5,6. Cancer arises when abnormal cells multiply uncontrollably, often due to genetic defects7,8. Some mutations are inherited, but most are induced by lifestyle and environmental factors9. Remarkably, tobacco smoking alone still accounts for 22% of cancer deaths10,11. Other risk factors include obesity, poor diet, sedentary lifestyles, excessive alcohol consumption, and infection with carcinogenic microbes12,13. Addressing these modifiable drivers is essential to improving patient outcomes and reducing the global cancer burden.

Traditional cancer treatments such as surgery, chemotherapy, and radiation have played a crucial role in fighting cancer14,15,16, but they have significant drawbacks: they can harm normal cells and cause serious side effects such as infections and immunosuppression17,18. Chemotherapy and radiation, in particular, damage healthy cells, producing side effects such as fatigue and nausea19,20,21. Moreover, treatments can lose their effectiveness over time as tumors develop resistance. Despite improved therapies, millions of people still die from cancer every year22,23,24. This high death rate indicates that new and more effective treatments are needed to complement existing ones.

In this context, anticancer peptides (ACPs)25 offer a promising alternative for cancer treatment. ACPs, short sequences of roughly 10 to 60 amino acids derived from the innate immune system, can target neoplastic cells and induce apoptosis while sparing non-malignant cells. This targeted action reduces the side effects and cytotoxicity associated with standard cancer treatments. ACPs also induce resistance in cancer cells less readily than typical pharmacological treatments26,27. They act by triggering apoptosis, damaging the membranes of cancer cells, preventing angiogenesis, and modulating immunological responses28. ACPs are therefore valuable in the battle against cancer because they offer hope for more patient-friendly and efficient treatments29,30.

Over the past decade, significant advancements have been made in the design and evaluation of peptide-based cancer therapies31,32,33. Discovering new anticancer peptides is crucial; unfortunately, these peptides are rarely employed in medicine because of strict testing and approval requirements34. Laboratory testing of new ACPs is costly, time-consuming, and resource-intensive35. Bioinformatics can accelerate discovery here, and advanced DL and ML methods now largely supplement traditional experimental approaches. For example, Ruifen Cao et al.36 developed a dual-channel deep neural network to efficiently predict anticancer peptides, whereas ACPs such as brevinin were originally discovered through costly experimental methods. Tyagi et al.37 introduced CancerPPD, which listed 3,491 validated ACPs, and CancerPPD 2.038 expanded this to over 6,500. Turning to traditional ML, Tyagi et al.39 designed AntiCP, a support vector machine (SVM)-based model for predicting novel anticancer peptides that achieved around 91% accuracy and included a web server for mutation prediction and virtual screening. Chen et al. produced iACP40, which used g-gap dipeptide encoding with an SVM and surpassed previous tools in both accuracy and stability; such web servers made it easy to quickly screen candidate peptides. AntiCP 2.041 followed, using dipeptide and amino acid composition to refine design rules; its best model, an Extra-Trees classifier on dipeptide features, achieved an AUROC of 0.97 on training data. Vijayakumar and Lakshmi42 created ACPP, a prediction method that uses compositional, centroidal, and distributional metrics of amino acids to forecast ACPs more accurately. Park et al.43 released MLACP 2.0, which uses a CNN-based meta-classifier that combines scores from 67 baseline models built on various peptide encodings and surpasses earlier machine learning tools. Other meta and pipeline methods, such as mACPpred (2019)44 and its updated version mACPpred 2.0 (2024)45, also combined various classifiers and features; the latest version outperformed not only its predecessor but also other leading tools by introducing 1D convolutional neural networks (1D-CNNs) and protein language embeddings within a layered DL framework.

Between 2019 and 2024, DL models, particularly CNNs, emerged as crucial tools for ACP prediction, learning features from sequences in an automated manner. The ACP-MHCNN model, developed by Sajid Ahmed et al.46, is a multi-headed convolutional neural network that incorporates BPF, physicochemical, and evolutionary inputs, yielding an accuracy improvement of around 6% over previous models. In a similar vein, Chen et al.47 proposed xDeep-AcPEP in 2021, which utilized a CNN with multitask learning to predict continuous potency values (IC50, EC50, etc.) across six cancer cell lines; the models attained a Pearson correlation of \(\rho \approx 0.81\) and offered residue-level interpretability through feature-importance mapping. Han et al.48 developed ACPred-BMF, which applied bidirectional LSTM layers enhanced with attention mechanisms to concatenated amino-acid property encodings, achieving state-of-the-art performance. Liu et al.49 developed an ensemble of deep models named AntiMF, utilizing “multi-view” CNNs to collect complementary features and thereby surpassing single-network models. Some studies target data scarcity through augmentation: Bhattarai et al.50 enlarged the training set with Gaussian oversampling and applied AdaBoost, proposing ACP-ADA in 2022, which improved accuracy and MCC by approximately 5–9% over prior ML methods. Xiaofang et al. proposed pretrained transformer-based methods such as ACP-DRL51, which fine-tunes a BERT-like language model with Bi-LSTM classifiers; this deep representation learning removes the need for hand-crafted features and performs on par with or better than existing predictors, with pre-training on about 1.5 million in-domain unlabeled peptides. Likewise, Yue et al. integrated wavelet denoising, a cascade deep forest classifier, and embeddings from pretrained protein language models (ProtT5, ESM-2) to build PLMACPred52, employed SHAP analysis to identify important motifs, and performed significantly better than previous models. Furthermore, Yuan et al.53 developed a deep CNN and a large-scale peptide mining strategy, experimentally validating 37 peptides and showing that one (tPep14) was effective in mice without detectable toxicity. In summary, the discipline has evolved from manual, experiment-driven detection toward scalable, data-driven ACP identification that leverages ML and DL techniques, a progression that highlights the importance of ensemble learning, transfer learning, and experimental validation for real-world applicability54,55,56. A recent article indicates that mACPpred 2.045 achieved an accuracy of 85.7% on a standard independent dataset. Our goal, therefore, is to present a new model that surpasses that work and other existing ACP prediction studies across all performance indicators.

Recent advances in deep learning and biological sequence analysis have significantly improved the prediction of ACPs. However, many existing models rely on single encoding strategies, which limits their ability to capture both contextual and structural information from peptide sequences. To address these limitations, we present a novel predictive framework built around an Attention-based BiLSTM model with optimized weighted features. The model accurately predicts anticancer peptides by integrating multi-scale feature representations, including One-hot encoding, fastText, and GloVe. A key innovation of our approach is the identification of the best feature extractors and the creation of a hybrid feature set. Additionally, we utilize a Logistic Regression-guided Differential Evolution (DE) algorithm to dynamically optimize feature weights before model training. This evolutionary weighting mechanism enables the model to automatically prioritize the most discriminative dimensions within a high-dimensional fused feature space, rather than treating all features uniformly. This study is the first to integrate DE-guided feature optimization with attention-enabled BiLSTM networks for ACP prediction, showing superior performance and interpretability compared to current leading models. We then carefully selected the optimal hybrid feature set and fed it into our proposed AttBiLSTM_DE model to accurately predict ACPs. Finally, we evaluated our model using performance metrics that demonstrate its robustness and efficiency. The diagram in Fig. 1 illustrates the workflow of our complete model architecture.

Fig. 1
figure 1

A schematic diagram that illustrates our proposed framework. Each sub-block signifies the sequential steps undertaken throughout the entire pipeline. The left sub-blocks represent the processes of dataset preparation and feature extraction, while the right sub-blocks correspond to the selection and construction of the best classifier.

Materials and methods

Dataset description

The dataset primarily used in this research was initially created and released by the authors of the paper45, who gathered it from 11 recognized anticancer peptide (ACP) prediction methods: ACP-DL, ACP-MLC57, ACPred, AntiCP 2.0, CpACpP, MLACP, MLACP 2.0, mACPpred, ACPred-Fuse58, ACPred-FL, and AMPfun58. We used this dataset and conducted preprocessing and experiments to ensure data quality and accurate predictions. Here, the ACPs are labeled as “positive” using ‘1’ and the non-ACPs are labeled as “negative” using ‘0’. Initially, sequences containing non-standard or ambiguous amino acids (Z, J, O, U, B, X) were excluded. Redundant sequences and those with extreme lengths (fewer than 5 or more than 50 amino acids) were also removed. To reduce sequence similarity and avoid bias in performance evaluation, CD-HIT59 clustering was applied with an 85% similarity threshold. This resulted in a refined dataset comprising 1,176 ACPs and 4,001 non-ACPs, from which 1,176 non-ACPs were randomly chosen for the training set to ensure balanced classes. The dataset used for the independent assessment of the model includes 610 ACPs and 2,760 non-ACPs. Further details about the training and independent test datasets are available in the article by Vinoth et al.45. To enhance our preprocessing efforts, we implemented sequence encoding techniques on the primary dataset, converting the raw data into a more structured format that facilitates analysis and interpretation of the underlying patterns. More detailed information about our preprocessed datasets can be found in Figs. 1 and 2 of the supplementary file.
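As a concrete illustration of the filtering rules above, the following is a minimal sketch, assuming the peptides are already loaded as plain strings; the toy sequences are placeholders, and the CD-HIT clustering step itself is run separately with the external CD-HIT tool.

```python
# Minimal sketch of the sequence-level filters described above; assumes peptides
# are already loaded as plain strings. CD-HIT (85% identity) is run externally.
AMBIGUOUS = set("ZJOUBX")          # non-standard / ambiguous residues to exclude

def keep_sequence(seq: str, min_len: int = 5, max_len: int = 50) -> bool:
    seq = seq.upper()
    if not (min_len <= len(seq) <= max_len):
        return False               # drop extreme lengths
    if set(seq) & AMBIGUOUS:
        return False               # drop ambiguous residues
    return True

def filter_dataset(sequences):
    unique = list(dict.fromkeys(s.upper() for s in sequences))   # drop exact duplicates
    return [s for s in unique if keep_sequence(s)]

# Example with toy peptides: only the first sequence survives the filters.
print(filter_dataset(["FLPLLAGLAANFLPKIFCKITRK", "GLFDX", "KW",
                      "FLPLLAGLAANFLPKIFCKITRK"]))
```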

Sequence encoding

In this study, we applied four widely used encoding methods60,61—One-Hot Encoding, Word2Vec, FastText, and GloVe—to convert peptide sequences into vector representations. Furthermore, we employed k-mer-based segmentation methodologies in conjunction with these techniques, partitioning sequences into overlapping k-length subsequences (k-mers) to effectively capture local patterns. For instance, a protein sequence such as “ARNDC” can be split into overlapping segments according to the k-mer length. When using 3-mers, it generates: “ARN,” “RND,” and “NDC.” Likewise, with 2-mers, the sequence is transformed into: “AR,” “RN,” “ND,” and “DC.” We generated k-mer embeddings for 1-mer to 6-mer variants using four encoding methods to evaluate their effectiveness. This multi-scale segmentation enables models to capture both short and long contextual patterns in peptide sequences, enhancing the feature space. The input feature vector for our proposed model was created by combining various representations generated from One-Hot, fastText, and GloVe encodings. Specifically, the One-Hot encoding with a k-mer size of 3 resulted in a 15,942-dimensional feature vector for each sequence, considering a maximum sequence length of 1,176 and 20 amino acids. Additionally, both the fastText and GloVe embeddings contributed an extra 512-dimensional representation, which was obtained by averaging over 3-mer windows. Consequently, the initial fused feature vector contained a total of 16,966 dimensions.
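The overlapping k-mer segmentation described above can be written in a few lines of Python; this is a minimal sketch rather than the exact implementation used in the study.

```python
def kmers(sequence: str, k: int):
    """Return the overlapping k-mers of a peptide sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("ARNDC", 3))   # ['ARN', 'RND', 'NDC']
print(kmers("ARNDC", 2))   # ['AR', 'RN', 'ND', 'DC']
```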

One hot encoding

One-hot encoding62 is a crucial technique used to convert categorical variables into numerical vectors that ML algorithms can easily interpret. In the context of protein sequences, each of the 20 standard amino acids is represented by a unique binary vector of fixed length. In this representation, a single position in the vector is set to “1” to indicate the presence of a specific amino acid, while all other positions are filled with “0.” For instance, Alanine (“A”) might be encoded as [1, 0, 0, ..., 0], whereas Arginine (“R”) would be [0, 1, 0, ..., 0]. When applied to an entire protein sequence of length L, this process generates a two-dimensional binary matrix. The one-hot encoded matrix for a protein sequence of length L can be represented as:

$$\begin{aligned} \textbf{X} \in \{0, 1\}^{L \times 20} \end{aligned}$$
(1)

where each row corresponds to one amino acid and preserves the sequential order of residues.

Mathematically, for each amino acid \(a_i\) at position i in the sequence, the one-hot vector \(\textbf{v}_i \in \{0,1\}^{20}\) is defined as:

$$\begin{aligned} \textbf{v}_i[j] = {\left\{ \begin{array}{ll} 1, & \text {if } j = \text {index}(a_i), \\ 0, & \text {otherwise} \end{array}\right. } \quad \text {for } j = 1, 2, \dots , 20 \end{aligned}$$
(2)

This ordered and sparse method of data representation offers a clear and unbiased way to encode information about amino acids. While one-hot encoding simplifies data by avoiding redundancy, it can create large, sparse matrices with long sequences. To address this, it is often combined with embedding methods, which create dense vectors that capture meaningful relationships.
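For concreteness, a minimal sketch of Eqs. (1) and (2) is shown below; the ordering of the 20-letter amino acid alphabet is an assumption, since any fixed ordering works.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"                 # assumed fixed ordering of the 20 residues
AA_INDEX = {a: i for i, a in enumerate(AA)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode a peptide as an L x 20 binary matrix (Eq. 1), one row per residue."""
    X = np.zeros((len(sequence), 20), dtype=np.int8)
    for pos, residue in enumerate(sequence):
        X[pos, AA_INDEX[residue]] = 1       # Eq. 2: a single 1 at the residue's index
    return X

print(one_hot("AR").shape)   # (2, 20); row 0 marks Alanine, row 1 marks Arginine
```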

Word2Vec

Word2Vec63, developed by a team at Google led by Tomas Mikolov, trains word embeddings using the skip-gram and continuous bag-of-words (CBOW) architectures. In bioinformatics64,65,66, these techniques have been adapted to analyze protein sequences by interpreting them as sentences composed of “words,” where each word corresponds to a k-mer subsequence of amino acids. The similarity between two words is measured by cosine similarity: the smaller the angle between their vectors, the more alike their meanings. Of the two Word2Vec variants63,67 (CBOW and skip-gram), we employed the CBOW method (Fig. 2) to convert peptide sequences into a machine-readable format.

Fig. 2
figure 2

Architecture of the continuous bag of words (CBOW) method of Word2Vec embedding.

The CBOW model predicts a target word, like a k-mer or amino acid token, using a context window of surrounding words, which is useful in bioinformatics due to the contextual relationships found in peptides, similar to natural language. The CBOW model operates by maximizing the probability of accurately predicting a central word \(w_t\) based on the context of surrounding words \(\{ w_{t-c}, \ldots , w_{t-1}, w_{t+1}, \ldots , w_{t+c} \}\). The formal objective function associated with this model is defined as follows:

$$\begin{aligned} CB = \max \prod _{t=1}^{T} P\left( w_t \,\big |\, w_{t-c}, \ldots , w_{t-1}, w_{t+1}, \ldots , w_{t+c} \right) \end{aligned}$$
(3)

where T represents the overall count of words (or k-mers) in the sequence collection and c represents the size of the context window. To facilitate training, the softmax function is employed to compute the conditional probability:

$$\begin{aligned} P(\omega _t \mid \text {context}) = \frac{\exp \left( \textbf{v}_{\omega _t}^T \cdot \textbf{h} \right) }{\sum _{\omega =1}^{W} \exp \left( \textbf{v}_{\omega }^T \cdot \textbf{h} \right) } \end{aligned}$$
(4)

where \(\textbf{v}_{\omega _t}\) is the output vector corresponding to the target word \(w_t\), encapsulating its semantic meaning within the context; \(\textbf{h}\) denotes the average of the embeddings of the surrounding context words; and W is the total number of distinct words in the vocabulary.
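A minimal sketch of CBOW training on 3-mer “sentences” is shown below, assuming the gensim library; the vector size, window, and toy sequences are illustrative rather than the exact settings used in this study.

```python
# Minimal sketch of CBOW training on 3-mer "sentences" with gensim;
# vector_size and window are illustrative values, not the study's exact settings.
from gensim.models import Word2Vec

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

corpus = [kmers(seq, 3) for seq in
          ["FLPLLAGLAANFLPKIFCKITRK", "GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV"]]

model = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1, sg=0)  # sg=0 -> CBOW
vec = model.wv["FLP"]                                  # embedding of one 3-mer
print(vec.shape, model.wv.similarity("FLP", "LPL"))    # cosine similarity of two 3-mers
```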

GloVe

The GloVe68 approach to word embedding in natural language processing was first introduced by Pennington and colleagues at Stanford. It uses a co-occurrence matrix69 to derive the semantic relationships between words; this matrix tracks how frequently two words (here, k-mer fragments of amino acid sequences) co-occur within a specific context. Let \(X_{ij}\) be the count of how often word j appears in the context of word i. The model aims to find vector representations \(\textbf{w}_i\) and \(\tilde{\textbf{w}}_j\) such that their dot product estimates the logarithm of their co-occurrence frequency:

$$\begin{aligned} J = \sum _{i,j = 1}^{V} f(X_{ij}) \left( \textbf{w}_i^T \tilde{\textbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right) ^2 \end{aligned}$$
(5)

where \(\textbf{w}_i\) and \(\tilde{\textbf{w}}_j\) are the word and context-word vectors, \(b_i\) and \(\tilde{b}_j\) are bias terms, V is the vocabulary size, and \(f(X_{ij})\) is a weighting function that down-weights very frequent co-occurrences, generally defined as in Eq. (6):

$$\begin{aligned} f(x) = {\left\{ \begin{array}{ll} \left( \dfrac{x}{x_{\text {max}}} \right) ^{\alpha } & \text {if } x < x_{\text {max}} \\ 1 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

For instance, k-mer sequences with similar contexts (such as “LKR” and “LKS”) are mapped to vectors that lie close together in the multidimensional space. In our study, we applied GloVe to protein sequences broken down into overlapping k-mers (from 1-mer to 6-mer). These k-mers were treated as “words,” and the co-occurrence matrix was built from the dataset of ACPs and non-ACPs. The resulting embeddings effectively captured the relevant patterns and improved the performance of our deep learning classifier.
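The following sketch shows how such a k-mer co-occurrence matrix can be built; the window size, distance weighting, and toy sequences are illustrative assumptions, and the GloVe objective in Eq. (5) would then be fit on these counts.

```python
# Minimal sketch of building the k-mer co-occurrence counts that GloVe is trained on;
# the window size and the toy sequences are illustrative assumptions.
from collections import defaultdict

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def cooccurrence(sequences, k=3, window=5):
    counts = defaultdict(float)
    for seq in sequences:
        toks = kmers(seq, k)
        for i, wi in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    counts[(wi, toks[j])] += 1.0 / abs(i - j)   # distance-weighted, as in GloVe
    return counts

X = cooccurrence(["FLPLLAGLAANFLPKIFCKITRK", "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"])
print(X[("FLP", "LPL")])    # co-occurrence weight of two neighbouring 3-mers
```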

FastText

FastText, an extension of the Word2Vec model developed by Facebook AI Research70, generates embeddings for sub-word units (character n-grams), which allows better handling of morphologically rich and rare sequences and provides a more nuanced representation of peptide characteristics. This approach enhances the understanding of biological sequences and helps identify patterns in rare or new k-mers. To clarify the concept, consider the sequence “ATCAG”: with boundary markers it can be represented through character 3-grams as <AT, ATC, TCA, CAG, AG>, alongside the full token <ATCAG> itself. The embedding for the entire word (or k-mer) \(\omega\) is calculated by summing the embeddings of its individual sub-words:

$$\begin{aligned} \vec {\nu }_{\omega } = \sum _{g \in G_{\omega }} \textbf{z}_g \end{aligned}$$
(7)

where \(G_{\omega }\) is the set of character n-grams composing the word \(\omega\) and \(\textbf{z}_g\) is the embedding vector of each n-gram g. This method improves generalization to altered sequences, which is essential in anticancer peptide research because minor amino acid changes can have a large impact. Hence, we used fastText on k-mer-encoded protein sequences to capture local amino acid patterns and learn effective embeddings even for rare fragments.
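A minimal sketch using gensim's FastText implementation is shown below; setting min_n = max_n = 3 mirrors the character 3-gram example above, and all other parameters and sequences are illustrative.

```python
# Minimal sketch using gensim's FastText on 3-mer tokens; min_n = max_n = 3 mirrors the
# character 3-gram example above, and all other parameters are illustrative.
from gensim.models import FastText

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

corpus = [kmers(seq, 3) for seq in ["FLPLLAGLAANFLPKIFCKITRK", "GLFDIIKKIAESF"]]

model = FastText(sentences=corpus, vector_size=128, window=5, min_count=1, min_n=3, max_n=3)
print(model.wv["FLP"].shape)   # embedding built from the token and its character n-grams
print(model.wv["FLQ"].shape)   # an unseen k-mer still gets a vector via shared sub-words
```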

Feature optimization using differential evolution (DE)

Each sequence’s single-view features capture valuable distinguishing characteristics; however, accurate ML-based ACP prediction relies on integrating these features into a weighted representation71. Although the conventional practice of concatenating features (“+”) is straightforward, it does not necessarily guarantee strong discriminative performance and may dilute the relative significance of individual base representations. To address these challenges, we perform feature weight optimization with Differential Evolution (DE)72, a method well suited to global search, fast convergence, and complex non-linear optimization problems. In this study, we utilize three distinct feature-encoded representations for peptide sequences: GloVe, FastText, and One-Hot Encoded vectors. Each sample is represented by a feature vector from each of the three encodings. The final dataset combines the three matrices \(D_{\text {GloVe}} \in \mathbb {R}^{n \times m_1}\), \(D_{\text {FastText}} \in \mathbb {R}^{n \times m_2}\), and \(D_{\text {OneHotEncode}} \in \mathbb {R}^{n \times m_3}\). The combined dataset is:

$$\begin{aligned} X = \left[ D_{\text {GloVe}} \mid D_{\text {FastText}} \mid D_{\text {OneHotEncode}} \right] \in \mathbb {R}^{n \times m} \end{aligned}$$
(8)

Where,

$$\begin{aligned} m = m_1 + m_2 + m_3 \quad \text {and} \quad y \in \{0, 1\}^n \quad \text {(target labels)} \end{aligned}$$
(9)

Then the features were normalized using Min-Max scaling, which ensures equal contribution of all features during optimization:

$$\begin{aligned} \mathbf {x'} = \frac{\textbf{x} - \mathbf {x_{\text {min}}}}{\mathbf {x_{\text {max}}} - \mathbf {x_{\text {min}}}}, \quad \forall \textbf{x} \in X \end{aligned}$$
(10)

To enhance classification accuracy, we applied the DE algorithm, a population-based, derivative-free optimization method suited for continuous feature spaces. The steps are as follows:

  1.

    Initialization: DE initializes the feature weights using a population size (NP) of 70. Each individual is a weight vector \(w_i = [\omega _1, \omega _2, \dots , \omega _m]\) with values between 0 and 1, representing the importance of each feature. The algorithm uses a mutation factor (F) of 0.5, a crossover rate (CR) of 0.7, and runs for 100 generations.

  2.

    Fitness Evaluation: Each individual vector \(w_i\) in the DE population contains one weight per feature dimension. Applying these weights to the feature vectors in dataset \(X\) through element-wise multiplication yields \(X_w\), as described in Eq. (11). This enables selective amplification or attenuation of individual features based on their relevance to the prediction task.

    $$\begin{aligned} X_w = X \circ w_i \end{aligned}$$
    (11)

    Where \(\circ\) denotes element-wise multiplication. A Logistic Regression73 model is then trained on the weighted dataset, and its accuracy on a validation subset serves as the fitness value.

    $$\begin{aligned} f(x) = \sigma (w^T x + b) \end{aligned}$$
    (12)
    $$\begin{aligned} \sigma (z) = \frac{1}{1 + e^{-z}} \end{aligned}$$
    (13)
  3.

    DE Operations: (i) Mutation: a mutant vector is created from three distinct individuals

    $$\begin{aligned} v_i = w_{r_1} + F \cdot (w_{r_2} - w_{r_3}) \end{aligned}$$
    (14)

    (ii) Crossover: Create a trial vector \(u_i\) from mutant \(v_i\) and target \(w_i\)

    $$\begin{aligned} u_{i,j} = {\left\{ \begin{array}{ll} v_{i,j}, & \text {if } \text {rand}_j \le CR \text { or } j = j_{\text {rand}} \\ w_{i,j}, & \text {otherwise} \end{array}\right. } \end{aligned}$$
    (15)

    (iii) Selection:

    $$\begin{aligned} w_i^{\text {new}} = {\left\{ \begin{array}{ll} u_i, & \text {if } f(u_i) > f(w_i) \\ w_i, & \text {otherwise} \end{array}\right. } \end{aligned}$$
    (16)
  4.

    Termination: After 100 generations, the Differential Evolution (DE) algorithm returns the weight vector \(w^*\) that yields the best classification accuracy on the validation subset. This vector effectively represents how important each feature is to the model’s predictions. Detailed information is provided in Fig. 10 of the supplementary file.

Once the optimal vector \(w^*\) is obtained, it is applied to the fused feature matrix to create the final weighted input \(X_{w^*}\). This representation is used in training the AttBiLSTM model, enhancing discriminability and reducing redundancy.

After the DE procedure, the final composite feature is produced. We denote it H7 (One-hot encoding + GloVe + fastText); it is derived by combining the One-hot encoding, GloVe, and fastText features through weighted sequential fusion. A condensed sketch of this optimization procedure is given below.
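The sketch below condenses steps 1–4, assuming a simple hold-out validation split; for readability it uses a smaller population and fewer generations than the actual run (NP = 70, F = 0.5, CR = 0.7, 100 generations).

```python
# Condensed sketch of the DE-based feature weighting (steps 1-4 above). The real run
# used NP = 70, F = 0.5, CR = 0.7 and 100 generations; the split below is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def de_feature_weights(X, y, NP=20, F=0.5, CR=0.7, gens=20, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=seed)
    m = X.shape[1]
    pop = rng.random((NP, m))                           # weights in [0, 1]

    def fitness(w):
        clf = LogisticRegression(max_iter=1000).fit(X_tr * w, y_tr)   # Eqs. 11-13
        return clf.score(X_val * w, y_val)              # validation accuracy

    scores = np.array([fitness(w) for w in pop])
    for _ in range(gens):
        for i in range(NP):
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
            v = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), 0.0, 1.0)   # mutation (Eq. 14)
            cross = rng.random(m) <= CR
            cross[rng.integers(m)] = True
            u = np.where(cross, v, pop[i])               # crossover (Eq. 15)
            fu = fitness(u)
            if fu > scores[i]:                           # selection (Eq. 16)
                pop[i], scores[i] = u, fu
    return pop[np.argmax(scores)]                        # w*, later applied as X * w*
```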

Weight optimization of AttBiLSTM using differential evolution

The Differential Evolution (DE) optimization process was also adapted to directly optimize the weights of the proposed Attention-based Bidirectional LSTM (AttBiLSTM) model. In this approach, DE uses the validation accuracy of the AttBiLSTM as its fitness function, ensuring that the optimization aligns with the final prediction architecture. To start, the trained AttBiLSTM model’s weights were flattened and encoded as continuous vectors. A DE population was then initialized with random perturbations around the base model weights, and candidate solutions were evolved over 100 generations using standard DE mutation and crossover strategies. During fitness evaluation, each candidate solution was mapped back into the model’s weight tensors and a forward pass was conducted on the validation set to compute its accuracy. The candidate with the highest validation accuracy was selected to update the model parameters. Finally, the optimized weights obtained through DE were further fine-tuned using backpropagation for an additional 100 training epochs. A compact sketch of this weight-level DE loop follows.
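A compact sketch of this weight-level DE refinement is given below, assuming a compiled Keras model and a validation set; the perturbation scale, population size, and generation count are illustrative, not the study's exact settings.

```python
# Compact sketch of DE over the flattened AttBiLSTM weights; population size, noise
# scale and generation count are illustrative, and `model` is a compiled Keras model.
import numpy as np

def flatten(weights):
    return np.concatenate([w.ravel() for w in weights])

def unflatten(vector, template):
    out, i = [], 0
    for w in template:
        out.append(vector[i:i + w.size].reshape(w.shape))
        i += w.size
    return out

def de_refine_weights(model, X_val, y_val, NP=10, F=0.5, CR=0.7, gens=10, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    base = model.get_weights()
    flat = flatten(base)
    pop = flat + sigma * rng.standard_normal((NP, flat.size))   # perturb around base weights

    def fitness(vec):
        model.set_weights(unflatten(vec, base))
        _, acc = model.evaluate(X_val, y_val, verbose=0)         # validation accuracy
        return acc

    scores = np.array([fitness(p) for p in pop])
    for _ in range(gens):
        for i in range(NP):
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
            v = pop[r1] + F * (pop[r2] - pop[r3])                # mutation
            mask = rng.random(flat.size) <= CR
            u = np.where(mask, v, pop[i])                        # crossover
            fu = fitness(u)
            if fu > scores[i]:                                   # selection
                pop[i], scores[i] = u, fu
    model.set_weights(unflatten(pop[np.argmax(scores)], base))   # keep the best candidate
    return model
```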

Proposed model architecture

We present a novel deep learning architecture that integrates a BiLSTM network with an enhanced attention mechanism, aiming to improve the extraction of insights from anticancer peptides. The fundamental goal of this architecture is to generate dense, sequence-aware embeddings that prioritize biologically relevant regions of peptide sequences. The model is composed of an input embedding layer, followed by a BiLSTM layer that captures temporal dependencies. An attention mechanism then focuses on relevant parts of the peptide sequences, ending in a classification layer that interprets the learned representations for subsequent tasks. The architecture of the proposed AttBiLSTM_DE network is described in detail below.

The BiLSTM framework

BiLSTM74 captures important information by considering context from both directions. Many sequence-based problems benefit from recognizing the context surrounding a particular token. BiLSTM networks address this by integrating two LSTM networks: one that processes the sequence in the forward direction, reading from \(t = \tau _0\) to \(t = \tau _1\) (Eq. 17), and another that processes it in the backward direction, reading from \(t = \tau _1\) to \(t = \tau _0\) (Eq. 18). This approach allows for a more comprehensive understanding of the data. At each timestep, the outputs of the forward and backward LSTM units are combined (Eq. 19) to capture bidirectional dependencies. In this way, the model can exploit both what precedes (on the left) and what follows (on the right) each part of the sequence75. The mechanism, in a nutshell, is as follows:

$$\begin{aligned} \overrightarrow{h}_t = \textrm{LSTM}_{\text {forward}}(x_t, \overrightarrow{h}_{t-1}) \end{aligned}$$
(17)
$$\begin{aligned} \overleftarrow{h}_t = \textrm{LSTM}_{\text {backward}}(x_t, \overleftarrow{h}_{t+1}) \end{aligned}$$
(18)
$$\begin{aligned} \overleftrightarrow {h}_t = \textrm{Combine}(\overrightarrow{h}_t, \overleftarrow{h}_t) \end{aligned}$$
(19)

The BiLSTM network undergoes a forward pass to calculate activations and store states, followed by a backward pass using BPTT for gradient computation. Weights are updated to minimize the loss function, allowing the model to learn more comprehensive features from peptide or protein sequences. The overall workflow of this process is as follows:

A. Forward Pass:

  1.

    Reset all activations to zero prior to inputting data.

  2.

    Iterate from time \(\tau _0\) to \(\tau _1\) : (i) Feed each input \(x_t\) into both the forward and backward LSTM cells. (ii) Update the input gate, forget gate, cell state, and output gate by using the standard LSTM equations:

    $$\begin{aligned} i_t = \sigma (W_i x_t + U_i h_{t-1} + b_i) \end{aligned}$$
    (20)
    $$\begin{aligned} f_t = \sigma (W_f x_t + U_f h_{t-1} + b_f) \end{aligned}$$
    (21)
    $$\begin{aligned} o_t = \sigma (W_o x_t + U_o h_{t-1} + b_o) \end{aligned}$$
    (22)
    $$\begin{aligned} \tilde{c}_t = \tanh (W_c x_t + U_c h_{t-1} + b_c) \end{aligned}$$
    (23)
    $$\begin{aligned} c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \end{aligned}$$
    (24)
    $$\begin{aligned} h_t = o_t \odot \tanh (c_t) \end{aligned}$$
    (25)

    Where, \(i_t\) is the input gate, \(f_t\) is the Forget Gate, \(o_t\) is the Output Gate, \(\tilde{c}_t\) is the Cell Candidate, \(c_t\) is the Cell-State update, and \(h_t\) is the Hidden State.

  3.

    Store all hidden states \(h_t\) and output predictions at each timestep

B. Backward Pass (BPTT - Backpropagation Through Time):

  1.

    Reset all partial derivatives to zero.

  2.

    Starting at time \(\tau _1\), calculate the discrepancy between forecasted and actual labels using cross-entropy loss.

    $$\begin{aligned} L = - \sum _{t} y_t \log (\hat{y}_t) \end{aligned}$$
    (26)

    Where \(y_t\) is the true label, and \(\hat{y}_t\) is the softmax output.

  3.

    Backpropagate the error through the unfolded network using BPTT, calculating gradients for each gate and parameter.

C. Weight Update:

After the gradients for the whole sequence have been accumulated, gradient descent or another optimizer is used to update all weight matrices:

$$\begin{aligned} \theta \leftarrow \theta - \eta \cdot \frac{\partial L}{\partial \theta } \end{aligned}$$
(27)

where \(\theta\) are the model parameters, and \(\eta\) is the learning rate (Fig. 3).
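For concreteness, a single LSTM timestep implementing Eqs. (20)–(25) can be written directly in NumPy; the dimensions and random parameters below are purely illustrative.

```python
# Single LSTM timestep implementing Eqs. (20)-(25); shapes and random parameters
# are purely illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])        # input gate   (Eq. 20)
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])        # forget gate  (Eq. 21)
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])        # output gate  (Eq. 22)
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])  # candidate    (Eq. 23)
    c = f * c_prev + i * c_tilde                                   # cell state   (Eq. 24)
    h = o * np.tanh(c)                                             # hidden state (Eq. 25)
    return h, c

d, u = 8, 4                       # illustrative input and hidden sizes
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((u, d)) for k in ("Wi", "Wf", "Wo", "Wc")}
P.update({k: rng.standard_normal((u, u)) for k in ("Ui", "Uf", "Uo", "Uc")})
P.update({k: np.zeros(u) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(u), np.zeros(u), P)
print(h.shape, c.shape)           # (4,) (4,)
```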

Fig. 3
figure 3

The proposed AttBiLSTM_DE model architecture has three BiLSTM blocks with self-attention layers. Here, bN denotes a BatchNormalization layer and dr denotes a Dropout layer.

Attention mechanism

Traditional BiLSTM models can lose important information in longer sequences76,77,78. Attention mechanisms improve BiLSTM by enabling targeted focus, enhancing interpretability, and alleviating memory demands, which helps the model home in on pertinent regions while making predictions79. Integrating the attention mechanism with a BiLSTM model improves detection by dynamically weighting the importance of sequence elements. The BiLSTM captures contextual relationships from both preceding and succeeding positions, while the attention mechanism improves comprehension by emphasizing key biological signals, enhancing the model’s accuracy in recognizing anticancer peptides. The Scaled Dot-Product Attention80 employed in our model can be explained as follows: for a given query Q, key K, and value V, the attention mechanism calculates a weighted sum of the values, with the weights determined by the similarity between the query and the keys.

$$\begin{aligned} \textrm{Attention}(Q, K, V) = \textrm{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V \end{aligned}$$
(28)

where \(Q = K = V = X\), since we utilize self-attention within our model and all components originate from the identical input \(X \in \mathbb {R}^{T \times d}\); \(d_k\) is the dimensionality of the key vectors and T is the sequence length. Figure 4 illustrates the attention mechanism concept more clearly, and a minimal code sketch follows the figure.

Fig. 4
figure 4

An attention block with query Q and input X produces a weighted sum. K is the key, which influences the weights through its similarity to the query Q.
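A minimal TensorFlow sketch of the scaled dot-product self-attention in Eq. (28), with Q = K = V = X, is shown below; the tensor sizes are illustrative only.

```python
# Minimal TensorFlow sketch of Eq. (28) with Q = K = V = X; tensor sizes are illustrative.
import tensorflow as tf

def scaled_dot_product_self_attention(X):
    d_k = tf.cast(tf.shape(X)[-1], tf.float32)
    scores = tf.matmul(X, X, transpose_b=True) / tf.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)                    # attention weights
    return tf.matmul(weights, X)                                # weighted sum of the values

X = tf.random.normal((2, 1, 512))                    # (batch, timesteps, features)
print(scaled_dot_product_self_attention(X).shape)    # (2, 1, 512)
```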

Architecture of AttBiLSTM_DE

Our architecture integrates BiLSTM networks with self-attention mechanisms. The model includes three stacked BiLSTM layers, where self-attention is applied only after the first two BiLSTM layers to improve feature discriminability and contextual understanding. The architecture begins with an InputLayer of shape (None, 1, 16454), representing the hybrid feature vector extracted from the peptide sequences; after DE-based dimensionality selection, the number of features was reduced from 16,966 to 16,454. This input is fed into a high-capacity BiLSTM layer containing 512 units, which outputs a tensor of shape (None, 1, 512) and encompasses 34,224,128 trainable parameters. A Dropout layer with a rate of 0.3 mitigates overfitting without changing the output shape, followed by a Batch Normalization layer with 2,048 parameters to stabilize learning and enhance generalization. An Attention mechanism is then applied, producing an output of 512 features at 1 time step with no trainable parameters. The outputs of the batch-normalized BiLSTM and the attention mechanism are concatenated, resulting in a tensor with 1,024 features at 1 time step.

The concatenated tensor is passed through a second BiLSTM layer with 256 units, producing an output of shape (None, 1, 256) with 1,180,672 parameters. This layer is followed by a Dropout layer (rate 0.3), Batch Normalization (1,024 parameters), and an Attention mechanism with no trainable parameters. The batch-normalized BiLSTM output and the attention output are concatenated to produce a tensor with 512 features. The third BiLSTM layer reduces dimensionality to 128 features (None, 128) with 295,424 parameters, followed by a Dropout layer (0.3) and Batch Normalization (512 parameters). The output then passes through a series of Dense layers: the first Dense layer reduces it to 64 features with 8,256 parameters, followed by a Dropout layer (0.3); a subsequent Dense layer produces 32 features with 2,080 parameters, followed by Batch Normalization (128 parameters). Finally, a Dense output layer generates a single prediction with 33 parameters. The architecture contains a total of 36,777,473 parameters, of which 36,775,553 are trainable and 1,920 are non-trainable.

The model is compiled using the Adam optimizer81 with binary cross-entropy82 as the loss function, and accuracy is used as the main evaluation metric. ELU (Exponential Linear Unit)83 activation is applied in the fully connected layers for smoother and faster convergence, and Dropout is consistently applied after each BiLSTM and Dense layer to prevent overfitting. Training was performed using TensorFlow84 on Google Colab (free version). Despite the high parameter count, the model was trained with a batch size of 16 over 100 epochs, each epoch completing in approximately 140 seconds, which demonstrates the practicality and accessibility of our approach for researchers with limited computational resources. To reduce the risk of overfitting caused by the model's complexity and large number of parameters, we implemented several regularization strategies, including dropout rates of 0.3 after the final BiLSTM module and 0.4 after the first dense layer, as well as L2 weight regularization with a coefficient of 0.01 in the first dense layer. The dropout and L2 regularization settings were chosen to ensure convergence without overfitting; this was confirmed by the loss and accuracy curves for the training and validation sets, which show consistent convergence and effective complexity control, so early stopping was not necessary. A simplified Keras sketch of this layer stack is shown below.
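The following is a simplified Keras sketch of the layer stack, assuming TensorFlow; the per-direction LSTM unit counts, dropout placement, and sigmoid output are inferred from the reported shapes and parameter counts rather than taken from the original code.

```python
# Simplified sketch of the AttBiLSTM_DE stack; per-direction unit counts, dropout
# placement and the sigmoid output are inferred assumptions, not the original code.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

def build_attbilstm(input_dim=16454):
    inp = layers.Input(shape=(1, input_dim))            # DE-selected hybrid features

    # Block 1: 512-dim BiLSTM (256 units per direction) + self-attention
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inp)
    x = layers.Dropout(0.3)(x)
    x = layers.BatchNormalization()(x)
    att1 = layers.Attention()([x, x])                   # dot-product self-attention, no weights
    x = layers.Concatenate()([x, att1])                 # (None, 1, 1024)

    # Block 2: 256-dim BiLSTM (128 per direction) + self-attention
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.3)(x)
    x = layers.BatchNormalization()(x)
    att2 = layers.Attention()([x, x])
    x = layers.Concatenate()([x, att2])                 # (None, 1, 512)

    # Block 3: 128-dim BiLSTM (64 per direction), sequence dimension collapsed
    x = layers.Bidirectional(layers.LSTM(64))(x)        # (None, 128)
    x = layers.Dropout(0.3)(x)
    x = layers.BatchNormalization()(x)

    # Dense head with ELU activations and L2 on the first layer
    x = layers.Dense(64, activation="elu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Dense(32, activation="elu")(x)
    x = layers.BatchNormalization()(x)
    out = layers.Dense(1, activation="sigmoid")(x)      # binary ACP / non-ACP score

    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_attbilstm()
model.summary()   # layer shapes mirror the description above; exact counts may differ slightly
```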

Detailed model configuration, training, preprocessing, and architecture tables are provided in Fig. 13 and Tables 01 and 02 of the supplementary file.

Rationale for model complexity

The proposed AttBiLSTM_DE model reflects a careful approach to understanding the intricate relationships in peptide sequences. Its architectural complexity arises from the need to effectively capture both the sequential patterns and the contextual dependencies that are key to analyzing these biological structures. The model takes in a high-dimensional input space with 16,454 features, which integrates various representation methods like One-Hot, GloVe, and fastText. Each of these methods highlights different characteristics of peptide structures and their physicochemical properties. To manage this complexity and avoid issues like overfitting, we utilize a feature weighting strategy based on Differential Evolution (DE). This process helps to evaluate and assign importance to each feature, while also downplaying less useful or redundant information. The AttBiLSTM framework itself combines the strengths of bidirectional modeling with a detailed attention mechanism, allowing it to effectively learn the nuanced relationships between sequence patterns and their potential anticancer activities. While the model is ambitious, with over 36 million trainable parameters, we’ve implemented strict measures like dropout, batch normalization, and L2 regularization to ensure stable learning, even when working with a smaller dataset. This careful balance of complexity and power enables the model to perform effectively without falling into the trap of overfitting, a fact confirmed by thorough cross-validation and independent testing.

Results and discussion

Evaluation metrics

Evaluation metrics serve as essential parameters for assessing a model’s performance. Relying solely on accuracy or just a few metrics can lead to an incomplete understanding of a model’s effectiveness. To provide a thorough evaluation of our proposed classification model, we utilized a diverse set of metrics: Accuracy, Precision, Recall (Sensitivity), Specificity, F1-score, AUC, and MCC. By incorporating these various metrics, we can achieve a more balanced and comprehensive view of the model’s performance. The detailed description of the metrics is given as follows:

$$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(29)
$$\begin{aligned} \text {Precision} = \frac{TP}{TP + FP} \end{aligned}$$
(30)
$$\begin{aligned} \text {Recall} = \frac{TP}{TP + FN} \end{aligned}$$
(31)
$$\begin{aligned} \text {Specificity} = \frac{TN}{TN + FP} \end{aligned}$$
(32)
$$\begin{aligned} \text {F1} = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(33)
$$\begin{aligned} \text {MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$
(34)

AUC measures a model’s ability to distinguish between classes. It indicates how effectively it ranks positive instances higher than negative ones for accurate classifications. True Positives (TP) are instances where the model correctly identifies the positive class, while True Negatives (TN) are instances where the model accurately predicts the negative class. On the other hand, False Positives (FP) occur when the model mistakenly predicts a positive outcome despite the actual result being negative, and False Negatives (FN) arise when the model incorrectly identifies a negative outcome when the true result is positive.
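A short sketch of computing Eqs. (29)–(34) together with the AUC using scikit-learn is given below; the label and probability vectors are toy values standing in for real model outputs.

```python
# Sketch of computing Eqs. (29)-(34) plus AUC with scikit-learn;
# y_true and y_prob are toy values standing in for real model outputs.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   ", accuracy_score(y_true, y_pred))
print("Precision  ", precision_score(y_true, y_pred))
print("Recall     ", recall_score(y_true, y_pred))
print("Specificity", tn / (tn + fp))          # not provided directly by scikit-learn
print("F1         ", f1_score(y_true, y_pred))
print("MCC        ", matthews_corrcoef(y_true, y_pred))
print("AUC        ", roc_auc_score(y_true, y_prob))
```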

k-mer embedding encodes more accurate motif information

In order to identify the optimal k-mer length for our AttBiLSTM_DE model, we conducted a series of experiments examining k values from 1 to 6. This exploration involved four widely recognized feature encoding techniques: One-Hot Encoding, Word2Vec, GloVe, and fastText. For each unique combination of encoding method and k-mer length, we thoroughly assessed the model’s performance on the original dataset. We utilized a range of performance metrics to ensure a comprehensive evaluation. This method helped us grasp how various configurations impacted the model’s effectiveness.

Table 1 k-mer embedding encodes more accurate motif information.

The findings presented in Table 1 demonstrate that the One-Hot Encoding approach with a k-mer length of 3 yields the best performance metrics compared to all other encoding methods assessed. Specifically, it achieves an AUC score of 87.66%, indicating a strong ability to distinguish between classes, along with an accuracy of 83.29%, an MCC of 66.66%, and an F1-score of 81.85%. These metrics collectively underscore the robustness of One-Hot Encoding with 3-mers in capturing the discriminative patterns inherent in the motif data.

On the other hand, when the length of the k-mers exceeds 3, there is a significant reduction in performance for all encoding strategies considered in this research. The decline indicates that longer k-mers, though capable of representing complex patterns, may add excessive noise and lead to data sparsity. This can hinder the learning process and affect the model’s predictive accuracy. It emphasizes the need to optimize k-mer lengths to balance complexity and clarity in motif detection.

Among the different techniques for creating word embeddings, GloVe and Word2Vec using 3-mer (three-letter groups) perform reasonably well. For example, GloVe with 3-mer has an AUC score of 84.47% and an F1-score of 77.56%. Word2Vec with 3-mer does even better, with an AUC of 87.17% and an F1-score of 79.17%. However, they are still not as good as the One-Hot technique in important measurements like MCC and accuracy.

On the other hand, the 4-mer to 6-mer embeddings generally do not perform as well, especially with fastText and GloVe. FastText does well with 3-mers, achieving an F1-score of 78.28%; however, with longer k-mers (such as 6-mers) its performance drops sharply, with the F1-score falling to 44.68%. This suggests that fastText's sub-word strategy does not work as effectively for longer patterns in biological sequences. Table 04 in the supplementary file lists the best-performing k-mers.

Fig. 5
figure 5

ROC curves for the AttBiLSTM_DE model across 1- to 6-mers of the four feature extractors, plotting the trade-off between the true positive rate and the false positive rate. Here, (A) One-Hot Encoding, (B) Word2Vec, (C) GloVe, and (D) FastText.

Fig. 6
figure 6

Bar plots of the AttBiLSTM_DE model on 1- to 6-mers illustrating the performance metrics AUC, Accuracy, MCC, Sensitivity, Specificity, Precision, and F1-score. Here, (E) One-Hot Encoding, (F) Word2Vec, (G) GloVe, and (H) FastText.

The ROC curves for our AttBiLSTM_DE models, which use different k-mer lengths ranging from 1-mer to 6-mer, are illustrated in Fig. 5. The models utilize four distinct feature extraction methods: One-Hot encoding, Word2Vec, GloVe, and fastText. In each subfigure (Fig. 5A–D), we present the comparative ROC curves for these different embedding techniques, allowing for a clearer evaluation of their performance.

In a striking display of performance, the Word2Vec-based models (Fig. 5B) reached their peak at the 3-mer configuration, boasting an impressive AUC of 87.17%. However, as we ventured into 4-mers and longer sequences, there was a significant decline in effectiveness. This clearly underscores just how crucial the choice of k-mer length is for the success of this method. Likewise, the GloVe and fastText models (Fig. 5C,D) displayed moderate AUC values with shorter k-mers (1–3), whereas the use of longer k-mers (4–6) led to a decline in classification performance, indicating that excessive fragmentation of sequences obscures significant patterns.

Fig. 6E–H present bar plots that showcase the distribution of evaluation metrics for each feature extractor across six different k-mer configurations. The One-Hot encoding method (Figs. 5A, 6E) consistently exceeded the performance of the others, reaching a peak at 3-mer with an AUC of 87.66%, an F1-score of 81.85%, and an MCC of 66.66%. The Word2Vec method (Figs. 5B, 6F) reached its highest F1-score at 3-mer (79.17%) and displayed the greatest sensitivity at 1-mer (91.76%), although with lower specificity. Models based on GloVe (Figs. 5C, 6G) provided comparable performance at lower k-mer values but dropped sharply beyond 3-mer, with an AUC of only 59.21% at 4-mer. In contrast, fastText (Figs. 5D, 6H) produced reasonable outcomes at 3-mer (F1-score = 78.28%), but its performance significantly declined at larger k values, with both MCC and F1-score falling below 20% at 6-mer. Overall, the visual results in Figs. 5 and 6 clearly illustrate the dominance of the 3-mer configuration across all embedding techniques, with One-Hot encoding emerging as the most reliable and consistently high-performing representation for ACP classification using AttBiLSTM_DE.

Effects of different feature combinations

In this section of our study, we methodically evaluated the influence of various combinations of feature encoding techniques on the performance of the proposed AttBiLSTM_DE model for ACP identification. Our findings demonstrate a clear relationship between these encoding methods and the model’s effectiveness. We assessed various single-feature encoding methods, such as One-Hot, Word2Vec, GloVe, and fastText, as well as hybrid feature encodings that combine these techniques (for instance, One-Hot + Word2Vec, One-Hot + GloVe, etc.) to enhance the representation of peptide sequences.

Table 2 Performance comparison of single-view and hybrid feature extractors across various evaluation metrics.
Fig. 7
figure 7

Bar plot and heatmap of hybrid feature extractors with the performance metrics (AUC, ACC, MCC, Sensitivity, Specificity, Precision, and F1-score). Here, h1: One-Hot Encoding, h2: Word2Vec, h3: GloVe, and h4: FastText.

Table 2 (and Table 03 in the supplementary file) compares the performance of the various encoding methods. One-Hot Encoding achieves the best AUC (87.66%) and F1-score (81.85%). Word2Vec provides a strong F1-score of 79.17%. GloVe offers good precision at 85.21% and specificity at 88.52%, but it has lower sensitivity at 71.18%. On the other hand, fastText demonstrates high sensitivity at 85.88%, albeit with lower specificity at 68.85% and an AUC of 85.36%. We also explored two state-of-the-art pre-trained protein language models, ESM and ProtT5, alongside the traditional feature extractors. These models rely on transformer-based architectures and have been trained on extensive datasets of protein sequences, making them particularly effective for sequence-based tasks such as anticancer peptide prediction. Both ESM and ProtT5 showed moderate performance compared to the classic feature extractors; ProtT5 slightly outperformed ESM in terms of MCC and F1-score, but neither surpassed GloVe or One-Hot encoding. To boost the performance of our model, we experimented with hybrid feature sets that mix various encoding methods, which led to significant improvements. To identify the most informative and complementary combinations, we employed DE as an optimization strategy to create weighted feature sets. This method assigned optimal weights to each feature representation in the hybrid set, balancing their contributions and reducing redundancy and noise, and the resulting DE-guided combinations significantly enhanced the model's discriminative power, leading to improved ACP prediction performance.

The combination we identified as H7, which includes One-Hot encoding, GloVe, and fastText, turned out to be the standout performer. It achieved impressive results across nearly all metrics, with an AUC of 94.74%, an accuracy of 86.86%, an MCC of 74.25%, and an F1-score of 86.58%. H6, which combined One-Hot encoding with GloVe and Word2Vec, scored 84.79%, while H5, using One-Hot with fastText, scored 83.33%. These results underscore the benefits of integrating multiple semantic features.

The findings are visually depicted in Fig. 7, which features a bar chart (Fig. 7A) for metric-based comparisons and a heatmap (Fig. 7B) illustrating the performance distribution across the various feature encoders. The bar chart emphasizes the significant difference in performance between single encoders and hybrid encoders, while the heatmap facilitates a quick evaluation of which combinations excelled on particular metrics. After carefully analyzing the performance of all feature extraction strategies, we found that while some excelled in specific areas, they lacked overall balance. In contrast, H7 (One-Hot + GloVe + fastText) consistently produced reliable results across all metrics. Hence, we selected H7 as the final representation for our AttBiLSTM_DE model to ensure robust performance. More visualization results can be found in Figs. 03–09 of the supplementary file.

Effects of the selected features on the independent dataset

After assessing the four feature extractors, the pre-trained protein language models (ESM and ProtT5), and their combinations on the training dataset, we selected the hybrid feature set (One-Hot + GloVe + fastText) as our final representation. We then applied our AttBiLSTM_DE model to the independent dataset using each of the four single feature extractors at 3-mer, the setting that consistently yielded better results than the other k-mer lengths, as well as the chosen hybrid feature set.

Table 3 Performance on the independent dataset for single and the chosen hybrid feature extractors across various evaluation metrics.

Table 3 presents the results on the independent set, showing the effectiveness of the various feature encoding methods (One-Hot Encoding, Word2Vec, GloVe, fastText, and the combined One-Hot + GloVe + fastText approach) with our proposed AttBiLSTM_DE model.

Fig. 8
figure 8

ROC curves plotting the trade-off between the true positive rate and the false positive rate for the independent dataset on single and the chosen hybrid feature extractors. H7 here represents (One-Hot + GloVe + fastText).

Notably, GloVe emerged as the standout method among the individual encodings, achieving the highest results across most metrics, particularly excelling in MCC (77.85%), Sensitivity (85.71%), and F1-score (83.02%). One-Hot Encoding also demonstrated strong performance, particularly in Specificity (96.54%) and Accuracy (91.69%). Word2Vec and fastText achieved high Precision scores of 100.00% and 95.45%, respectively; however, both exhibited lower Sensitivity (54.55%), indicating greater efficacy in identifying true negatives than true positives. Both ESM and ProtT5 also performed well, achieving AUC values above 93%, but they still lagged behind the traditional methods such as GloVe and the hybrid approach, with lower Matthews Correlation Coefficient (MCC) and F1-scores.

Fig. 9
figure 9

Heat maps for the comparison of existing attention-based deep learning architectures. AttBiLSTM_DE (Attention-based BiLSTM with optimized weighted features) is the proposed model. (A) Heat map for the training dataset and (B) heat map for the independent dataset.

The combined encoding method (One-Hot + GloVe + fastText) proved to be the most effective overall, surpassing all individual techniques with exceptional metrics: an AUC of 98.48% (see the ROC curve in Fig. 8), an Accuracy of 95.85%, an MCC of 88.00%, and an F1-score of 90.54%. This finding highlights the significant performance gains achievable through the integration of complementary features. H7 also achieves a high specificity of 98.46% on the independent test set. High specificity indicates that the model effectively avoids false positives, which is biologically significant in ACP prediction: non-anticancer peptides are accurately identified, minimizing the risk of incorrectly classifying a non-ACP as an ACP. This level of precision is particularly important for downstream experimental validation, where false positives could lead to unnecessary laboratory testing.

Fig. 10
figure 10

Line plots and bar plots for the comparison of existing attention-based deep learning architectures. AttBiLSTM_DE (Attention-based BiLSTM with optimized weighted features) is the proposed model. (A) Line plot for the training dataset, (B) line plot for the independent dataset, (C) bar plot for the training dataset, and (D) bar plot for the independent dataset.

Reproducibility and robustness analysis using K-fold cross-validation

We conducted a reproducibility experiment using three different random seeds and five-fold cross-validation to test our ACP prediction model. The results show that the model performs consistently well regardless of how the data are partitioned or how the runs are initialized. As highlighted in Table 4, the model achieved an average accuracy of 94.07% with a variation of only ± 0.40, along with an AUC of 96.69% and an F1-score of 82.68%, both with low standard deviations across runs. These results suggest that the model is robust to random variation across folds, which adds to its reliability. In contrast, when the model was trained and evaluated using a single random split instead of cross-validation, it achieved an accuracy of 98.48%, an AUC of 95.85%, a precision of 88.00%, a recall of 87.01%, a specificity of 98.46%, an F1-score of 94.37%, and an MCC of 90.54%, highlighting how well the model discriminates between classes in that single evaluation. While these numbers reflect strong performance, the cross-validation results are equally important because they demonstrate the model's reliability and ability to generalize across different random setups. The consistent performance observed across test setups therefore shows that our model reliably distinguishes anticancer peptides from non-anticancer sequences under a variety of experimental conditions. A sketch of this repeated cross-validation protocol is given below.
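A minimal sketch of the protocol (three seeds, five stratified folds) is given below; build_attbilstm refers to the model constructor sketched earlier and is a placeholder here, and X, y denote the weighted hybrid feature matrix (shape n × 1 × features) and labels.

```python
# Sketch of the reproducibility protocol: three random seeds x five stratified folds.
# `build_attbilstm` is the constructor sketched earlier (a placeholder here); X has
# shape (n_samples, 1, n_features) and y holds the binary ACP labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

def repeated_cv(X, y, seeds=(0, 1, 2), n_splits=5, epochs=100, batch_size=16):
    acc, auc, f1 = [], [], []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr, va in skf.split(X, y):
            model = build_attbilstm(input_dim=X.shape[-1])        # fresh model per fold
            model.fit(X[tr], y[tr], epochs=epochs, batch_size=batch_size, verbose=0)
            prob = model.predict(X[va], verbose=0).ravel()
            pred = (prob >= 0.5).astype(int)
            acc.append(accuracy_score(y[va], pred))
            auc.append(roc_auc_score(y[va], prob))
            f1.append(f1_score(y[va], pred))
    return {name: (np.mean(vals), np.std(vals))
            for name, vals in [("accuracy", acc), ("auc", auc), ("f1", f1)]}
```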

Table 4 K-Fold Cross-Validation Results across 3 Random Seed Runs (5 Folds Each).

Comparison of existing attention-based models

In this section, we compare our proposed architecture with several existing attention-based models, such as AttXGBoost, AttDenseFusion, AttGRU, and AttMLP. Table 5 shows the performance of six models (AttXGBoost, AttDenseFusion, AttGRU, AttMLP, AttBiLSTM, and AttBiLSTM_DE) on both the training dataset and the independent test dataset. Here, AttBiLSTM_DE refers to the attention-based model that uses Differential Evolution to weight the fused hybrid feature set.

Table 5 Performance comparison of different attention-based models on Train and Independent datasets across various evaluation metrics.

In our evaluation, the AttBiLSTM_DE model demonstrated exceptional performance, attaining the highest scores in nearly all categories, with an AUC of 94.74%, an Accuracy of 86.86%, an MCC of 74.25%, and an F1-score of 86.58%. The plain AttBiLSTM also performed well and followed closely behind. In contrast, models such as AttXGBoost and AttMLP displayed lower sensitivity and MCC, indicating that they struggled more to identify positive cases accurately, although the overall results were promising. In the evaluation on the independent dataset, which assesses effectiveness on novel data, AttBiLSTM_DE again showed the most favourable performance, achieving an AUC of 98.48%, an accuracy of 95.85%, and an F1-score of 90.54%, demonstrating its reliability and strong generalization capability. Other models, notably AttMLP and AttDenseFusion, also performed commendably, but they did not reach the levels established by AttBiLSTM_DE. We visualized the models to better compare their behaviour on the training and independent datasets. The confusion matrices and per-class evaluation metrics of our proposed AttBiLSTM_DE model for both the training and independent datasets are visualized in Fig. 11 of the supplementary file.

Fig. 11 ROC curves plotting the true positive rate (TPR) against the false positive rate (FPR) for the comparison of the existing attention-based deep learning architectures. AttBiLSTM_DE (Attention-based BiLSTM with Optimized Weighted Feature) is the proposed model. (A) ROC curve for the training dataset, (B) ROC curve for the independent dataset.

The ROC curves in Fig. 11 show how well each model differentiates between the positive and negative classes. The AttBiLSTM_DE curve lies closest to the top-left corner, indicating the highest true positive rate at the lowest false positive rate and confirming its superior performance. Figure 10C,D displays bar plots of key performance metrics, such as accuracy and F1-score, for all models, showing that AttBiLSTM_DE consistently outperforms the others across most metrics on both datasets. The line plots (Fig. 10A,B) show that AttBiLSTM_DE also leads on metrics such as MCC and F1-score, suggesting that it is well balanced between precision and recall. The ROC curve for the training dataset indicates that AttBiLSTM_DE learns the data patterns effectively, achieving a high AUC, while the ROC curve for the independent dataset demonstrates that the model generalizes well to new data. The heatmap for the training dataset (Fig. 9A) visualizes the strong classification metrics of AttBiLSTM_DE, and the heatmap for the independent dataset (Fig. 9B) reaffirms its performance on unseen data, particularly for F1-score, precision, and MCC. Overall, these summary visualizations highlight AttBiLSTM_DE as the standout model among its peers.

Fig. 12 Model average ranking summarizing the performance of the existing models and our proposed model; the lower the value, the better the performance rank.

Fig. 13 Heatmap for the ablation study on the proposed model architecture; A0 to A8 denote the modified architectures with different layers and parameters. (A) Train Dataset and (B) Independent Dataset.

To visualize the relative performance of the models more clearly, we employed an average ranking method (Fig. 12) across the evaluation metrics AUC, accuracy, MCC, sensitivity, specificity, precision, and F1-score. Each model received an individual rank for every metric, and we then computed the average of these ranks across all metrics; a lower average rank indicates better overall performance, and our proposed model obtained the best (lowest) average rank. This approach offers a well-rounded comparison that accounts for all aspects of performance and is widely used in the machine learning literature85. We also evaluated the proposed approach by making predictions on additional sequences; those results are visualized in Fig. 12 and Tables 5 and 6 of the supplementary file.
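As a concrete illustration of this ranking scheme, the short sketch below ranks each model per metric (rank 1 = best) and then averages the ranks per model. The model names and the numbers used here are placeholders, not the values reported in Table 5.

# Sketch of the average-ranking comparison: rank every model within each metric
# (rank 1 = best) and average the ranks per model. Placeholder values only.
import pandas as pd

metrics = pd.DataFrame(
    {"AUC": [0.948, 0.921, 0.910], "ACC": [0.869, 0.842, 0.838], "MCC": [0.743, 0.690, 0.681]},
    index=["AttBiLSTM_DE", "AttBiLSTM", "AttMLP"],
)

# Higher is better for all metrics used here, so rank in descending order.
ranks = metrics.rank(ascending=False, axis=0)
avg_rank = ranks.mean(axis=1).sort_values()
print(avg_rank)   # lower average rank = better overall performance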

Using attention in a BiLSTM architecture offers significant advantages for predicting amino acid properties compared with other models such as GRU, AttDenseFusion, XGBoost, or MLP86. The BiLSTM captures relationships in the data from both directions, allowing the model to learn how residues upstream and downstream of a position affect peptide function, something unidirectional or sequence-agnostic models such as a standard GRU or an MLP cannot do as readily87. What sets the attention mechanism apart is its ability to focus on the most informative positions in the sequence while learning contextual information dynamically; this makes the model not only more accurate but also more interpretable, which is essential in biological applications. Moreover, BiLSTM combined with attention handles different feature types well, whether they are encoded as One-Hot, GloVe, or fastText representations87, allowing it to capture complex relationships within sequences and making it a stronger choice for peptide classification than the other attention-based models considered here.
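To make this architecture concrete, the following PyTorch sketch shows an attention-pooled BiLSTM classifier of the kind described above: a bidirectional LSTM encodes the per-residue features, an attention layer scores each position, and the attention-weighted context vector feeds a small classification head. The layer sizes (feature dimension 120, hidden size 64) are illustrative assumptions and not the published hyperparameters.

# Minimal attention-pooled BiLSTM classifier (illustrative dimensions only).
import torch
import torch.nn as nn

class AttBiLSTM(nn.Module):
    def __init__(self, feat_dim=120, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)          # scores each sequence position
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                            # x: (batch, seq_len, feat_dim)
        h, _ = self.bilstm(x)                        # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.att(h).squeeze(-1), dim=1)   # attention weights per residue
        context = torch.bmm(w.unsqueeze(1), h).squeeze(1)   # weighted sum over positions
        return torch.sigmoid(self.head(context)).squeeze(-1)

model = AttBiLSTM()
dummy = torch.randn(8, 50, 120)                      # 8 peptides, 50 residues, hybrid features
print(model(dummy).shape)                            # torch.Size([8]) -> ACP probabilities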

Comparison of our proposed model with existing works

We conducted a thorough evaluation of AttBiLSTM_DE against other publicly available ACP predictors on a separate dataset. The predictors examined include mACPpred2.045, mACPpred44, MLACP2.043, ACPred-BMF (Main), ACPred-BMF (Alternate)88, ACPred31, AMPfun89, AntiCP2.0 (Main)41, AntiCP2.0 (Alternate), CancerGram90, and iDACP91. The results show that our proposed model, AttBiLSTM_DE, surpasses all existing methods across a comprehensive range of metrics, achieving the highest figures for the global metrics, including an accuracy (ACC) of 95.85%, a Matthews correlation coefficient (MCC) of 88.00%, an area under the curve (AUC) of 98.48%, and an F1-score of 90.54%. These results indicate superior overall predictive capability compared with alternative ACP prediction models. The nearest competitors, mACPpred2.0 and MLACP2.0, report ACCs of 85.70% and 83.80%, MCCs of 59.70% and 55.70%, and AUCs of 88.70% and 89.30%, respectively, but remain markedly inferior in terms of F1-score and overall accuracy. This substantial performance gap underscores the robustness and reliability of our proposed approach (Table 6).

Table 6 Performance comparison of our model with other publicly available predictors on an independent dataset.

Ablation study on training set and independent set

An ablation study is a technique for determining how different components of a model affect its performance by systematically removing or altering parts and observing the resulting changes. Ablation studies are essential for assessing the effectiveness of individual model components: they identify the most impactful elements, help reduce unnecessary complexity, and provide a scientific basis for model design by quantifying each part's contribution.

Fig. 14 Box plots of the ablation study on the proposed model architecture for different performance metrics. (A) Train Dataset and (B) Independent Dataset.

In our research, we performed an ablation study on the training dataset to assess the impact of the various elements of the AttBiLSTM_DE model. In particular, we evaluated the model after omitting specific combinations of features or architectural components to see how the performance varied. Table 7 demonstrates how changes to components such as the feature fusion layers or attention modules influence the model's performance. The baseline model (A0) clearly outperforms the others across metrics such as AUC, accuracy, sensitivity, and F1-score. Notably, when key components such as the attention mechanism, the LSTM layers, or bidirectionality were removed in variants A1 through A8, performance declined significantly. This observation shows the critical role each component plays in achieving high accuracy and overall effectiveness, confirming that every element of the architecture contributes to the final result.

Table 7 Ablation study results on Train and Independent datasets.
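For clarity, the sketch below shows one way such ablation variants can be generated from a single configuration by toggling the attention pooling and the bidirectionality of the LSTM. The mapping of the A0 to A8 labels to specific toggles shown here is an assumption for illustration; Table 7 defines the actual variants used in the study.

# Illustrative generation of ablation variants from a single configuration.
import torch
import torch.nn as nn

def build_variant(use_attention=True, bidirectional=True, hidden=64, feat_dim=120):
    """Return a small (Bi)LSTM classifier with optional attention pooling."""
    class Variant(nn.Module):
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=bidirectional)
            out_dim = hidden * (2 if bidirectional else 1)
            self.att = nn.Linear(out_dim, 1) if use_attention else None
            self.head = nn.Linear(out_dim, 1)

        def forward(self, x):
            h, _ = self.lstm(x)
            if self.att is not None:                       # attention pooling
                w = torch.softmax(self.att(h).squeeze(-1), dim=1)
                pooled = torch.bmm(w.unsqueeze(1), h).squeeze(1)
            else:                                          # fall back to mean pooling
                pooled = h.mean(dim=1)
            return torch.sigmoid(self.head(pooled)).squeeze(-1)

    return Variant()

ablations = {
    "A0 (full model)": dict(use_attention=True, bidirectional=True),
    "A1 (no attention)": dict(use_attention=False, bidirectional=True),
    "A2 (unidirectional)": dict(use_attention=True, bidirectional=False),
}
x = torch.randn(4, 50, 120)
for name, cfg in ablations.items():
    print(name, build_variant(**cfg)(x).shape)   # each variant is then trained and scored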

To enhance the interpretation of the ablation results, we included a heatmap (Fig. 13). This visualization uses colour coding to depict performance across the ablation settings, with darker shades representing better performance, allowing a straightforward comparison of each configuration. We also included box plots (Fig. 14) to illustrate the distribution of performance metrics such as accuracy and F1-score across the ablation setups; these plots further reinforce the stability and superior performance of the complete AttBiLSTM_DE model relative to its ablated versions. In addition, we conducted an ablation study on the independent dataset to assess how the architectural changes affect generalizability. The baseline model (A0) consistently achieved the highest performance, confirming its robust design, and removing key components reduced performance, indicating their importance for accurate classification. Overall, the full model delivers efficient and robust performance for classifying ACPs and non-ACPs.

Case study

Anticancer peptides (ACPs) are effective therapeutic agents with high selectivity and low toxicity towards cancer cells. Identifying ACPs from protein sequences is crucial for drug discovery but is often costly and time-consuming. This study aims to predict ACPs using integrated deep learning and feature embedding methods.

Fig. 15 (A) Precision–Recall curve showing a PR-AUC of 0.9619, (B) Sensitivity–Specificity trade-offs, and (C) Decision Curve Analysis (DCA).

Our approach to predicting ACPs offers notable improvements over previous leading models. For instance, the recent mACPpred 2.045 made strides using a stacked deep learning framework and various descriptors, including NLP-based embeddings, to enhance its predictive capability, while ACPred-BMF88 combined Bi-LSTM architectures with interpretable features such as BPF, Quanc, and Qualc, along with SHAP for analyzing feature importance. Our model stands apart in how it blends modern embedding techniques, namely fastText and GloVe, with one-hot encoding to create a hybrid representation. The study explored several sequence-based embeddings, including fastText, GloVe, Word2Vec, One-Hot, Prot5, and ESM, and ultimately identified fastText, GloVe, and One-Hot as the best hybrid feature combination. These features were used to train an attention-based BiLSTM (AttBiLSTM) model optimized with Differential Evolution. The model's robustness was evaluated through 5-fold cross-validation across three random seeds, with performance metrics reported as mean ± standard deviation. On the independent test set, AttBiLSTM_DE achieved high accuracy, AUC, and MCC, effectively discriminating between ACPs and non-ACPs.
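The following toy sketch shows how such a hybrid per-residue encoding can be assembled by concatenating One-Hot, GloVe-style, and fastText-style vectors. The embedding tables here are random stand-ins; in the actual pipeline they would come from trained GloVe and fastText models, and the dimensions (50 per embedding, sequences padded to 60 residues) are assumptions for illustration.

# Toy hybrid per-residue encoding: One-Hot + GloVe-style + fastText-style vectors.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
glove_table = {aa: rng.normal(size=50) for aa in AMINO_ACIDS}     # stand-in GloVe vectors
fasttext_table = {aa: rng.normal(size=50) for aa in AMINO_ACIDS}  # stand-in fastText vectors

def encode_residue(aa):
    onehot = np.zeros(len(AMINO_ACIDS))
    onehot[AMINO_ACIDS.index(aa)] = 1.0
    return np.concatenate([onehot, glove_table[aa], fasttext_table[aa]])  # 20 + 50 + 50 = 120

def encode_peptide(seq, max_len=60):
    mat = np.zeros((max_len, 120))                # zero-padded to a fixed length
    for i, aa in enumerate(seq[:max_len]):
        mat[i] = encode_residue(aa)
    return mat

print(encode_peptide("FLPIIAKLLSGLL").shape)      # (60, 120) -> input to the BiLSTM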

To validate the effectiveness of the proposed Logistic Regression guided DE strategy, we conducted a comparative analysis against an alternative DE approach guided directly by the AttBiLSTM model. As shown in Table 5, both methods achieved comparable, high AUC values (98.48% for the LR-guided and 98.39% for the AttBiLSTM-guided variant), demonstrating strong classification performance. While the AttBiLSTM-guided version achieved a slightly higher accuracy (96.44%), the LR-guided DE strategy showed higher precision (94.37%) and F1-score (90.54%), suggesting better reliability in correctly identifying positive samples, an important consideration in biological sequence prediction where false positives are particularly costly. This advantage is attributed to Logistic Regression's sensitivity to linear feature boundaries, which allows it to act as a robust surrogate for guiding feature weight optimization during DE. Overall, the comparison confirms that while both strategies are effective, the LR-guided DE retains a practical edge on key performance metrics, supporting its use as the preferred optimization approach in our framework.
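A hedged sketch of this LR-guided search is shown below: SciPy's differential_evolution proposes per-block weights for the three feature groups, and a Logistic Regression surrogate scores each candidate weighting via cross-validated AUC. The synthetic data, block boundaries, and DE settings are illustrative assumptions, not the configuration used in our experiments.

# LR-guided Differential Evolution over feature-block weights (illustrative only).
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=120, random_state=0)
blocks = [slice(0, 20), slice(20, 70), slice(70, 120)]   # One-Hot / GloVe / fastText blocks

def objective(weights):
    Xw = X.copy()
    for w, blk in zip(weights, blocks):
        Xw[:, blk] *= w                                   # scale each feature block
    auc = cross_val_score(LogisticRegression(max_iter=1000), Xw, y,
                          cv=3, scoring="roc_auc").mean()
    return -auc                                           # DE minimizes, so negate AUC

result = differential_evolution(objective, bounds=[(0.0, 1.0)] * 3,
                                maxiter=5, popsize=8, seed=1, polish=False)
print("best block weights:", result.x, "CV AUC:", -result.fun)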

In Fig. 15, the PR-AUC of 0.9619 (Fig. 15A) and ROC-AUC of 0.9804 (Fig. 15B) on the independent test set indicate that the model exhibits strong discriminatory power and a low false-positive bias despite the significant class imbalance. Additionally, the decision curve analysis (Fig. 15C) reveals that the model consistently provides a greater net benefit than the "treat-all" or "treat-none" strategies across clinically relevant probability thresholds, highlighting not only statistical reliability but also practical decision-making value in identifying ACPs.
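For reference, the net-benefit quantity underlying such a decision curve analysis can be computed as in the sketch below, where net benefit = TP/N - FP/N * p_t/(1 - p_t) at threshold p_t and is compared against the treat-all and treat-none baselines; the labels and scores here are synthetic stand-ins for the model's predictions.

# Net-benefit calculation behind a decision curve analysis (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.7 + rng.normal(0.2, 0.2, size=500), 0, 1)  # toy model scores
n = len(y_true)
prevalence = y_true.mean()

for p_t in (0.1, 0.3, 0.5, 0.7):
    pred = y_prob >= p_t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    nb_model = tp / n - fp / n * p_t / (1 - p_t)
    nb_all = prevalence - (1 - prevalence) * p_t / (1 - p_t)   # "treat all" baseline
    print(f"threshold {p_t:.1f}: model NB = {nb_model:.3f}, "
          f"treat-all NB = {nb_all:.3f}, treat-none NB = 0.000")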

For biological interpretation, we analyzed several of the top sequences predicted by our model from Supplementary File Table 6. These sequences were selected on the basis of their highest predicted probability scores for ACP activity. For instance, the peptide DSHAKRHHGYKRKFHEKHHSHRGYRSNYLYDN has a prediction score of 0.9986 for ACP activity, indicating a high-confidence classification. The attention layers identify the most informative features from the One-Hot, GloVe, and fastText embeddings, demonstrating that the model focuses on biologically relevant sequence properties such as k-mer patterns and amino acid composition. In contrast, the peptide RGIRGSSAARPSGRRRDPAGRTTETGFNIFTQHD has a prediction score of 0.0072, indicating that it is likely not an ACP, with only a 0.72% chance of being one.

To further evaluate the real-world applicability and biological relevance of the proposed model, we conducted a case study using experimentally validated ACPs and ACP–small molecule conjugates. Specifically, we assessed the model's predictive performance on LTX-315, a synthetic cationic peptide with well-documented anticancer and immunomodulatory activity in preclinical studies and ongoing clinical trials. The peptide sequence KKWWKWWKW-FKRKR was encoded using the selected hybrid feature representation (fastText + GloVe + One-Hot) and evaluated with the proposed AttBiLSTM_DE model. The model assigned a high prediction score (97.98%) to this peptide, in alignment with its known anticancer potential. Furthermore, although NTP-38592 and FXY-393 are peptide–small molecule conjugates rather than standalone peptide sequences, they are based on LTX-315 and share similar biological mechanisms; their inclusion underscores the model's potential utility in predicting anticancer activity not only for pure peptides but also for hybrid peptide-based therapeutics.