Introduction

Post-translational modification (PTM) is the enzymatic or chemical modification of a protein after it is translated or synthesized in the ribosome. PTMs occur via removal of parts of a translated protein, covalent modifications, or degradation of modified proteins1,2. These modifications provide important insight into various cellular functions and biological processes of proteins, such as cellular dynamics and elasticity.

PTMs are important mechanisms for increasing proteomic diversity and play a vital role in functional proteomics because they regulate activity, localization, and interactions with other cellular molecules such as proteins, nucleic acids, lipids, and cofactors3. They can impact the structure, electrophilicity, and interactions of proteins. PTMs also regulate protein folding by targeting proteins to specific subcellular compartments, mediating interactions with ligands or other proteins, or initiating a change in their functional state, including signaling or catalytic activity4. A wide range of PTMs has been identified so far. The common PTMs include phosphorylation, glycosylation, ubiquitination, nitrosylation, methylation, acetylation, lipidation, and proteolysis, which influence almost all aspects of normal cell biology and pathogenesis5.

AMPylation is an emerging post-translational modification mediated by bacterial virulence factors that transfer adenosine monophosphate (AMP) from adenosine triphosphate (ATP) to a threonine residue of eukaryotic substrates6,7. AMPylation is the covalent attachment of AMP to a protein or peptide8. It has been studied exclusively in the context of Fic domain proteins, which are conserved and found in proteins ranging from bacteria to humans. By adding AMP to Rho-family GTPases, these enzymes can mediate both bacterial pathogenesis and eukaryotic signaling9,10. The most common and stable form of AMPylation occurs on the hydroxyl group of threonine, serine, or tyrosine via a phosphodiester bond. In the AMPylation process, AMP becomes covalently attached to the amino acid side chain of a protein through a phosphodiester bond between a hydroxyl group of the modified residue and the phosphate group of the adenosine monophosphate nucleotide (i.e., adenylic acid)14. The enzymes capable of catalyzing this process are called AMPylators. Threonine (T) and tyrosine (Y) are the usual targets of AMPylation, although this PTM can sometimes be observed on serine (S) as well.

Recent proteomics studies have demonstrated that this PTM is more common than generally acknowledged, and it is emerging as a significant regulatory mechanism in both eukaryotic and prokaryotic cells. It is implicated in a vast range of biological processes, stretching from the regulation of nitrogen metabolism in bacteria and the regulation of signaling pathways to pathogenesis in several animal species11,12,13,14. AMPylation has also been found to play a significant role in the regulation of neurodevelopment and neurodegeneration15.

Experimental approaches used to determine PTM sites are expensive, laborious, and time-consuming. Hence, many studies have been proposed to predict PTM sites using fast and cost-effective computational approaches16,17,18,19,20,21,22,23,24,25,26,27,28. In 2015, Khater and Mohanty used SVM and HMM to develop a computational protocol for the identification of AMPylation domains and their classification into various functional subfamilies, such as those catalyzing AMPylation, deAMPylation, phosphorylation, and phosphocholine transfer29. However, they did not directly predict whether a given peptide is AMPylated or non-AMPylated.

To the best of our knowledge, no computational approach has so far been proposed for predicting AMPylation sites of Fic domain proteins. One of the main reasons is that no AMPylation dataset has been available for this task. In this study, we present a new dataset of protein AMPylation sites. Furthermore, we also propose a new deep convolutional neural network (CNN) model called DeepAmp for predicting protein AMPylation sites on the newly assembled dataset of AMP-modified proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC) for the AMPylation site prediction task, respectively. As the first machine learning model for this task, DeepAmp demonstrates promising results which highlight its potential to solve this problem. We believe this study will help researchers immensely in mitigating the current research gap in this subject. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.

Results and discussion

Evaluation metrics

In order to ensure standardized evaluation of our model and to provide more insight into our results, we calculate Accuracy, Sensitivity, Specificity, and the Matthews correlation coefficient (MCC) as evaluation metrics. These metrics are defined by the following equations:

$$\begin{aligned}
Accuracy &= \frac{tp+tn}{tp+tn+fp+fn} \times 100, \qquad (1)\\
Sensitivity &= \frac{tp}{tp+fn} \times 100, \qquad (2)\\
Specificity &= \frac{tn}{tn+fp} \times 100, \qquad (3)\\
MCC &= \frac{(tp \times tn) - (fp \times fn)}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}}, \qquad (4)
\end{aligned}$$

where tp denotes true positive, and tn, fp, fn denote true negative, false positive, and false negative, respectively.

Additionally, to show the model's ability to distinguish between AMPylated and non-AMPylated sites, we calculated the Area Under the Curve (AUC). AUC measures the ability of a classifier to distinguish between classes: the higher the AUC, the better the model separates positive from negative instances. An AUC value of 1 indicates that the classifier can correctly differentiate between all the positive and negative class points, while an AUC value of 0 indicates poor performance30.
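As a concrete illustration, the following Python snippet computes the five metrics above from predicted labels and scores. It is a minimal sketch using scikit-learn, which is an assumption about tooling rather than the exact implementation used in this study:

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Compute Accuracy, Sensitivity, Specificity, MCC, and AUC.

    y_true  : ground-truth labels (1 = AMPylated, 0 = non-AMPylated)
    y_pred  : predicted labels at a fixed decision threshold
    y_score : predicted probabilities for the positive class
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100   # Eq. (1)
    sensitivity = tp / (tp + fn) * 100                  # Eq. (2)
    specificity = tn / (tn + fp) * 100                  # Eq. (3)
    mcc = matthews_corrcoef(y_true, y_pred)             # Eq. (4)
    auc = roc_auc_score(y_true, y_score)
    return accuracy, sensitivity, specificity, mcc, auc
```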

Comparison with different machine learning techniques

Since DeepAmp is the first computational model proposed to predict AMPylation PTM, it is not possible to compare its performance with other studies. However, to investigate the effectiveness of the CNN used to build DeepAmp, we compare it with other ML models on this problem. Results achieved using DeepAmp compared to other ML models, including Support Vector Machine (SVM), Random Forest (RF), Linear Regression (LR), Decision Tree (DT), and K-Nearest Neighbor (KNN), using the same set of features are presented in Tables 1 and 2 for fivefold and tenfold cross-validation, respectively. We report the average of 10 runs of fivefold and tenfold cross-validation for all metrics in Tables 1 and 2. As shown in these tables, DeepAmp achieves significantly better results in terms of all metrics than the other machine learning methods investigated in this study.

As shown in Table 1, DeepAmp achieves 75.9%, 77.2%, 75.2%, 0.52, and 0.84 in terms of Accuracy, Sensitivity, Specificity, MCC, and AUC for the AMPylation site prediction task using fivefold cross-validation, respectively. Also, according to Table 2, DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of Accuracy, Sensitivity, Specificity, MCC, and AUC using tenfold cross-validation, respectively. As shown in these tables, the prediction accuracy and MCC of the other classifiers explored in this study are all below 70.0% and 0.30, respectively, which demonstrates the effectiveness of using a CNN to build DeepAmp.

Table 1. Results of different models on the AMPylation dataset (fivefold CV).

As shown in Tables 1 and 2, the results using tenfold cross-validation are slightly better than those reported using fivefold cross-validation. This can be attributed to the larger number of samples used to train the model in tenfold cross-validation. In k-fold cross-validation we evaluate the model on 1/k of the data and train on the rest. Therefore, for our dataset, in tenfold cross-validation we use 362 samples for training in each iteration, while in fivefold cross-validation we use 320 samples. As a result, there are more samples available to train the model using tenfold rather than fivefold cross-validation. This also suggests that, given a larger dataset, DeepAmp would be able to achieve even better results.

Table 2. Results of different models on the AMPylation dataset (tenfold CV).

In Fig. 1, the receiver operating characteristic (ROC) curves clearly illustrate the DeepAmp model's ability to distinguish between AMPylation and non-AMPylation sites. To provide further information for the reader, the ROC curves for fivefold and tenfold cross-validation (one ROC curve per fold) are also provided as supplementary material (Figs. S1, S2). Also, as shown in Tables 1 and 2, in terms of the MCC score the other ML models display mediocre classification quality, whereas DeepAmp shows a significant improvement. This demonstrates the effectiveness of DeepAmp over the other classifiers in consistently identifying positive and negative samples.

Figure 1. Receiver operating characteristic (ROC) curves of the DeepAmp model for 5-fold and 10-fold CV.

Methods and materials

This section describes the proposed method and benchmark dataset presented in this study.

Benchmark dataset

Kielkowski et al.9 identified AMPylation in intact cancer cells via LC-MS/MS as well as imaging methods. Using a pronucleotide probe, they identified protein AMPylation in living cells. They synthesized an N6-propargyl adenosine phosphoramidate pronucleotide (pro-N6pA) and treated different cell lines, such as HeLa and SH-SY5Y, to identify the modified sites. The AMPylated proteins found in that study are engaged in a variety of metabolic pathways, including a widely conserved key regulator of glycolysis, ATP-dependent 6-phosphofructokinase (PFKP), as well as proteolysis (CTSA, CTSB), regulation of PTMs (PPME1), and the unfolded protein response (UPR; HSPA5 and SQSTM1). They identified a total of 162 protein sequences involved in this distinct modification. We investigated these proteins through the UniProt database and identified a total of 133 unique protein sequences, which were used to build our dataset.

We then used CD-HIT to remove proteins with over 40% sequence similarity to discard redundancy in the dataset31. The resulting dataset contains 130 unique proteins with less than 40% pairwise sequence similarity. After that, for each AMPylation and non-AMPylation site, a 31-residue peptide containing the central AMPylation/non-AMPylation site with 15 residues upstream and 15 residues downstream was extracted. We tried different peptide lengths, among which 31-residue peptides attained the best results. For AMPylation sites near the two ends of a protein, with fewer than 15 neighboring amino acids on one side, the peptides were equalized in length by padding with the "X" residue, as sketched below. As a result, a total of 153 peptides with AMPylated sites and 28,872 peptides with non-AMPylated sites were extracted from the 130 protein sequences. From the 28,872 non-AMPylated peptides, we randomly selected 250 sequences to balance our dataset with an almost 2:1 ratio of negative to positive samples. Thus, our final dataset of 403 peptide sequences, comprising 153 AMPylated peptides and 250 non-AMPylated peptides, was created. This dataset is available at https://github.com/MehediAzim/DeepAmp.
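The window-extraction and padding step can be summarized with the following Python sketch. The function name, the example sequence, and the site index are illustrative assumptions; mapping UniProt records to modified positions is assumed to happen upstream:

```python
def extract_peptide(sequence, site_index, flank=15, pad_char="X"):
    """Extract a (2*flank + 1)-residue window centred on a (non-)AMPylation site.

    Positions that fall outside the protein termini are padded with "X"
    so that every peptide has the same length (31 residues for flank=15).
    """
    peptide = []
    for pos in range(site_index - flank, site_index + flank + 1):
        if 0 <= pos < len(sequence):
            peptide.append(sequence[pos])
        else:
            peptide.append(pad_char)
    return "".join(peptide)

# Example (hypothetical sequence): a site near the N-terminus gets left-padded with "X".
print(extract_peptide("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", site_index=3))
```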

Feature encoding

Feature encoding is an important step in building an effective machine learning model. Binary profile features (also known as one-hot encoding) are straightforward, yet have been shown to be very effective for the prediction of different functionalities in multi-omics datasets32,33. In this study, we generate a binary profile for each peptide by representing each amino acid as a 20-dimensional one-hot vector. For instance, Alanine is replaced by the 20-dimensional one-hot vector [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. As a result, a sequence of length L is represented by a vector of dimension L × 20. Considering L = 31 (the length of our peptides), we extract 620 features for each peptide (31 × 20). This feature encoding process is depicted in Fig. 2. Considering that we use a convolutional neural network to build our model, the binary profile can potentially provide extensive information to train our model.
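A minimal sketch of this binary-profile encoding in Python is given below; the amino-acid ordering in `ALPHABET` is an assumption, and padding residues ("X") are encoded here as all-zero vectors:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (order assumed)

def binary_profile(peptide):
    """Encode a peptide as an L x 20 one-hot matrix (620 features for L = 31)."""
    encoding = np.zeros((len(peptide), len(ALPHABET)), dtype=np.float32)
    for i, residue in enumerate(peptide):
        if residue in ALPHABET:          # "X" padding stays all zeros
            encoding[i, ALPHABET.index(residue)] = 1.0
    return encoding

x = binary_profile("A" * 31)
print(x.shape)           # (31, 20)
print(x.flatten().size)  # 620 features per peptide
```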

Figure 2. Binary profile feature generation of peptide sequences.

Classification technique

Convolutional neural networks (CNNs) are widely used in computational biology for predicting different biological and chemical functionalities and entities from multi-omics datasets. They have shown tremendous success in the prediction of different PTMs, cancer cell-type classification, origin-of-replication prediction, and many more tasks34,35,36. Like any other neural network, a CNN consists of an input layer, hidden layers, and an output layer. Extracting feature maps using the convolution operation makes the CNN architecture different from regular neural networks. Unlike the hidden layers of a regular neural network, which are basically constructed from sets of fully connected neurons, the hidden layers of a CNN mainly consist of convolutional layers, pooling layers, and fully connected layers37.

The CNN architecture we used is depicted in Fig. 3. Our CNN classifier consists of three Conv1D layers with numbers of filters and kernel sizes of [24, 7], [16, 5], and [8, 3], respectively. We also use two MaxPooling1D layers as well as two Dense layers. The input is the L × 20 matrix, where L is the length of the peptide sequence (31). We apply one-dimensional kernels to the input vectors. The output of our first 1-D convolutional layer, which can also be thought of as a motif scanner, is then passed to a max-pooling layer. Among the three convolutional layers, max-pooling is applied after the first two. The output of the last convolutional layer is passed directly to a fully connected layer and the prediction layer. The Rectified Linear Unit (ReLU) is used as the activation function for each intermediate layer, as it is popular for its simplicity and effectiveness38,39. In each of the convolutional layers and the fully connected layer, we use dropout to avoid overfitting40.

Even though deeper CNN models provide the best results for computer vision problems41, for biological sequence data presented as matrices, different studies have shown that increasing the depth of the convolutional stack does not necessarily improve prediction accuracy, especially for smaller datasets similar to ours42. Furthermore, a shallower architecture reduces the chance of overfitting and requires fewer instances for training40,43. In order to prevent overfitting, we therefore develop a shallow CNN architecture. With only 7825 trainable parameters, the model provides balanced results. Additionally, to prevent overfitting, we use two regularization methods, namely dropout and L2, for each Conv1D layer.
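The following Keras sketch illustrates an architecture consistent with this description. The dropout rates, L2 strength, dense-layer width, and pooling sizes are assumptions not stated in the text, and the sketch is not guaranteed to reproduce the reported 7825-parameter model:

```python
from tensorflow.keras import layers, models, regularizers

def build_deepamp_like(seq_len=31, n_aa=20):
    """Shallow 1-D CNN over binary-profile peptide encodings (illustrative only)."""
    reg = regularizers.l2(1e-3)                        # assumed L2 strength
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_aa)),
        layers.Conv1D(24, 7, activation="relu", kernel_regularizer=reg),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),                           # assumed dropout rate
        layers.Conv1D(16, 5, activation="relu", kernel_regularizer=reg),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.Conv1D(8, 3, activation="relu", kernel_regularizer=reg),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(16, activation="relu"),           # assumed dense width
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),         # AMPylated vs. non-AMPylated
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```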

Figure 3. Model architecture of DeepAmp.

Evaluation methods

In order to measure the efficacy of DeepAmp, k-fold cross-validation is used. In k-fold cross-validation, the dataset is split into k subsets; k−1 subsets are used for training and the remaining fold is used for validation, so that the whole dataset is eventually used for both training and validation. Since the training size grows with k, classifiers tend to show better results for larger k. We used stratified k-fold cross-validation, which maintains a fixed ratio of negative and positive sites in the training and validation sets44. In this study, we evaluate our model using k = 5 and k = 10, two common values for this parameter.
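A minimal sketch of this evaluation protocol using scikit-learn's StratifiedKFold is shown below; the tooling, the number of training epochs, and the random seed are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, k=10, epochs=50):
    """Stratified k-fold CV: each fold preserves the positive/negative ratio."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    fold_accuracies = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                          # fresh model for each fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        fold_accuracies.append(acc)
    return np.mean(fold_accuracies)                    # average over the k folds
```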

Conclusion

In this study, we presented a new dataset that can be used to evaluate computational methods, especially machine learning based models, for predicting AMPylation PTM. On top of that, we proposed a new deep learning-based tool called DeepAmp for predicting AMPylation using a CNN and binary profile feature vectors. DeepAmp achieves an accuracy of 77.7% and sensitivity, specificity, MCC, and AUC scores of 79.1%, 76.8%, 0.55, and 0.85, respectively, for tenfold cross-validation. DeepAmp also significantly outperforms widely used machine learning models, including Support Vector Machine, K-Nearest Neighbor, and Random Forest, for predicting AMPylation sites. Due to the limited sample size available, prediction with high accuracy remains challenging. In future refinements of this work, we aim to incorporate new AMPylation sites into the dataset and create a larger database for AMPylation PTM. Furthermore, we aim to improve our predictor's performance by using different feature sets and deeper CNN architectures. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.