Detection of dynamic protein complexes through Markov Clustering based on Elephant Herd Optimization Approach

Rani, R. Ranjani; Ramyachitra, D.; Brindhadevi, A.

doi:10.1038/s41598-019-47468-y

Open access
Published: 31 July 2019

Scientific Reports volume 9, Article number: 11106 (2019)

Abstract

The availability of a huge amount of protein-protein interaction (PPI) data has enabled research on biological networks that reveal the structure of protein complexes, pathways and their cellular organization. A key demand in computational biology is to recognize the modular structure of such biological networks. The detection of protein complexes from the PPI network is one of the most challenging and significant problems in the post-genomic era. In bioinformatics, a frequently employed approach for clustering networks is Markov Clustering (MCL). Much of the research on protein complex detection has been done on the static PPI network, which suffers from a few drawbacks. To resolve this problem, this paper proposes an approach to detect dynamic protein complexes through Markov Clustering based on the Elephant Herd Optimization approach (DMCL-EHO). First, the proposed method divides the PPI network into a set of dynamic subnetworks at various time points by combining it with gene expression data; second, it performs clustering analysis on every subnetwork using MCL along with the Elephant Herd Optimization approach. The experimental analysis was carried out on different PPI network datasets, and the proposed method surpasses various existing approaches in terms of accuracy measures. This paper identifies the common protein complexes that are significantly enriched in gold-standard datasets and also the pathway annotations of the detected protein complexes using the KEGG database.

Introduction

Protein complexes are molecular assemblies of proteins formed through multiple protein-protein interactions, and they play a significant part in numerous biological processes. Several proteins are biologically functional only when they interact with their neighbouring proteins and form a protein complex. It is therefore crucial to recognize the sets of proteins that form complexes, and numerous computational approaches have been developed to detect and predict protein complexes from PPI networks.

High-throughput approaches have produced a huge quantity of protein interaction data that help to discover protein complexes from a large PPI network. During the clustering process, the PPI network is considered as an undirected graph N_et = (V_er, E_dg), where V_er is the set of nodes and E_dg is the set of edges. The nodes signify the proteins and the edges signify the interactions between proteins.

For clustering, the PPI network has been modelled in two ways: the static PPI network, which is used to detect protein functional modules, and the dynamic PPI network, which is used to detect protein complexes. The dynamic PPI network is defined as the division of the static PPI network into a series of time-sequenced subnetworks using gene expression data. There is a difference between a protein functional module and a protein complex. A protein functional module is a cluster of proteins that contribute to a specific cellular process and bind with each other at various time points, whereas a protein complex is a cluster of proteins that interact with each other at the same time point¹.

Many computational approaches to protein complex detection have focused on the static PPI network; they extract dense regions and consider only the topological structure of the network. Methods that use the static PPI network for protein complex detection include MCode², CFinder³, MCL⁴, COACH⁵, ClusterOne⁶, RNSC⁷, CMC⁸, and many more. Maulik et al. identified protein complexes using a non-cooperative sequential game⁹.

As the PPI network continuously changes with the environment, time and the phases of the cell cycle, clustering analysis on the static PPI network does not capture these dynamic attributes and is far from an optimal solution. Thus, in recent times, various attempts have been made to cluster the dynamic PPI network together with gene expression data to enhance protein complex detection. Also, many evolutionary approaches have been employed for clustering the PPI network, such as ant colony optimization (ACC-DPC¹, ACO-MCL¹⁰), cuckoo search optimization (CSO)¹¹, BiCAMWI using a genetic algorithm¹², Soft Regularized-MCL¹³, particle swarm optimization (PSO-MCL)¹⁴ and the artificial fish school algorithm (AFA-MCL)¹⁵. Firefly optimization was employed along with Markov Clustering (F-MCL) on the dynamic PPI network for predicting complexes. The execution time of F-MCL is high because all the fireflies (proteins) in the population (network) try to reach the optimal solution (cluster). A few proteins that do not belong to any cluster take more iterations to reach one, which may take a long time¹⁶.

The above-mentioned approaches are effective, but they do not guarantee a global optimum since they suffer from unwanted clusters, which makes them time consuming. To overcome these drawbacks, a novel approach is proposed to detect dynamic protein complexes through Markov Clustering based on the Elephant Herd Optimization approach. One of the most important advantages of EHO is that it is computationally efficient and less time consuming than F-MCL and other approaches. This is because the unwanted noisy data (unclustered proteins) are removed by the clan separating operator of the EHO approach. The remaining sections of this paper are ordered as follows: Section 2 briefly discusses the methodology of the proposed approach, Section 3 illustrates the experimental results with various performance measures, Section 4 discusses the implementation of the proposed method in detail, and finally Section 5 concludes the paper and recommends future enhancements.

Methods

To detect protein complexes, the proposed method first divides a static PPI network into a sequence of subnetworks at different time points by combining it with gene expression data to form a dynamic model. To build the dynamic model, the static PPI network is integrated with gene expression data, which indicate the level of gene expression as well as protein expression. As a protein is not always active during a cell cycle, it is assumed that a protein is active at the time points with its highest expression level¹⁷. The expression level of a protein increases before it becomes active and decreases once the protein has completed its function, and its active time points are identified as those at which its expression level is higher than a threshold.

Given a static PPI network P_P = (P_ver, P_Edg), where P_ver is a set of proteins and P_Edg is a set of interactions between these proteins, the gene expression data form a series of T time stamps stored in a |P_ver| × (T * TR) matrix M, where TR is the number of repetitions of the time series. Each element M(P_ver, j) of this matrix represents the level of gene expression.

The three-sigma principle is employed to determine whether a gene is expressed at a given time stamp. For each gene P_ver, the gene expression is defined as given in the following Eqs (1–5):

$${{\rm{Ev}}}_{{\rm{i}}}({{\rm{P}}}_{{\rm{ver}}})=\frac{{\sum }_{{\rm{tr}}=1}^{{\rm{TR}}}{\rm{M}}({{\rm{P}}}_{{\rm{ver}}},{\rm{i}}+{\rm{T}}\times ({\rm{tr}}-1))}{{\rm{TR}}}$$

(1)

$${\rm{UE}}({{\rm{P}}}_{{\rm{ver}}})=\frac{{\sum }_{{\rm{i}}=1}^{{\rm{T}}}{{\rm{Ev}}}_{{\rm{i}}}({{\rm{P}}}_{{\rm{ver}}})}{{\rm{T}}}$$

(2)

$${{\rm{\sigma }}}^{2}({{\rm{P}}}_{{\rm{ver}}})=\frac{{\sum }_{{\rm{i}}=1}^{{\rm{T}}}{({{\rm{Ev}}}_{{\rm{i}}}({{\rm{P}}}_{{\rm{ver}}})-{\rm{UE}}({{\rm{P}}}_{{\rm{ver}}}))}^{2}}{{\rm{T}}}$$

(3)

$${\rm{Fl}}({{\rm{P}}}_{{\rm{ver}}})=\frac{1}{1+{{\rm{\sigma }}}^{2}({{\rm{P}}}_{{\rm{ver}}})}$$

(4)

$$\begin{array}{c}{\rm{AT}}({{\rm{P}}}_{{\rm{ver}}})={{\rm{S}}}_{1}({{\rm{P}}}_{{\rm{ver}}})\times {\rm{Fl}}({{\rm{P}}}_{{\rm{ver}}})+{{\rm{S}}}_{2}({{\rm{P}}}_{{\rm{ver}}})\times (1-{\rm{Fl}}({{\rm{P}}}_{{\rm{ver}}}))\\ \,\,\,\,\,\,=\,{\rm{UE}}({{\rm{P}}}_{{\rm{ver}}})+3{\rm{\sigma }}({{\rm{P}}}_{{\rm{ver}}})\,(1-{\rm{Fl}}({{\rm{P}}}_{{\rm{ver}}}))\end{array}$$

(5)

where Ev_i(P_ver) is the mean expression value of gene P_ver at timestamp i over the TR repetitions, UE(P_ver) is the mean of its expression values over timestamps 1 to T, σ(P_ver) is the standard deviation of its expression values, and Fl(P_ver) reflects the fluctuation of the expression curve of gene P_ver. If the gene expression data are assumed to follow a normal distribution, then S₁(P_ver) and S₂(P_ver) are the associated mean and three-sigma value, that is, S₁(P_ver) = UE(P_ver) and S₂(P_ver) = UE(P_ver) + 3σ(P_ver). By the three-sigma principle, the probability that a value greater than S₂(P_ver) is not an active point is less than 0.1%. AT(P_ver) is the active threshold of gene P_ver. Consider the gene P_ver at timestamp i: if Ev_i(P_ver) > AT(P_ver), then the gene P_ver is expressed and the gene product exists¹⁶.
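As a minimal sketch of Eqs (1–5), the following Python function computes the active threshold of one gene and the time points at which it counts as expressed. The function name and the flat input layout (T values of the first repetition, then T values of the second, and so on) are assumptions made for illustration, and indices are 0-based rather than 1-based as in the equations.

```python
import math


def active_time_points(expr, T, TR):
    """Return (active_threshold, active_indices) for one gene.

    expr is assumed to be a flat list of T * TR expression values,
    stored repetition after repetition (a hypothetical layout).
    """
    # Eq. (1): average each time point over the TR repetitions.
    ev = [sum(expr[i + T * tr] for tr in range(TR)) / TR for i in range(T)]
    # Eq. (2): mean expression over the T time points.
    ue = sum(ev) / T
    # Eq. (3): variance of the averaged expression curve.
    var = sum((x - ue) ** 2 for x in ev) / T
    # Eq. (4): fluctuation term, close to 1 for a flat curve.
    fl = 1.0 / (1.0 + var)
    # Eq. (5): threshold between the mean and the three-sigma value.
    at = ue + 3.0 * math.sqrt(var) * (1.0 - fl)
    # A gene is taken as expressed where its value exceeds the threshold.
    active = [i for i, x in enumerate(ev) if x > at]
    return at, active
```

For a perfectly flat curve the variance is zero, so the threshold equals the mean and no time point is strictly above it.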

In the clustering procedure for every subnetwork, the proposed method starts by constructing the initial protein clusters from the protein complexes obtained at the previous time point. The procedure for constructing the initial clusters in the first generation has three steps: seed node selection, attachment node addition and refining¹. To demonstrate the three steps, consider a subnetwork at time point t, ${P}_{p}^{t}$ = (${P}_{ver}^{t}$, ${P}_{Edg}^{t}$), where ${P}_{ver}^{t}$ is the set of proteins and ${P}_{Edg}^{t}$ is the set of interactions between these proteins at time t.

1. Selecting seed nodes: This step first computes the clustering coefficient of every node. It then selects the nodes whose clustering coefficients are greater than a given threshold λ_c as seed nodes and puts them into the set of seed nodes at the current time point t, denoted by S^t. The seed nodes are considered as candidate clustering centers and represent different clusters of protein complexes. The clustering coefficient of any node i is defined in Eq. (6):
$${\rm{\Psi }}=\frac{2\times {n}_{i}^{t}}{|Neigh(i)|\times (|Neigh(i)|-1)}$$
(6)
where Neigh(i) = {j ∈ ${P}_{ver}^{t}$ | (i, j) ∈ ${P}_{Edg}^{t}$} represents the neighbor nodes of node i, |Neigh(i)| is the number of neighbor nodes of node i, and ${n}_{i}^{t}$ is the number of links between the neighbor nodes of i at time point t.
2. Attachment nodes addition: For any seed node i (i ∈ ${S}^{t}$) of the current time point t, if it is also a seed node of the previous time point (t − 1), then the nodes that were in cluster i at the previous time point (t − 1) and also exist in the subnetwork ${P}_{p}^{t}$ at the current time point t are put into cluster i of the current time point t. In this way, the initial clusters are built. However, some clusters may be too sparse, since not all proteins of the previous time point (t − 1) exist at the current time point t. Thus, a refining step needs to be carried out on the initial clusters.
3. Refining: For any initial cluster ${c}_{i}^{t}$ at the current time point t, if its density is smaller than a given threshold λ_d, the nodes in ${c}_{i}^{t}$ are sorted in descending order of their clustering coefficients, and the node with the smallest clustering coefficient is iteratively removed until the density of cluster ${c}_{i}^{t}$ is not smaller than λ_d. The density of a protein complex ${c}_{i}^{t}$ is computed by Eq. (7):

$$den({c}_{i}^{t})=\frac{2\times {l}_{i}}{{n}_{i}\times ({n}_{i}-1)}$$

(7)

where n_i and l_i are the number of nodes and the number of edges in cluster ${c}_{i}^{t}$, respectively¹.
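The seed selection and refining steps can be sketched in Python as follows. This is an illustrative reading of Eqs (6) and (7), not the authors' implementation: the subnetwork is assumed to be a dictionary mapping each node to the set of its neighbors, node labels are assumed to be mutually comparable, and ties between equal clustering coefficients are broken by node label, which the text does not specify.

```python
def clustering_coefficient(adj, i):
    """Eq. (6): fraction of neighbor pairs of node i that are linked."""
    neigh = adj[i]
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u in neigh for v in neigh if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def select_seeds(adj, lambda_c):
    """Nodes whose clustering coefficient exceeds the threshold lambda_c."""
    return sorted(v for v in adj if clustering_coefficient(adj, v) > lambda_c)


def density(adj, cluster):
    """Eq. (7): edges inside the cluster over the maximum possible number."""
    n = len(cluster)
    if n < 2:
        return 0.0
    edges = sum(1 for u in cluster for v in cluster if u < v and v in adj[u])
    return 2.0 * edges / (n * (n - 1))


def refine(adj, cluster, lambda_d):
    """Drop the node with the smallest coefficient until density >= lambda_d."""
    cluster = set(cluster)
    while len(cluster) > 1 and density(adj, cluster) < lambda_d:
        # Tie-breaking by node label is an assumption of this sketch.
        worst = min(cluster, key=lambda v: (clustering_coefficient(adj, v), v))
        cluster.remove(worst)
    return cluster
```

For a triangle with one pendant node attached, refining with a high density threshold removes the pendant node and keeps the triangle.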

The clustering analysis of the remaining generations is then carried out by applying the Markov Clustering technique along with the EHO algorithm to every subnetwork. A matrix is constructed that holds the transition probabilities of a Markov chain (random walk) on the graph. The MCL procedure comprises two operations, expansion and inflation, which are applied to this matrix. The construction of the matrix M_at for a graph and the Markov clustering process are briefly described below¹⁸.

Let P_p = (P_ver, P_Edg), where P_ver is a set of proteins and P_Edg is a set of interactions between these proteins. Denote a node in P_ver by p_vi and an edge between p_vi and p_vj in P_Edg by (p_vi, p_vj), in which i and j are the indexes of the corresponding nodes¹⁶. W(p_vi, p_vj) is the weight of edge (p_vi, p_vj), which represents the confidence level of the interaction in a weighted PPI network. Adj is the adjacency matrix of the weighted graph, given in Eq. (8),

$${\rm{Adj}}({\rm{i}},{\rm{j}})=\left\{\begin{array}{ll}{\rm{W}}({{\rm{p}}}_{{\rm{vi}}},{{\rm{p}}}_{{\rm{vj}}}) & {\rm{if}}\,({{\rm{p}}}_{{\rm{vi}}},{{\rm{p}}}_{{\rm{vj}}})\in {{\rm{P}}}_{{\rm{Edg}}}\\ {{\rm{\max }}}_{{\rm{x}}\ne {\rm{j}}}\,{\rm{W}}({{\rm{p}}}_{{\rm{vi}}},{{\rm{p}}}_{{\rm{vx}}}) & {\rm{if}}\,{{\rm{p}}}_{{\rm{vi}}}={{\rm{p}}}_{{\rm{vj}}}\\ 0 & {\rm{else}}\end{array}\right.$$

(8)

A canonical flow matrix M_at is an n × n (n = |P_ver|) matrix that shows the probabilities of transition of a random walk defined on the graph. M_at(i, j) represents the probability of a transition from node p_vi to p_vj. The transition probability from p_vi to p_vj is referred to as the stochastic flow from p_vi to p_vj. All the elements in each column of M_at will sum up to 1 and the matrix is expressed as given in Eq. (9)

$${{\rm{M}}}_{{\rm{at}}}({\rm{i}},{\rm{j}})=\frac{{\rm{Adj}}({\rm{i}},{\rm{j}})}{{\sum }_{{\rm{k}}=1}^{{\rm{n}}}{\rm{Adj}}({\rm{k}},{\rm{j}})}$$

(9)
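A minimal Python sketch of Eqs (8) and (9), followed by one expansion and inflation step, is given here for illustration. It assumes that the self-loop weight in Eq. (8) is the largest weight of any edge at that node, that nodes are indexed 0 to n − 1, and that edge weights are passed as a dictionary; the step shown is plain MCL and does not include the balance and penalty terms of the variant used in this work.

```python
def canonical_flow_matrix(weights, n):
    """Build the column-normalized flow matrix of Eq. (9).

    weights is assumed to map an undirected edge (i, j), i != j, to its weight.
    """
    adj = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        adj[i][j] = adj[j][i] = w
    # Self-loop (assumed reading of Eq. 8): largest edge weight at node i.
    for i in range(n):
        adj[i][i] = max(adj[i])
    return normalize_columns(adj)


def normalize_columns(m):
    """Divide every entry by its column sum so each column sums to 1."""
    n = len(m)
    sums = [sum(m[k][j] for k in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl_step(m, inflation):
    """One plain MCL iteration: expansion (M x M), then inflation."""
    n = len(m)
    expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    powered = [[x ** inflation for x in row] for row in expanded]
    return normalize_columns(powered)
```

Repeating `mcl_step` until the matrix stops changing is the usual way to reach the clustering; the convergence test is left out of this sketch.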

The three crucial parameters of MCL are inflation constant (ic), balance (b) and penalty proportion (P_p), where ic defines the size of each cluster, b defines the user-specific balance constant that is employed for penalizing higher-propensity neighbours and P_p defines the penalty ratio of the protein nodes, which is also user-specified¹⁶. The clustering process using EHO algorithm is briefly explained here for clustering protein complexes. The overall flowchart of the proposed method is shown in Fig. 1.

Elephant herd optimization

Elephant herd optimization is a contemporary swarm intelligence technique that was proposed in 2016¹⁹. The algorithm was inspired by the herding behaviour of elephants. In general, elephants are social mammals with a complex social structure comprising numerous clans, each under the guidance of a matriarch. A clan comprises one or more female elephants with their calves. Females prefer to live in family groups, while male elephants prefer to live alone and leave the clan as they grow up, in each generation²⁰. The behaviour of the clans represents exploitation, and the leaving elephants represent exploration of the population.

The behaviour of the elephants is modelled using two main operators, the clan updating and clan separating operators, which are used to produce a better clustering of proteins. Here, the elephant population corresponds to the static PPI network, each clan corresponds to a dynamic PPI subnetwork, and the elephants inside each clan correspond to proteins.

Clan updating operator

The static PPI network is initially separated into k dynamic PPI subnetworks. Each dynamic PPI subnetwork is headed by an individual protein, which represents its best solution. In each generation, protein e of dynamic PPI cl_i moves towards ${p}_{best,c{l}_{i}}$, which has the best fitness in dynamic PPI cl_i. The fitness of the dynamic PPI is computed from the accuracy values of the protein complexes. For protein e in dynamic PPI cl_i, the new position is given by Eq. (10).

$${p}_{new,c{l}_{i},e}={p}_{c{l}_{i},e}+\alpha ({p}_{best,c{l}_{i}}-{p}_{c{l}_{i},e})\times rand$$

(10)

where ${p}_{new,c{l}_{i},e}$ is the new position of protein e in dynamic PPI cl_i and ${p}_{c{l}_{i},e}$ denotes its position in the previous generation. ${p}_{best,c{l}_{i}}$ denotes the protein with the best fitness in dynamic PPI cl_i, α is the scale factor that determines the influence of the best fitness, and rand is a random variable in the range (0, 1) employed to enhance the diversity of the population.

The position of the protein with the best fitness is updated using Eq. (11).

$${p}_{best,c{l}_{i},e}=\beta \times {p}_{center,c{l}_{i}}$$

(11)

where β ∈ (0, 1) is a scale factor that regulates the effect of ${p}_{center,c{l}_{i}}$ on ${p}_{best,c{l}_{i},e}$. ${p}_{center,c{l}_{i}}$ is the centre of dynamic PPI cl_i, and its d-th dimension is computed using Eq. (12).

$${p}_{center,c{l}_{i},d}=\frac{1}{{n}_{c{l}_{i}}}\times \sum _{e=1}^{{n}_{c{l}_{i}}}{p}_{c{l}_{i},e,d}$$

(12)

where 1 ≤ d ≤ D denotes the d-th dimension and D is the total number of dimensions, ${n}_{c{l}_{i}}$ is the number of proteins in dynamic PPI cl_i, and ${p}_{c{l}_{i},e,d}$ is the d-th dimension of the protein position ${p}_{c{l}_{i},e}$. The centre of dynamic PPI cl_i is obtained through D computations of Eq. (12). The pseudocode for the dynamic PPI updating operator is depicted in Algorithm 1.
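The clan updating operator of Eqs (10–12) might be sketched in Python as below. How a protein is encoded as a position vector is not specified in the text, so plain lists of floats are used as a stand-in; the fitness callable (higher is better), the single random draw per protein and the function name are likewise assumptions. The defaults α = 0.5 and β = 0.1 are the values reported in the parameter analysis.

```python
import random


def clan_update(clan, fitness, alpha=0.5, beta=0.1, rng=random):
    """Return the updated positions of one clan (one dynamic PPI).

    clan is a list of position vectors; fitness maps a vector to a score.
    """
    best_idx = max(range(len(clan)), key=lambda e: fitness(clan[e]))
    best = clan[best_idx]
    dims = len(best)
    # Eq. (12): centre of the clan, one average per dimension.
    center = [sum(p[d] for p in clan) / len(clan) for d in range(dims)]
    updated = []
    for e, p in enumerate(clan):
        if e == best_idx:
            # Eq. (11): the fittest member is moved relative to the centre.
            updated.append([beta * c for c in center])
        else:
            # Eq. (10): the others move towards the fittest member.
            r = rng.random()  # one draw per protein (assumption)
            updated.append([p[d] + alpha * (best[d] - p[d]) * r
                            for d in range(dims)])
    return updated
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.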

Clan separating operator

To enhance the search capacity of the proposed method, the unclustered proteins and the clusters with the lowest fitness are removed in every generation, as given in Eq. (13)¹⁹.

$${p}_{worst,c{l}_{i}}={p}_{min}+({p}_{max}-{p}_{min}+1)\times rand$$

(13)

where p_max and p_min are the upper and lower bounds of the position of an individual protein, and ${p}_{worst,c{l}_{i}}$ is the protein or complex with the lowest fitness. rand is a random variable drawn from a uniform distribution in the range (0, 1). The pseudocode for the clan separating operator is given in Algorithm 2.
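A small Python sketch of the replacement rule in Eq. (13) follows. It covers only the re-initialization of the least fit member and not the removal of unclustered proteins; the list-of-floats position encoding, the fitness callable and the independent draw per dimension are assumptions of this illustration.

```python
import random


def clan_separate(clan, fitness, p_min, p_max, rng=random):
    """Replace the least fit member of a clan by a random position (Eq. 13)."""
    worst_idx = min(range(len(clan)), key=lambda e: fitness(clan[e]))
    updated = [list(p) for p in clan]
    # As written in Eq. (13), each value lies in [p_min, p_max + 1).
    updated[worst_idx] = [p_min + (p_max - p_min + 1) * rng.random()
                          for _ in clan[worst_idx]]
    return updated
```

All other members are copied unchanged, so the operator can be applied after the clan updating step without side effects on the input list.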

Depending on the clan updating and separating operator, the module of the proposed algorithm is framed as given in Algorithm 3.

The relationship between DMCL-EHO and the protein complex is given in Table 1.

Table 1 The association between the components of DMCL-EHO and the protein complex

Experimental Results

Datasets

In this experiment, the datasets that contain interactions for both Saccharomyces cerevisiae and Homo sapiens are DIP²¹, BioGRID²² and STRING²³. The benchmark PPI datasets employed only for Saccharomyces cerevisiae are Gavin2 and Gavin6²⁴, Krogan-core and Krogan-extended²⁵, Collins²⁶, and WI-PHI²⁷. The Gavin + Krogan dataset was generated by merging the Gavin and Krogan-core datasets. The PPI datasets employed only for Homo sapiens are HPRD²⁸, HPID²⁹ and PIPs³⁰. Table 2 shows the list of datasets used in this experiment.

Table 2 List of datasets and gold standard benchmark databases.

The gene expression data used in this study for Saccharomyces cerevisiae (GSE3431)³¹ and Homo sapiens (GSE3933)³² are taken from the GEO database.

The predicted complexes are compared to gold standard benchmark databases such as CYC2008³³, MIPS³⁴ and SGD³⁵ for Saccharomyces cerevisiae and the PCDq³⁶ benchmark dataset for Homo sapiens. The percentage of overlapping interactions among the datasets in Gavin2 is 32%, Gavin6 is 53%, Krogan-core is 46%, Collins is 56%, HPRD is 23%, PIPs is 57%, DIP is 2%, BioGRID is 55% and STRING is 47%³⁷,³⁸.

Performance measures

To evaluate and compare the clustering results, the predicted protein complexes were matched against the gold standard benchmark protein complexes. Let P_r(V_Pr, E_Pr) and B_e(V_Be, E_Be) denote a predicted protein complex and a benchmark protein complex, each given by its set of vertices (proteins) and edges (interactions).

Complex similarity score (CSS)

CSS measures the closeness of two protein complexes, a predicted complex (P_r) and a benchmark complex (B_e), and is computed by Eq. (14).

$$CSS({P}_{r},{B}_{e})=\frac{|{V}_{Pr}\cap {V}_{Be}{|}^{2}}{|{V}_{Pr}|\ast |{V}_{Be}|}$$

(14)

where V_Pr and V_Be denote the sets of proteins in the predicted and benchmark protein complexes. If CSS(P_r, B_e) is equal to 0, the predicted and benchmark protein complexes do not share any proteins. Conversely, if CSS(P_r, B_e) is equal to 1, the predicted complex P_r(V_Pr, E_Pr) has exactly the same nodes as the benchmark complex B_e(V_Be, E_Be). Here, if CSS(P_r, B_e) > 0.2, the predicted and benchmark protein complexes are considered to match³⁹.

To assess the performance of the predicted protein clusters, five commonly employed measures are used: Precision, Recall, F-Measure, Coverage Ratio and Accuracy.
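As a short illustration, the score can be computed from two sets of protein identifiers as follows, reading the numerator of Eq. (14) as the squared size of the intersection of the two protein sets. The function name and the handling of empty complexes are assumptions of this sketch.

```python
def css(predicted, benchmark):
    """Complex similarity score between two complexes given as protein sets."""
    predicted, benchmark = set(predicted), set(benchmark)
    if not predicted or not benchmark:
        return 0.0  # convention for empty input (assumption)
    common = len(predicted & benchmark)
    return common ** 2 / (len(predicted) * len(benchmark))
```

For example, a predicted complex {A, B, C} and a benchmark complex {B, C, D, E} share two proteins, which gives 4 / 12 and therefore counts as a match under the 0.2 threshold.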

Precision

Precision is the fraction of predicted protein complexes that match a benchmark protein complex. A high precision value indicates that the predicted complexes are likely to be true positives. The precision of the protein complexes is computed by Eq. (15).

$$Precision=\frac{{N}_{Pc}}{|Predicte{d}_{set}|}$$

(15)

Recall

Recall is the fraction of benchmark protein complexes that are matched by a predicted complex. A high recall value indicates that the predicted complexes cover a large share of the gold standard complexes. The recall of the protein complexes is computed by Eq. (16).

$$Recall=\frac{{N}_{Bc}}{|Know{n}_{set}|}$$

(16)

where N_Pc is denoted as the number of predicted complexes which match at least one recognized benchmark complex, N_Bc is denoted as the number of recognised benchmark complexes which match at least one predicted complex, Predicted_set is denoted as the set of complexes predicted by the proposed approach and Known_set is denoted as the set of recognised gold standard benchmark protein complexes.

Coverage ratio (CR)

CR is the fraction of proteins in the benchmark complexes that are found in the predicted complexes, and it is computed by Eq. (17).

$$CR=\frac{{\sum }_{i}{\max }_{j}\,{T}_{i,j}}{{\sum }_{i}|{V}_{Be,i}|}$$

(17)

where V_{Be,i} is the set of proteins in the i-th benchmark protein complex and T_{i,j} is the number of proteins common to the i-th benchmark complex and the j-th predicted complex.
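The coverage ratio can be sketched in Python as below, taking for each benchmark complex its largest overlap with any predicted complex and dividing the summed overlaps by the total size of the benchmark complexes. Complexes are assumed to be given as collections of protein identifiers, and the function name is illustrative.

```python
def coverage_ratio(benchmark, predicted):
    """Share of benchmark proteins covered by the best-matching predictions."""
    total = sum(len(set(b)) for b in benchmark)
    if total == 0:
        return 0.0
    covered = sum(
        max((len(set(b) & set(p)) for p in predicted), default=0)
        for b in benchmark
    )
    return covered / total
```

With benchmark complexes {A, B, C} and {D, E} and predictions {A, B}, {D, E, F} and {C}, the best overlaps are 2 and 2, so the ratio is 4 / 5.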

F-Measure

F-Measure is the harmonic mean of precision and recall, which balances the two, and it is computed by Eq. (18).

$$F-Measure=\frac{2\,(Precision\ast Recall)}{(Precision+Recall)}$$

(18)

Accuracy

Accuracy is the geometric mean of precision and recall, i.e. the trade-off between them, and it is computed by Eq. (19).

$$Accuracy=\sqrt{Precision\ast Recall}$$

(19)
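Eqs (15), (16), (18) and (19) can be combined in one short Python sketch. A predicted and a benchmark complex are taken to match when their complex similarity score exceeds 0.2, as stated for CSS; the helper is repeated here so that the block is self-contained, and the function names and the tuple return value are choices made for this illustration.

```python
import math


def css(predicted, benchmark):
    """Squared overlap divided by the product of the two complex sizes."""
    predicted, benchmark = set(predicted), set(benchmark)
    if not predicted or not benchmark:
        return 0.0
    return len(predicted & benchmark) ** 2 / (len(predicted) * len(benchmark))


def evaluate(predicted_set, known_set, threshold=0.2):
    """Return (precision, recall, f_measure, accuracy) for two complex lists."""
    # N_Pc: predicted complexes matching at least one benchmark complex.
    n_pc = sum(1 for p in predicted_set
               if any(css(p, b) > threshold for b in known_set))
    # N_Bc: benchmark complexes matching at least one predicted complex.
    n_bc = sum(1 for b in known_set
               if any(css(p, b) > threshold for p in predicted_set))
    precision = n_pc / len(predicted_set) if predicted_set else 0.0
    recall = n_bc / len(known_set) if known_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0  # Eq. (18)
    accuracy = math.sqrt(precision * recall)                      # Eq. (19)
    return precision, recall, f_measure, accuracy
```

Because both summary scores are means of precision and recall, either one drops to zero as soon as precision or recall is zero.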

Number of Clusters

The number of clusters is defined as the total quantity of clusters formed from the PPI network after the clustering process has been completed.

The performance measures such as coverage ratio, the number of clusters, precision, recall, f-measure and accuracy of the proposed method for Saccharomyces cerevisiae are compared with various datasets and existing algorithms against CYC2008 benchmark database and the graphical representation of the comparison is depicted in Figs 2–7. Also, the performance measures such as coverage ratio, the number of clusters, precision, recall, f-measure and accuracy of the proposed method for Homo sapiens are compared with various datasets and existing algorithms against PCDq benchmark database and the graphical representation of the comparison is depicted in Figs 8 and 9. The comparison of performance measures for the proposed method with various datasets and existing algorithms against the MIPS and SGD benchmark database for Saccharomyces cerevisiae is given in supplementary material.

From Figs 2 and 8, it is inferred that the proposed method produces fewer clusters than FOCA, AFA-MCL and ACO-MCL, as those methods try to obtain a solution from all the proteins in the network. They do not discard the undesirable proteins, which may result in false positives. In the proposed method, by contrast, clusters with fewer than three proteins are discarded. Hence the precision, recall, F-Measure and accuracy are high for the proposed method.

From Figs 3 and 8, it is observed that the proposed method has a higher coverage ratio than the existing methods since it employs the iterated clustering approach. This enhances the coverage of proteins in the network, as many of the proteins in the benchmark complexes are found in the predicted complexes. From Figs 4–7 and 9, it is observed that the precision, recall, F-Measure and accuracy show fluctuations for PSO-MCL, ACO-MCL, AFA-MCL, F-MCL, FOCA and EHO-MCL. The mean of these measures over all the datasets shows that the proposed method performs better than the existing methods because it employs the dynamic PPI network along with EHO.

Implementation and Discussion

Attaining a highly accurate solution for protein complex detection from the dynamic PPI network is still a challenging computational task. In this paper, the elephant herd optimization algorithm is combined with the Markov clustering technique to solve the protein complex detection problem. The proposed method improves on the results of the other popular existing methods considered. This work was executed on a 2.00 GHz Intel i3 machine with 8 GB of memory running Windows 10.

On average, the number of clusters is small compared to other existing methods, due to the deletion of proteins without interactions. Here, a cluster must contain at least three proteins to be considered a protein complex; protein clusters with fewer than three proteins are removed. The proposed method was evaluated with respect to noise removal, insertion and deletion of random protein interactions, a large PPI network (WI-PHI), parameter analysis, statistical significance and finally biological significance.

Evaluation by noise removal

Because PPI networks are obtained from high-throughput experiments, their large coverage comes with noise in the form of false positive interactions and redundant data. The main challenge of clustering these PPI networks therefore lies in the networks themselves. In this method, after the clustering process is complete, the proteins that are not present in any cluster are also considered as noise. These solitary proteins, which do not interact with any other protein, do not provide any valuable information. The minimum number of proteins inside a cluster is set to three in this work. Thus, isolated proteins and clusters with fewer than three proteins are considered as noise and are removed by the clan separating operator of the elephant herd optimization method. Many evolutionary approaches inherit the undesirable proteins from one generation to the next, which may lead to a loss of accuracy, whereas the EHO approach discards the undesirable proteins from the population in the clan separating operator, which helps it approach the optimal solution. The comparison of EHO with other existing methods is depicted in Figs 2–9.

Evaluation by adding and removing random protein interactions

The proposed method is tested by inserting and deleting random interactions in the PPI network to evaluate its performance. The noise can be missing information (false negatives) or added noise (false positives) in the PPI network. The DIP dataset is used for the evaluation of adding and removing random interactions. Missing information is simulated by randomly removing a proportion of the edges (0%, 20%, 40%, 60%, 80%), and false positive information is simulated by randomly adding a proportion of edges (0%, 20%, 40%, 60%, 80%, 100%). The performance of the proposed method when adding and removing random interactions is depicted in Figs 10 and 11.

From Figs 10 and 11, it is observed that even when random insertions and deletions of protein interactions are applied to the dataset, the proposed method performs better than the other existing approaches.
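The perturbation itself can be sketched in Python as follows. The text does not say how the random edges are drawn, so this sketch assumes uniform sampling without replacement, proportions measured against the original number of edges, and edges stored as sorted pairs (u, v) with u < v; the function names are illustrative. Listing every candidate node pair is only practical for small networks.

```python
import random
from itertools import combinations


def remove_random_edges(edges, fraction, rng):
    """Simulate false negatives: drop a share of the edges at random."""
    edges = sorted(edges)
    keep = len(edges) - round(fraction * len(edges))
    return set(rng.sample(edges, keep))


def add_random_edges(nodes, edges, fraction, rng):
    """Simulate false positives: add random pairs that are not yet edges."""
    edges = set(edges)
    candidates = [e for e in combinations(sorted(nodes), 2) if e not in edges]
    extra = min(round(fraction * len(edges)), len(candidates))
    return edges | set(rng.sample(candidates, extra))
```

Calling both functions with a `random.Random` instance created from a fixed seed reproduces the same perturbed networks across runs.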

Evaluation by large PPI network WI-PHI dataset

In addition, to analyse the performance of the proposed method on a large PPI dataset, the WI-PHI²⁷ dataset of Saccharomyces cerevisiae was employed, which comprises 5955 proteins and 50,000 protein interactions. The proposed method and the existing methods were executed on this large dataset, and the predicted clusters were compared with the various gold standard benchmark databases. The comparison of the existing and proposed methods on the WI-PHI dataset is depicted in Figs 2–7.

Evaluation by parameter analysis

Generally, every metaheuristic approach relies on some stochastic distribution. Hence, different runs produce different results. This work performs 500 independent runs in order to score the optimal solution. In general, 20 clans were employed, as per the literature. The execution is terminated if the best result generated in each iteration remains unchanged for 100 successive iterations or the maximum number of generations is reached. The parameter values were adjusted based on the experimental results. It was found that α = 0.5 and β = 0.1 produced better solutions than the other values tried, and hence these values were used. It was observed that the optimal solution was identified after the 315th generation. For all the performance measures, there were fluctuations during the first 10 runs of the experiment, and the results were stable in later runs. Figures 2–9 show the average outcome of the performance measures for the above parameter values of the proposed method. Table 3 shows the parameter values for the proposed approach and the other existing approaches to protein complex detection.
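The termination rule described above can be illustrated with a small Python loop. The callable `step`, which is assumed to run one generation and return the best fitness found in it, and the function name are hypothetical; only the rule itself (stop after 100 generations without improvement or at the generation limit) comes from the text.

```python
def run_until_stalled(step, max_generations, patience=100):
    """Run generations until the best score stalls or the limit is reached.

    Returns (best_score, generations_run).
    """
    best = None
    stalled = 0
    generation = 0
    for generation in range(1, max_generations + 1):
        score = step(generation)
        if best is None or score > best:
            best, stalled = score, 0  # improvement resets the counter
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best, generation
```

With a score that stops improving at generation 5, the loop ends at generation 105 when the limit is large enough, and at the limit itself otherwise.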

Table 3 Various Parameter Values of proposed and existing methods for protein complex detection.

Evaluation by statistical significance

The proposed method was also assessed using a non-parametric test, the Wilcoxon matched-pair signed-rank test, applied to each pair of approaches to establish statistical significance. The difference between the F-Measure and Accuracy values for every entry in Figs 6 and 7 was tested at a significance level of 1% (p-value < 0.01). P-values less than 0.01 are regarded as highly significant and values greater than 0.01 as insignificant. Only the F-Measure and Accuracy scores are considered, as they are computed from precision and recall. The statistical significance of the proposed and existing approaches based on F-Measure and Accuracy is given in Table 4. The scores in the upper right positions of the table are obtained from the F-Measure of the proposed and various existing algorithms on the DIP dataset against the CYC2008 benchmark database. The scores in the lower left positions of the table are obtained from the Accuracy of the proposed and various existing algorithms on the DIP dataset against the CYC2008 benchmark database. From Table 4, it is seen that the results of the proposed method differ significantly from those of all the existing methods.

Table 4 Statistical Significance of proposed and existing approaches based on F-Measure and Accuracy.
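
For readers who want to reproduce this kind of pairwise comparison, the following is a minimal sketch of an exact two-sided Wilcoxon matched-pair signed-rank test in plain Python. It is not the authors' code. The function name and the two score lists are assumptions made for illustration; the scores are invented and are not values from Table 4.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon matched-pair signed-rank test.

    Returns (W, p) with W = min(W+, W-). Zero differences are dropped and
    tied absolute differences receive their mean rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution of W+: every rank is signed + or - with
    # probability 1/2. Doubled ranks keep the subset sums integral.
    doubled = [int(round(2 * r)) for r in ranks]
    total = sum(doubled)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    lower_tail = sum(counts[: int(round(2 * w)) + 1])
    p = min(1.0, 2 * lower_tail / 2 ** n)
    return w, p


# Invented F-Measure scores for two methods on eight paired settings.
method_a = [0.61, 0.58, 0.66, 0.70, 0.63, 0.59, 0.68, 0.72]
method_b = [0.55, 0.51, 0.62, 0.61, 0.60, 0.49, 0.63, 0.64]
w, p = wilcoxon_signed_rank(method_a, method_b)
significant = p < 0.01   # the 1% level used in the text
```

The exact enumeration is practical for the small numbers of paired scores typical of such comparisons; for large samples a normal approximation is usually used instead.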


Evaluation by biological significance

Many existing methods solve the protein complex detection problem on the basis of topological similarity alone. To yield useful biological information, however, a computational method should also produce biologically significant results. The proposed method is therefore evaluated with a biological significance test. The gold-standard benchmark databases are manually annotated using information from biological experiments, so the protein complexes detected by the proposed method are compared and matched with these benchmark databases. The CYC2008, MIPS and SGD databases for Saccharomyces cerevisiae and the PCDq database for Homo sapiens are used to assess the proposed method. Table 5 lists the protein complexes common to the CYC2008 benchmark database and the proposed method for the DIP and Krogan-extended datasets, together with the common biological process, molecular function and cellular component of the obtained protein complexes. The common pathway annotations of the predicted protein complexes, obtained from the KEGG database, are also listed.

Table 5 Top 5 Common Protein Complexes, Gene Ontology Functions and KEGG Pathways of the Predicted Complexes of proposed method.
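
Matching predicted complexes against a benchmark can be sketched as below. This is an illustration under stated assumptions: the overlap score |A ∩ B|² / (|A| · |B|) and the threshold of 0.2 are common choices in the protein complex literature and stand in for the complex similarity score defined earlier in the paper, whose exact form is not restated in this section. The function names, protein identifiers and complex names are hypothetical.

```python
def overlap_score(predicted, reference):
    """Overlap between two protein sets: |A & B|^2 / (|A| * |B|)."""
    shared = len(predicted & reference)
    return shared * shared / (len(predicted) * len(reference))


def common_complexes(predicted, benchmark, threshold=0.2):
    """List (predicted index, benchmark name, score) for every pair whose
    overlap score reaches the threshold (the threshold is an assumption)."""
    matches = []
    for index, complex_ in enumerate(predicted):
        for name, reference in benchmark.items():
            score = overlap_score(complex_, reference)
            if score >= threshold:
                matches.append((index, name, score))
    return matches


# Toy data with placeholder protein identifiers.
predicted = [{"P1", "P2", "P3"}, {"P8", "P9"}]
benchmark = {"toy complex A": {"P1", "P2", "P3", "P4"},
             "toy complex B": {"P5", "P6"}}
matches = common_complexes(predicted, benchmark)
```

Here the first predicted complex shares three of its proteins with the four-member "toy complex A", giving a score of 9 / 12 = 0.75, while the second predicted complex matches nothing.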


Gene ontology and KEGG pathway enrichment analysis of the predicted complexes was performed with the DAVID gene function classification online tool. The overall enrichment score of each predicted complex and the enrichment scores of the respective gene ontology terms and KEGG pathways are given in Table 5. Pictorial representations of the common RNA Polymerase KEGG pathway of the predicted protein complex on the Krogan-extended dataset and of the common Oxidative Phosphorylation KEGG pathway of the predicted protein complex on the DIP dataset are provided in the supplementary information. RNA polymerase is essential for nucleolar assembly and for a high polymerase loading rate. Oxidative phosphorylation is the metabolic pathway in which cells use enzymes to oxidize nutrients, thereby releasing energy that is used to produce adenosine triphosphate (ATP)^40,41,42. The top 5 common protein complexes, gene ontology functions and KEGG pathways of the complexes predicted by the proposed method are shown as Venn diagrams in Figs 12 and 13.
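
The idea behind such an enrichment score can be illustrated with a plain hypergeometric tail probability: the chance of seeing at least the observed number of annotated proteins in a complex if its members were drawn at random. This is only a sketch of the general principle. DAVID reports its own EASE score, a more conservative variant of Fisher's exact test, which is not reproduced here; the function name and the numbers in the example are assumptions.

```python
from math import comb


def enrichment_p_value(population, annotated, complex_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, complex_size).

    population   : number of proteins in the background set
    annotated    : background proteins carrying the term or pathway
    complex_size : number of proteins in the predicted complex
    hits         : complex members carrying the term or pathway
    """
    upper = min(annotated, complex_size)
    total = comb(population, complex_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, complex_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background proteins, 3 annotated, a 3-protein complex
# in which all 3 members carry the annotation.
p = enrichment_p_value(population=10, annotated=3, complex_size=3, hits=3)
```

In this toy case only one of the 120 possible three-protein draws consists entirely of annotated proteins, so the probability is 1/120. Smaller values indicate stronger enrichment.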

Execution time

Besides accuracy, the time required to detect the dynamic protein complexes is also an important factor. Processing benchmark datasets with different numbers of proteins and interactions takes more time because of the stochastic optimization methods involved. Because not all methods were available on the same platform, many of the approaches were executed on virtual machines, which prevented an exact comparison of their relative execution times. Thus, only the average execution times of SR-MCL, ACO-MCL, PSO-MCL, AFA-MCL, F-MCL, FOCA and EHO-MCL are displayed in Fig. 14.

Fig. 14 shows that the proposed algorithm has a lower execution time than the other algorithms, owing to the clan separating operator of the EHO approach. It is inferred that the proposed EHO-MCL is efficient for detecting dynamic protein complexes. In future, EHO-MCL can be further optimized for multicore CPUs.

Conclusion

Protein complex detection remains an open problem. Solutions to it need continual improvement because protein complexes are important in the analysis of biological processes. Although the volume of PPI data has grown through high-throughput experiments, accurate computational models for protein complex detection are still lacking. Much of the existing research was carried out on static PPI data, which does not provide accurate biological results. In the proposed method, the static PPI data is therefore first converted into dynamic PPI data by integrating gene expression data. Each dynamic subnetwork is then clustered with the popular clustering technique MCL combined with the elephant herd optimization method, which explores and exploits the search space for a better solution. The proposed method was applied to 11 widely used datasets and the predicted complexes were compared with 4 benchmark databases. The proposed method was also evaluated by noise removal, by insertion and deletion of random protein interactions, on a large PPI dataset, by parameter analysis, and by statistical and biological significance. In every evaluation phase, the proposed method outperformed all other existing approaches and identified the common protein complexes, Gene Ontology functions and KEGG pathways of the predicted protein complexes. As future work, additional information on the unknown protein complexes predicted by the proposed method is to be obtained with the help of biological experts. The proposed method can also be applied to and analyzed on weighted PPI networks, and experiments can be extended to various disease databases.

References

Yang, C., Ji, J. & Lv, J. Identifying Protein Complexes Method Based on Time-sequenced Association and Ant Colony Clustering in Dynamic PPI networks. Proc. IEEE 16th Int Conf on Bioinfo and Bioeng, 21–27 (2016).
Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinfo, 4(2) (2003).
Adamcsek, B., Palla, G., Farkas, I. J., Derényi, I. & Vicsek, T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8), 1021–1023 (2006).
van Dongen, S. Graph clustering by flow simulation. (Ph.D. thesis, University of Utrecht, 2000).
Wu, M., Li, X., Kwoh, C.K. & Ng, S.K. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 10(169) (2009).
Nepusz, T., Yu, H. & Paccanaro, A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 9(5), 471–472 (2012).
King, A. D., Przulj, N. & Jurisica, I. Protein complex prediction via cost-based clustering. Bioinform. 20, 3013–3020 (2004).
Liu, G., Wong, L. & Chua, H. Complex discovery from weighted PPI networks. Bioinformatics. 25(15), 1891–1897 (2009).
Maulik, U., Basu, S. & Ray, S. Identifying protein complexes in PPI network using non-cooperative sequential game. Sci Rep, 7(8410), (2017).
Seckiner, S. U., Eroglu, Y., Emrullah, M. & Dereli, T. Ant colony optimization for continuous functions by using novel pheromone updating. Appl. Math. Comput. 219, 4163–4175 (2013).
Zhang, Y. et al. Protein Complex Prediction in Large Ontology Attributed Protein-Protein Interaction Networks. IEEE/ACM Trans on Comput Biol and Bioinfo, 10(3) (2013).
Lakizadeh, A. & Jalili, S. BiCAMWI: A Genetic-Based Biclustering Algorithm for Detecting Dynamic Protein Complexes, PLoS ONE 11(7) (2016).
Shih, Y. K. & Parthasarathy, S. Identifying functional modules in interaction networks through overlapping Markov clustering. Bioinform. 28, 473–479 (2012).
Kennedy, J. & Eberhart, R. C. Particle swarm optimization. Proc of IEEE Int Conf on Neural Networks, IV, Piscataway, NJ, IEEE Press, 1942–1948 (1995).
Ma, Q. & Lei, X. Application of artificial fish school algorithm in UCAV path planning. Proc IEEE Fifth Int Conf on BioIns Comp: Theo and Appl, 555–559 (2010).
Lei, X., Wang, F., Wu, F. X., Zhang, A. & Pedrycz, W. Protein complex identification through Markov clustering with firefly algorithm on dynamic protein–protein interaction networks. Info Sci 329, 303–316 (2016).
Wang, J., Peng, X., Li, M. & Pan, Y. Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics 13(2), 301–312 (2013).
Vlasblom, J. & Wodak, S. J. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinform. 10(99) (2009).
Wang, G. G., Deb, S., Gao, X. Z. & Coelho, L. D. S. A new metaheuristic optimisation algorithm motivated by elephant herding behaviour. Int Jnl of Bio-Ins Compu 8(6), 394–409 (2016).
Tuba, V., Beko, M. & Tuba, M. Performance of Elephant Herding Optimization Algorithm on CEC 2013 real parameter single objective optimization. WSEAS Trans on Sys 16, 100–105 (2017).
Xenarios, I. et al. DIP, the Database of Interacting Proteins: A Research Tool for Studying Cellular Networks of Protein Interactions. Nuc Acids Res 30(1), 303–305 (2002).
Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–539 (2006).
Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nuc Acids Res. 45, D362–D368 (2017).
Gavin, A. C. et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 440, 631–636 (2006).
Krogan, N. J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 440, 637–643 (2006).
Collins, S. R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 6, 439–450 (2007).
Kiemer, L., Costa, S., Ueffing, M. & Cesareni, G. WI-PHI: a weighted yeast interactome enriched for direct physical interactions. Proteomics. 7(6), 932–43 (2007).
Keshava Prasad, T. S. et al. Human Protein Reference Database–2009 update. Nucl Acids Res. 37, D767–D772 (2009).
Han, K., Park, B., Kim, H., Hong, J. & Park, J. HPID: The Human Protein Interaction Database. Bioinfo. 20(15), 2466–2470 (2004).
McDowall, M. D., Scott, M. S. & Barton, G. J. PIPs: Human protein-protein interactions prediction database. Nucl Acids Res. 37, D651–D656 (2009).
Tu, B. P., Kudlicki, A., Rowicka, M. & McKnight, S. L. Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science 310(5751), 1152–1158 (2005).
Lapointe, J. et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA 101(3), 811–816 (2004).
Pu, S., Wong, J., Turner, B., Cho, E. & Wodak, S. J. Up-to-date catalogues of yeast protein complexes. Nuc Acids Res. 37(3), 825–31 (2009).
Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nuc Acids Res. 30(1), 31–34 (2002).
Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nuc Acids Res 26(1), 73–79 (1998).
Kikugawa, S. et al. PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from h-invitational protein-protein interactions integrative dataset. BMC Syst Biol, S2–S7 (2012).
Aragues, R., Garcia-Garcia, J. & Oliva, B. Integration and prediction of PPI using Multiple Resources from Public Databases. Jnl of Proteomics & Bioinfo. 1, 166–187 (2008).
Lehne, B. & Schlitt, T. Protein-protein interaction databases: keeping up with growing interactomes. Hum Genomics. 3(3), 291–297 (2009).
Li, X., Wu, M., Kwoh, C. K. & Ng, S. K. Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 11(1) (2010).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).
Kanehisa, M., Sato, Y., Furumichi, M., Morishima, K. & Tanabe, M. New approach for understanding genome variations in KEGG. Nucleic Acids Res. 47, D590–D595 (2019).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).

Acknowledgements

The authors thank Bharathiar University for providing the infrastructure to carry out this research work.

Author information

Authors and Affiliations

Department of Computer Science, Bharathiar University, Tamilnadu, India
R. Ranjani Rani, D. Ramyachitra & A. Brindhadevi

Contributions

All three authors, R. Ranjani Rani, D. Ramyachitra and A. Brindhadevi, contributed equally to this project by conducting the experiments, analyzing the results and writing the article.

Corresponding author

Correspondence to D. Ramyachitra.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rani, R.R., Ramyachitra, D. & Brindhadevi, A. Detection of dynamic protein complexes through Markov Clustering based on Elephant Herd Optimization Approach. Sci Rep 9, 11106 (2019). https://doi.org/10.1038/s41598-019-47468-y

Received: 17 July 2018
Accepted: 11 July 2019
Published: 31 July 2019
Version of record: 31 July 2019
DOI: https://doi.org/10.1038/s41598-019-47468-y
