Inference of epigenetic subnetworks by Bayesian regression with the incorporation of prior information

Jing, Anqi; Han, Jie

doi:10.1038/s41598-022-19879-x


Article
Open access
Published: 23 November 2022

Anqi Jing¹ &
Han Jie¹

Scientific Reports volume 12, Article number: 20224 (2022)

Abstract

Changes in gene expression have been thought to play a crucial role in various types of cancer. With the advance of high-throughput experimental techniques, many genome-wide studies are underway to analyze underlying mechanisms that may drive the changes in gene expression. It has been observed that the change could arise from altered DNA methylation. However, knowledge about the degree to which epigenetic changes might cause differences in gene expression in cancer is currently lacking. By considering the change of gene expression as the response to altered DNA methylation, we introduce a novel analytical framework to identify epigenetic subnetworks in which the methylation status of a set of highly correlated genes is predictive of the expression of a set of genes. After highly correlated modules are detected as representatives of the regulatory scenario underlying gene expression and DNA methylation, the dependency between DNA methylation and gene expression is explored by a Bayesian regression model that incorporates a g-prior, followed by a strategy of optimal predictor subset selection. The subsequent network analysis indicates that the detected epigenetic subnetworks are highly biologically relevant and contain many verified epigenetic causal mechanisms. Moreover, a survival analysis indicates that they might be effective prognostic factors associated with patient survival time.

Introduction

With the advance of high-throughput experimental techniques, a tremendous amount of genome-wide omics data has become available, which has revolutionized the study of cancer by making it possible to discover potential biomarkers and biological mechanisms at the genome level. The classic view holds that cancer progression is driven by genetic changes, including mutations and chromosomal abnormalities; later on, epigenetic alterations have also been considered crucial in the progression of cancer. Epigenetic alteration refers to functionally relevant changes to the genome that do not change DNA sequences¹, including DNA methylation, histone modification, etc. Such epigenetic alterations have been investigated in numerous studies^2,3, which revealed that they are likely to be responsible for the reduced or increased expression of DNA repair genes and to be the cause of the genetic instability characteristic of cancers in early cancer progression. DNA methylation is one epigenetic mechanism, which can alter gene expression by causing the stable silencing or activation of particular genes without changing DNA sequences. It persists throughout cell divisions and can be inherited by daughter cells, lasting for multiple generations. It is of great interest to cancer research since it is potentially reversible and could be returned to normal function with appropriate drugs, which makes it an excellent target for anticancer therapies⁴. Moreover, the change in DNA methylation leading to aberrant gene expression has been considered to play a crucial role not only in cellular development and differentiation but also in disease progression¹. Many studies have been conducted to identify aberrant DNA methylation sites in cancer^1,5,6, but there are fewer studies on the degree to which epigenetic changes might cause significant differences in gene expression in cancer. 
This motivates us to discover the association between epigenetic changes and altered gene expression and to investigate the varying level of DNA methylation that could drive differences in gene expression.

The Cancer Genome Atlas (TCGA) database⁷ has profiled and collected multidimensional omics data at DNA, RNA, protein and epigenetic levels for hundreds of clinical tumors, making the integrative analysis of epigenetic mechanisms at the whole genome level possible^8,9,10. For example, Hinoue et al.⁸ identified four DNA methylation-based subgroups of colorectal cancer exhibiting characteristic genetic and clinical features, which provided novel insights regarding the role of subgroup-specific DNA hypermethylation in gene silencing. Varley et al.⁹ provided an atlas of DNA methylation across diverse and well-characterized samples and analyzed dynamic DNA methylation patterns in 82 cell lines and tissues, which revealed the role of DNA methylation in gene regulation and disease. Although the pattern of DNA methylation has been extensively investigated, how gene modules or pathways are deregulated through DNA methylation is far from understood. More specifically, approaches for simultaneously analyzing methylation and gene expression data need to be developed to discover how DNA methylation deregulates gene expression in cancer. Konno et al.¹¹ have presented a mathematical method to simultaneously quantify the association between DNA methylation and transcription, and identified several candidates for conferring resistance to anti-cancer drugs in gastrointestinal cancer. Costa et al.¹² have selected multi-omic data from TCGA comprising gene expression, methylation profiles, mutational patterns and clinical information, and identified a negative correlation between differentially expressed and methylated genes, which suggested that the genes are regulated by methylation alteration patterns in their promoters. Dabrowski et al.¹³ have built a method upon the Monte Carlo Feature Selection and Interdependencies Discovery (MCFS-ID) algorithm to search the TCGA glioma dataset. 
They have discovered a set of significant gene expression profiles and DNA methylation sites, as well as their interdependencies, which proved to be a good predictor of glioma patients' survival. More specifically, they found that an important methylation feature (cg15072976) overlaps with the RE1 Silencing Transcription Factor (REST) binding site, intersects with the REST binding motif in human U87 glioma cells, and shows a positive association with patient survival time. West et al.¹⁴ have proposed a method, EpiMod, to address whether differential DNA methylation is associated with a given phenotype of interest in the context of a protein interaction network. It starts by constructing a weighted co-methylation network in the context of the human interactome model, in which the edge weight represents the association between the DNA methylation profiles of two connected genes, and subsequently applies a local community detection algorithm (spin-glass) to identify differential methylation hotspots around differentially methylated genes by maximizing the sum of weights. They demonstrated the existence of epigenetic modules associated with phenotypes by applying the method to cancer and ageing. However, this approach was restricted to DNA methylation data. Jiao et al.¹⁵ proposed a new approach, FEM, which expands EpiMod by defining the edge weight as the combination of two statistical associations, co-methylation and co-expression. Encoding the two associations into the edge weight allowed it to identify epigenetically deregulated modules in which genes show coordinated differential methylation and differential expression. They identified the previously known deregulated pathway driving endometrial cancer development and an upstream regulator of the well-known progesterone receptor tumour suppressor pathway. 
It is well acknowledged that anti-correlations exist between DNA methylation and gene expression, i.e., changes in DNA methylation cause the silencing of gene expression. The method FEM detects anti-correlated epigenetic modules; however, a recent study⁹ found a positive correlation between the two types of data, in which increased methylation is associated with a higher level of gene expression. By assuming the existence of both negative and positive correlations, Ma et al.¹⁶ have proposed a multiple network algorithm, EMDN, which constructs differential co-expression and differential co-methylation networks respectively and subsequently identifies the common modules that appear in both networks. EMDN can recognize both positively and negatively correlated modules. They demonstrated that the identified modules can serve as biomarkers to predict breast cancer subtypes and estimate the survival time of patients. However, Wang et al.¹⁷ pointed out that only a small proportion of the alterations in DNA methylation lead to a corresponding change in gene expression at the same gene; therefore, identifying gene modules restricted to the association between DNA methylation and gene expression at the same or adjacent genes may miss important links between the two changes. To overcome this limitation, they have proposed a multivariate regression framework, NsRRR, to identify relationships between any varying level of DNA methylation and changes in the expression of any genes. By considering expression levels of genes as responses to DNA methylation levels, they extracted groups of genes in which the expression level of a subset of genes could be regressed on the DNA methylation level of the remaining genes.

Inspired by NsRRR, to further understand the relationship between gene expression and DNA methylation, we propose a novel framework to identify epigenetic subnetworks consisting of a set of genes with aberrant gene expression or aberrant DNA methylation level, in which the altered gene expression is deregulated by DNA methylation. Different from NsRRR¹⁷, which evaluated the association at the individual gene level, we extract a high-level representation of the regulatory scenarios underlying both gene expression and DNA methylation in the form of gene modules, and then quantify the association between DNA methylation and gene expression at the module level by a regression model. Since module-level analysis could increase the association signal and provide insight into the biological behaviours¹⁸, we expect that regression analysis at the module level could provide complementary information to the analysis at the single gene level¹⁷ and shed light on the discovery of new epigenetic mechanisms.

In this work, we propose an analytical framework for the discovery of epigenetic subnetworks in which aberrant gene expression is deregulated by DNA methylation levels. More specifically, we consider aberrant gene expression as the response to DNA methylation predictors. The framework starts with the discovery of predictor and response modules on a weighted differential network, and subsequently quantifies the relationship between DNA methylation predictor modules and gene expression response modules via a Bayesian regression model that incorporates known protein-interaction priors. For each response module, the best subset of predictor variables is selected based on the Bayesian information criterion (BIC). Statistical significance tests and biological relevance analysis show that the detected epigenetic subnetworks are significantly correlated with breast cancer genes and pathways, and could be a starting point to uncover underlying epigenetic mechanisms in breast cancer and reveal potential therapeutic targets.

Here are the contributions of this work: (1) A novel analytical framework based on current statistical models is developed to detect an enriched set of epigenetic subnetworks by considering a set of highly correlated genes showing the pattern of differential co-expression/methylation instead of considering a single gene as a predictor or response. (2) This framework incorporates the prior biological knowledge as g-prior in a Bayesian regression model. It has detected more significantly correlated epigenetic subnetworks than alternative models without g-prior, indicating the effectiveness of encoding biological network information as g-prior in the selection of epigenetic subnetworks. (3) The network analysis for the detected epigenetic subnetworks reveals the direct causal mechanisms verified in the scientific literature. Finally, a survival analysis indicates that the derived modules might be effective prognostic factors associated with the patients’ survival time.

Methods

Overall framework

As shown in Fig. 1, our method consists of three main steps. First, differential networks are constructed for gene expression and DNA methylation data, respectively, in the context of the human interactome network. Edge weights are assigned according to the differential levels of gene co-expression (co-methylation) between tumor and normal conditions (Fig. 1a). Secondly, the similarity matrices are constructed by mapping the edge weights to the values of matrix elements. The gene expression and DNA methylation modules, in which genes are involved in a regulatory pattern, are discovered by applying nonnegative matrix factorization to the respective similarity matrix (Fig. 1b). Since multiple DNA methylation changes can drive a systematic change in an expression regulatory pattern, we consider DNA methylation modules as predictors and gene expression modules as responses. The relationships between DNA methylation predictor modules and gene expression response modules are quantified via a Bayesian regression model that incorporates prior biological relatedness encoded as a g-prior. For each response module, the best subset of predictor variables is selected based on BIC (Fig. 1c). Finally, each response module together with its corresponding predictors composes an epigenetic subnetwork, in which the differential expression regulatory pattern results from the varying level of DNA methylation of the predictors.

Detection of predictor and response modules

Construction of differential networks

Since differential expression (DE) has a cascading effect with the emergence of differential co-expression effects (DCE) due to the underlying biological network structures¹⁹, we combine the effects of DE and DCE in the context of a human protein-interaction network²⁰ to construct the differential gene expression network. The level of DCE between each pair of interacting genes in the protein-interaction network is evaluated. The Pearson correlation coefficients $\rho _t$ and $\rho _n$ are used to evaluate the correlation between two genes $X$ and $Y$ in tumor and normal conditions, respectively. Then the Fisher transformation is applied to the Pearson correlation coefficients. Recall that the Fisher transformation is defined as:

$$\begin{aligned} F(\rho )=\frac{1}{2}\text {ln}\frac{1+\rho }{1-\rho }. \end{aligned}$$

(1)

The statistic $Z$ is defined to assess the difference in gene correlation between tumor and normal conditions:

$$\begin{aligned} Z=\frac{F(\rho _{t})-F(\rho _{n})}{\sqrt{\frac{1}{n_{t}-3}+\frac{1}{n_{n}-3}}},\end{aligned}$$

(2)

where $n_{t}$ and $n_{n}$ denote the number of tumor and normal samples, respectively. The absolute value of $Z$ is used as the edge weight on a pair of interacting genes in the protein interaction network.

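The statistical significance of the statistic $Z$ is evaluated by a two-sided test against the standard normal distribution, whose cumulative distribution function is denoted by $\Phi$:

$$\begin{aligned} \text {p-value} =2 \times (1-\Phi (|Z|)). \end{aligned}$$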

(3)

After the Benjamini–Hochberg correction, gene pairs with adjusted p-value less than 0.05 are considered to be significantly differentially co-expressed between tumor and normal conditions. To filter out irrelevant genes and reduce false positives, genes that do not show significant DCE with differentially expressed genes in the network are removed.
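The edge-weight computation of Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names and the example correlations and sample sizes are assumptions made here for demonstration only.

```python
import math
from statistics import NormalDist


def fisher(rho):
    """Fisher transformation F(rho) = 0.5 * ln((1 + rho) / (1 - rho)), Eq. (1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def differential_coexpression(rho_t, rho_n, n_t, n_n):
    """Return (Z, two-sided p-value) for one gene pair, Eqs. (2) and (3)."""
    z = (fisher(rho_t) - fisher(rho_n)) / math.sqrt(1.0 / (n_t - 3) + 1.0 / (n_n - 3))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene pair: strongly correlated in tumor, weakly in normal tissue.
z, p = differential_coexpression(rho_t=0.8, rho_n=0.1, n_t=100, n_n=50)
edge_weight = abs(z)  # |Z| is used as the edge weight
```

In practice the p-values of all interacting gene pairs would be collected first and passed to `benjamini_hochberg` together, since the correction depends on the total number of tests.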

Analogously, the differential DNA methylation network is constructed in the same way.

Detection of predictor and response modules

The constructed differential networks provide valuable information about the gene regulatory patterns, since a larger value of the edge weight indicates a higher probability that the pair of genes is involved in a gene regulatory module. To detect gene modules, two similarity matrices $A_r[a_{ij}]$ and $A_p[a_{ij}]$ are constructed, based on the differential gene expression and DNA methylation networks of $n$ genes ($i, j = 1, 2,\ldots , n$), respectively. The matrix element $a_{ij}$ represents the value of the edge weight between gene $i$ and gene $j$ in the differential network. The discovery of module structures based on the similarity matrix can be formulated as a problem of symmetric nonnegative matrix factorization (NMF). The input similarity matrix $A_r[a_{ij}]$ $(A_p[a_{ij}])$ with size $n\times n$ can be factorized into a low rank matrix $H$ that encodes the latent information embedded in the original similarity matrix, i.e.

$$\begin{aligned} A\approx H\times H^T, \end{aligned}$$

(4)

where H with size $n\times k$ gives the information on module indicators, i.e., the matrix element $h_{ij}$ in H indicates the confidence of assigning gene $i$ to module $j$, where $i=1, 2,\ldots , n$ and $j=1, 2,\ldots , k$. The factorization of matrix $A$ can be achieved by minimizing the loss function:

$$\begin{aligned} \min _{H}\parallel A-H\times H^T\parallel , \quad \text {subject to}\ H\ge 0. \end{aligned}$$

(5)

We solve this problem using the algorithm SymNMF proposed by Kuang et al.²¹, which aims to detect cluster structures when data are embedded in a nonlinear structure. We employ SymNMF to factorize the symmetric matrices for gene expression and DNA methylation data, and exploit the module structure embedded in the graph representation. The output $H$ contains the information on module memberships. Specifically, for each row $h_i$ in $H$, gene $i$ is assigned to the $k$th cluster if $h_{ik}$ is the maximum element of that row.
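To make the factorization and the row-wise assignment rule concrete, the sketch below factorizes a toy similarity matrix and reads module labels from $H$. It deliberately swaps in a simple multiplicative update rule for symmetric NMF, which is an assumption made here for brevity and is not the SymNMF solver of Kuang et al.; the function name, the toy matrix and all parameter values are likewise illustrative.

```python
import numpy as np


def symmetric_nmf(A, k, n_iter=500, beta=0.5, seed=0):
    """Approximate A ~ H @ H.T with H >= 0 by damped multiplicative updates.

    Stand-in solver for illustration only (not the SymNMF algorithm cited
    in the text).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Scale the random start so that H @ H.T is on the same order as A.
    H = rng.random((n, k)) * np.sqrt(A.mean() / k)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        numer = A @ H
        denom = H @ (H.T @ H) + eps
        H *= (1.0 - beta) + beta * numer / denom
    return H


# Toy similarity matrix with two blocks of three genes each.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

H = symmetric_nmf(A, k=2)
# Gene i is assigned to the module with the largest entry in row i of H.
labels = H.argmax(axis=1)
```

The multiplicative factor is always positive, so nonnegativity of $H$ is preserved without an explicit projection step.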

Significance test leading to the optimal selection of predictor and response modules

The rank $K$ (i.e., the column number of matrix $H$) determines the number of modules, which is a key parameter that needs to be explored. The choice of $K$ in NMF is often an application-dependent and long-standing problem²². In this work, given the weighted differential network where the edge weight indicates the extent of the correlation between two genes, the task is to detect densely connected modules with high modularity. Note that not all detected modules by NMF would have above-average modularity, since it is possible that non-correlated genes would be grouped into a cluster representing isolated associations. Thus the modularity of the detected modules is evaluated by calculating module density²³:

$$\begin{aligned} density(M_{ik})=\frac{\sum _{p\in M_{ik}, q\in M_{ik}}A[a_{pq}]}{|M_{ik}|\times (|M_{ik}|-1)}, \end{aligned}$$

(6)

where $M_{ik}$ indicates the $i_{th}$ module under the rank $k$ and $A$ is the similarity matrix.

A permutation test is performed to assess the statistical significance of module density by randomly generating modules with the same size as the detected module in the background differential network. This procedure is repeated 1000 times, i.e., for each detected module, 1000 random modules are generated. Under the null hypothesis, the observed module is no denser than a random module of the same size, so the observed module is significant only if random modules rarely reach its density. Hence, the p-value of the observed module $M_{ik}$ is computed as the fraction of the 1000 random modules whose density is equal to or greater than that of the observed module:

$$\begin{aligned} p(M_{ik})=\frac{\sum _{b=1}^{1000}I\{density(M_{ik}^{b})\ge density(M_{ik})\}}{1000}, \end{aligned}$$

(7)

where $density(M_{ik}^{b})$ indicates the density score of the $b_{th}$ permuted module. If the adjusted p-value is less than 0.05 after the Bonferroni correction, the observed module is considered to be statistically significant.
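The density score of Eq. (6) and the permutation p-value of Eq. (7) can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity matrix is taken to have a zero diagonal, random modules are drawn uniformly from all genes in the network, and the function names and toy data are hypothetical.

```python
import numpy as np


def module_density(A, members):
    """Eq. (6): sum of edge weights inside the module over |M| * (|M| - 1)."""
    members = np.asarray(members)
    size = len(members)
    return A[np.ix_(members, members)].sum() / (size * (size - 1))


def density_pvalue(A, members, n_perm=1000, seed=0):
    """Eq. (7): fraction of random same-size modules at least as dense."""
    rng = np.random.default_rng(seed)
    observed = module_density(A, members)
    n = A.shape[0]
    hits = 0
    for _ in range(n_perm):
        random_members = rng.choice(n, size=len(members), replace=False)
        if module_density(A, random_members) >= observed:
            hits += 1
    return hits / n_perm


# Toy network of 10 genes with one dense module {0, 1, 2}.
A = np.zeros((10, 10))
A[:3, :3] = 5.0
np.fill_diagonal(A, 0.0)

observed_density = module_density(A, [0, 1, 2])
p_value = density_pvalue(A, [0, 1, 2])
```

With $m$ candidate modules, the Bonferroni-adjusted p-value is `min(1.0, p_value * m)`.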

We expect that with an appropriate value of the rank $K$, most significant modules showing local regulation patterns would be detected. A wide range of values of $K$ is explored and we select the value of $K$ leading to the largest number of detected significant modules as the optimal $K$. This procedure is performed for both DNA methylation and gene expression data. The optimal values $k_p$ and $k_r$ are obtained for DNA methylation predictor modules and gene expression response modules, respectively. With the optimal values of $k_p$ and $k_r$, modules with an adjusted p-value less than 0.05 are selected as DNA methylation predictor modules and gene expression response modules, respectively.

Module quality measures

The module density measures whether or not genes in identified modules are densely connected. The other measure of module quality, separability score²⁴, is employed to evaluate whether or not a detected module is well separated from other modules in the differential network. The separability score between two modules $M_i$ and $M_j$ is determined by the inter-module adjacency and intra-module adjacency:

$$\begin{aligned} separability(M_i, M_j)=1-\frac{interAdj(M_i,M_j)}{\sqrt{density(M_i)\times density(M_j)}}, \end{aligned}$$

(8)

where $density(M_i)$ and $density(M_j)$ are the intra module densities (defined in (6)) for module $M_i$ and $M_j$, respectively. The inter-module adjacency $interAdj(M_i, M_j)$ is defined as:

$$\begin{aligned} interAdj(M_i, M_j)=\frac{\sum _{p\in M_i}\sum _{q\in M_j}A[a_{pq}]}{n_i n_j}, \end{aligned}$$

(9)

where $A$ is the similarity matrix and $n_i$ and $n_j$ are the numbers of genes in $M_i$ and $M_j$, respectively. The closer the separability score is to 1, the more separated $M_i$ and $M_j$ are. A permutation test for the observed separability score is performed to obtain the significance level. The separability and density scores measure the separateness and homogeneity of the detected modules, respectively²⁴. We use these two measures to validate whether the modules are well detected.
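Equations (8) and (9) translate directly into code. The sketch below is illustrative only: the function names and the toy similarity matrix are assumptions, and the intra-module density of Eq. (6) is repeated so that the block is self-contained.

```python
import numpy as np


def module_density(A, members):
    """Eq. (6): intra-module density (zero diagonal assumed)."""
    size = len(members)
    return A[np.ix_(members, members)].sum() / (size * (size - 1))


def inter_adjacency(A, m_i, m_j):
    """Eq. (9): mean edge weight between genes of two modules."""
    return A[np.ix_(m_i, m_j)].sum() / (len(m_i) * len(m_j))


def separability(A, m_i, m_j):
    """Eq. (8): 1 minus inter-module adjacency over the geometric mean density."""
    scale = np.sqrt(module_density(A, m_i) * module_density(A, m_j))
    return 1.0 - inter_adjacency(A, m_i, m_j) / scale


# Toy matrix: two modules with weight 2 inside and weight 1 between them.
A = np.full((6, 6), 1.0)
A[:3, :3] = 2.0
A[3:, 3:] = 2.0
np.fill_diagonal(A, 0.0)

score = separability(A, [0, 1, 2], [3, 4, 5])
```

Here the inter-module adjacency is 1 and both densities are 2, so the score is 0.5; with no edges between the modules it would reach 1.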

Detection of epigenetic subnetworks

In this section, we aim to identify the relationships between predictors and responses by detecting the set of predictors that best explains the variation in expression in the response module. We use the eigengene²⁵ as the representative of each module in one synthetic profile, since it makes it easy to relate the module to the clinical trait of interest and it can also be used as a feature in more complex predictive models, including the Bayesian inference model²⁶. To select the best subset of predictors for each response, a Bayesian linear regression model with an informative g-prior is employed to compute all possible regression models for a response module. The biological relatedness between predictors and responses is encoded as an informative g-prior to guide the search for associations between predictors and responses. The best subset of predictors for each response is selected according to the Bayesian information criterion (BIC).

Module eigengene

We treat each module as a single unit by constructing the representative eigengene²⁵. The eigengene is defined as the first principal component based on singular value decomposition²⁷. In detail, let $Y=(y_{il})$ denote the gene expression profile for a response module, where $i=1,2,\ldots ,n$ denotes the index of genes and $l=1,2,\ldots ,m$ corresponds to the tumor samples. The expression profile for each gene, i.e., each row of $Y$, is standardized to have mean 0 and variance 1. The singular value decomposition of $Y$ is represented as:

$$\begin{aligned} Y=UDV^T, \end{aligned}$$

(10)

where $U$ is a matrix of size $n\times m$ with orthonormal columns, which are referred to as the left-singular vectors, $V$ is an orthogonal matrix of size $m \times m$ whose columns are referred to as the right-singular vectors, and $D$ is an $m \times m$ diagonal matrix of the singular values. The first column of $V$ is referred to as the module eigengene. Similarly, eigengenes of DNA methylation predictor modules are obtained from methylation profiles in the same way.

To evaluate if the module eigengene can represent the module profile well, we calculate the proportion of variances explained by the module eigengene²⁸ as follows:

$$\begin{aligned} varExplained(E) = \frac{|d^1|^2}{\sum _{j}|d^j|^2}, \end{aligned}$$

(11)

where $d^1$ is the first (largest) singular value in the diagonal matrix $D$. A large value of $varExplained$ indicates that the module eigengene is properly generated and represents the module profile well.
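The eigengene construction of Eq. (10) and the variance-explained score of Eq. (11) can be sketched with NumPy. The function name and the toy expression matrix are assumptions for illustration; the sketch uses the reduced (economy-size) SVD, which yields the same first right-singular vector.

```python
import numpy as np


def module_eigengene(Y):
    """Return (eigengene, variance explained) for a genes x samples matrix Y."""
    # Standardize each gene (row) to mean 0 and variance 1.
    Z = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, ddof=1, keepdims=True)
    # Reduced SVD: Z = U @ diag(d) @ Vt, singular values sorted in descending order.
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    eigengene = Vt[0]  # first right-singular vector, one value per sample
    var_explained = d[0] ** 2 / np.sum(d ** 2)
    return eigengene, var_explained


# Toy module: three genes that follow the same profile across five samples.
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.vstack([2.0 * base + 1.0, 3.0 * base, base])

eigengene, var_explained = module_eigengene(Y)
```

Because the three toy genes are perfectly co-expressed, the standardized matrix has rank one and the eigengene explains essentially all of the variance. Note that the sign of a singular vector is arbitrary, so the eigengene may be flipped relative to the underlying profile.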

Bayesian regression with g-prior

For ease of analysis, we assume that the response module is associated with a set of predictors via a linear regression model. Given a response module eigengene $Y_{i}$ and a set of predictor module eigengenes $X_{\gamma }$, the prediction error

$$\begin{aligned} \varepsilon _{i\gamma }= Y_{i}-\beta _{i\gamma } \times X_{\gamma }, \end{aligned}$$

(12)

is assumed to be independent and identically distributed with mean 0 and variance $\sigma _{i\gamma }^2$, where the parameter $\beta _{i\gamma }$ indicates the vector of regression coefficients. We assume that the response $Y_{i}$ conditional on $X_{\gamma }$ follows a multivariate normal distribution:

$$\begin{aligned} Y_{i}|X_{\gamma },\beta _{i\gamma },\sigma _{i\gamma }^{2} \sim Normal\bigg (\beta _{i\gamma } X_{\gamma }, \sigma _{i\gamma }^2I\bigg ), \end{aligned}$$

(13)

where $\sigma _{i\gamma }^2 I$ is a variance co-variance matrix that has error $\sigma _{i\gamma }^2$ on the diagonal and zeros for the remaining elements.

We employ Zellner’s g-prior²⁹ to include the prior biological relatedness between responses and predictors. Intuitively, the g-prior controls the uncertainty in the prior belief relative to the variance of the observations around the mean, and the prior distribution of $\beta _{i\gamma }$ conditional on variance $\sigma _{i\gamma }^2$ is formulated as:

$$\begin{aligned} \beta _{i\gamma }|\sigma _{i\gamma }^2 \sim Normal\bigg (\beta _{i\gamma }^0, g_{i\gamma }\bigg (X_{\gamma }^TX_{\gamma }\bigg )^{-1}\sigma _{i\gamma }^2\bigg ),\end{aligned}$$

(14)

where $\beta _{i\gamma }^0$ is the initial guess of the mean vector, the term $(X_{\gamma }^TX_{\gamma })^{-1}$ provides a data-dependent prior covariance structure, $\sigma _{i\gamma }^2$ is the error variance, and the resulting covariance matrix is scaled by a user-defined positive factor $g_{i\gamma }$. The prior information can be integrated into the model by changing the parameters of the prior distribution; thus we can propose a prior guess of the vector of regression coefficients and encode the corresponding prior belief in $g_{i\gamma }$.

The posterior distribution of $\beta _{i\gamma }$ is given by

$$\begin{aligned} p\bigg (\beta _{i\gamma }|\sigma _{i\gamma }^2,X_{\gamma },Y_{i}\bigg )\sim N\bigg (\frac{g_{i\gamma }}{g_{i\gamma }+1}\bigg (\frac{\beta _{i\gamma }^0}{g_{i\gamma }}+\beta _{i\gamma }^{ols}\bigg ), \frac{\sigma _{i\gamma }^{2}g_{i\gamma }}{g_{i\gamma }+1}\bigg (X_{\gamma }^TX_{\gamma }\bigg )^{-1}\bigg ),\end{aligned}$$

(15)

where $\beta _{i\gamma }^{ols}=(X_{\gamma }^TX_{\gamma })^{-1}X_{\gamma }^TY_{i}$ is the OLS estimator of $\beta _{i\gamma }$, and the vector of regression coefficients $\beta _{i\gamma }$ can be estimated by the posterior mean and prior $g_{i\gamma }$:

$$\begin{aligned} \widetilde{\beta _{i\gamma }} = \frac{1}{1+g_{i\gamma }}\beta _{i\gamma }^0+\frac{g_{i\gamma }}{1+g_{i\gamma }}\beta _{i\gamma }^{ols}.\end{aligned}$$

(16)

The posterior distribution of $\sigma _{i\gamma }^2$ follows the inverse-gamma distribution and can be estimated as

$$\begin{aligned} p\bigg (\sigma _{i\gamma }^2|X_{\gamma },Y_{i}\bigg )\sim IG\bigg (\frac{n_{\gamma }}{2},\frac{SSR_{i\gamma }}{2}+\frac{\bigg (\beta _{i\gamma }^0-\beta _{i\gamma }^{ols}\bigg )X^TX\frac{1}{1+g_{i\gamma }}\bigg (\beta _{i\gamma }^0-\beta _{i\gamma }^{ols}\bigg )}{2}\bigg ),\end{aligned}$$

(17)

where $n_{\gamma }$ is the number of predictors in $X_{\gamma }$ and $SSR_{i\gamma }$ is the sum of squares of the residuals of the OLS estimate $\beta _{i\gamma }^{ols}$.
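The posterior mean in Eq. (15) is a weighted average of the prior guess $\beta _{i\gamma }^0$ and the OLS estimate, with weights $1/(1+g_{i\gamma })$ and $g_{i\gamma }/(1+g_{i\gamma })$. A minimal sketch for a scalar $g$ is given below; the function name, the toy design matrix and the chosen value of $g$ are assumptions for illustration, not the authors' code.

```python
import numpy as np


def g_prior_posterior_mean(X, y, g, beta0=None):
    """Posterior mean of the coefficients under Zellner's g-prior (scalar g).

    X is a samples x predictors matrix of eigengenes, y a response eigengene.
    """
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    if beta0 is None:
        beta0 = np.zeros_like(beta_ols)  # prior guess of zero coefficients
    # Expanding the mean of Eq. (15): (beta0 + g * beta_ols) / (1 + g).
    return (beta0 + g * beta_ols) / (1.0 + g)


# Toy data in which the response is an exact linear function of two predictors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

beta_post = g_prior_posterior_mean(X, y, g=9.0)
```

With a zero prior mean and $g=9$, the OLS estimate $(2, -1)$ is shrunk by the factor $9/10$ towards zero; larger values of $g$ move the posterior mean closer to the OLS solution.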

Encode prior biological information as g-prior

We evaluate the biological relatedness between each response $Y_{i}$ and each predictor $X_{j}$ based on the human protein interaction network²⁰. The greater the number of interactions between genes in the predictor and genes in the response, the higher the degree of biological relatedness between them. We define the biological relatedness $r_{ij}$ between $X_{j}$ and $Y_{i}$ as:

$$\begin{aligned} r_{ij}=\frac{2E_{ij}}{N_{ij}}, \end{aligned}$$

(18)

where $N_{ij}$ is the total number of genes in predictor module $X_{j}$ and response module $Y_{i}$. The parameter $E_{ij}$ denotes the number of interactions between genes in the predictor and genes in the response.

Then a weight parameter $\mu$ is added to control the relative influence of the prior biological relatedness in the Bayesian regression model:

$$\begin{aligned} g_{ij}=\mu r_{ij}. \end{aligned}$$

(19)

When $\mu =0$, we treat all predictors equally and no prior information is included in the model. The larger the value of $g_{ij}$ is, the more confident we are that the predictor $X_j$ is associated with the response $Y_i$.
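Equations (18) and (19) can be illustrated with a small sketch. The gene names, the interaction list and the value of $\mu$ below are hypothetical, and the sketch assumes that $N_{ij}$ is the sum of the two module sizes and that each interaction is listed once.

```python
def relatedness(predictor_genes, response_genes, interactions):
    """Eq. (18): r = 2 * E / N for one predictor module and one response module."""
    pred, resp = set(predictor_genes), set(response_genes)
    # E: interactions with one endpoint in the predictor and one in the response.
    e = sum(
        1
        for a, b in interactions
        if (a in pred and b in resp) or (b in pred and a in resp)
    )
    n = len(pred) + len(resp)  # assumed definition of N
    return 2.0 * e / n


def g_weight(r, mu):
    """Eq. (19): scale the relatedness by the weight parameter mu."""
    return mu * r


# Hypothetical modules and protein interactions.
predictor = ["GENE_A", "GENE_B"]
response = ["GENE_C", "GENE_D"]
ppi = [("GENE_A", "GENE_C"), ("GENE_B", "GENE_D"), ("GENE_A", "GENE_B")]

r = relatedness(predictor, response, ppi)
g = g_weight(r, mu=10.0)
```

Only the two interactions that cross between the modules are counted; the interaction inside the predictor module does not contribute to $E_{ij}$.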

Two factors, the prior coefficient vector $\beta ^0_{i\gamma }$ and the scalar $g_{i\gamma }$, need to be set. In practice, we set $\beta _{i\gamma }^0$ to be a vector of zeros, which reflects our prior belief in a very subtle dependence between the predictors and responses. The parameter $g_{i\gamma }$ is originally formulated as a constant to control the confidence in the coefficient $\beta ^0_{i\gamma }$. Specifically, a large value of $g_{i\gamma }$ leads the regression coefficients to be centered around $\beta ^{ols}_{i\gamma }$. On the other hand, a small value of $g_{i\gamma }$ leads to a solution centered around $\beta ^0_{i\gamma }$. We extend the formulation of $g_{i\gamma }$ to a vector $\mathbf {g_{i\gamma }}$ to allow for different levels of control over the elements in $\beta ^0_{i\gamma }$. Each entry in $\mathbf {g_{i\gamma }}$ corresponds to one predictor, controlling the confidence in the prior belief relative to the variance of the observations around the mean. In this case, the vector $\mathbf {g_{i\gamma }}$ constructed for response $Y_i$ and the predictor set $X_{\gamma }$ is composed of $[g_{ix}]$, where $x$ indicates the index of predictors in the set $X_{\gamma }$.

Best predictor subset selection based on BIC

Assuming that $k_p$ predictors are obtained, there are $2^{k_p}$ combinations of predictor variables in total for each response. As a first choice, we use BIC as a measure to select the best model. Recall that BIC is defined as:

$$\begin{aligned} BIC=-2\text {ln}({\hat{L}})+k\text {ln}(n),\end{aligned}$$

(20)

where $n$ is the number of observations, $k$ is the number of parameters estimated by the model and ${\hat{L}}$ is the maximum likelihood value of the model. The expected value of BIC is calculated as:

$$\begin{aligned} E\bigg [BIC_{i\gamma }\bigg ]=nE\bigg [\ln (\sigma ^2_{i\gamma })\bigg ]+k_{\gamma }\ln (n), \end{aligned}$$

(21)

where n is the number of samples and $k_\gamma$ is the number of predictors in the set $\gamma$. The expected value of $\ln (\sigma ^2_{i\gamma })$ is calculated as:

$$\begin{aligned} E\bigg [\ln (\sigma _{i\gamma }^2)\bigg ]=Digamma\bigg (\frac{n}{2}\bigg )-\ln \bigg (\frac{SSR_{i\gamma }}{2}+\frac{\bigg (\beta _{i\gamma }^0-\beta _{i\gamma }^{ols}\bigg )G_{i\gamma }X_{\gamma }^TX_{\gamma }G_{i\gamma }\bigg (\beta _{i\gamma }^0-\beta _{i\gamma }^{ols}\bigg )}{2}\bigg ), \end{aligned}$$

(22)

where $G_{i\gamma }$ is a square matrix in which diagonal elements are $\sqrt{\frac{1}{1+\mathbf {g_{i\gamma }}}}$ and the remaining elements are all zeros, and $SSR_{i\gamma }$ is the sum of squares of the residuals of the ordinary least squares $\beta _{i\gamma }^{ols}$. Given a response module, the combination of predictors with the smallest expected value of BIC would be selected.
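The exhaustive search over all $2^{k_p}$ predictor subsets can be sketched as follows. To keep the example short, it scores each subset with the ordinary Gaussian BIC, $n\ln (SSR/n)+k\ln (n)$, computed from the OLS residuals; this is a simplifying assumption made here and is not the expected BIC under the g-prior defined in Eqs. (21) and (22). The function names and the simulated data are likewise illustrative.

```python
import itertools

import numpy as np


def gaussian_bic(X, y):
    """BIC of an OLS fit without intercept: n * ln(SSR / n) + k * ln(n)."""
    n, k = X.shape
    if k == 0:
        ssr = float(y @ y)  # empty model: residuals equal the response
    else:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        ssr = float(resid @ resid)
    return n * np.log(ssr / n) + k * np.log(n)


def best_subset(X, y):
    """Score every subset of predictor columns and return the one with minimal BIC."""
    k_p = X.shape[1]
    best, best_bic = (), np.inf
    for size in range(k_p + 1):
        for subset in itertools.combinations(range(k_p), size):
            cols = np.array(subset, dtype=int)
            bic = gaussian_bic(X[:, cols], y)
            if bic < best_bic:
                best, best_bic = subset, bic
    return best, best_bic


# Simulated eigengenes: only predictor 0 drives the response.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

best, best_bic = best_subset(X, y)
```

The search is exponential in the number of predictor modules, so it is only practical when $k_p$ is small, as is the case when predictors are modules rather than individual genes.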

Survival analysis

We hypothesized that the detected modules or subnetworks might be effective prognostic parameters that are associated with the survival time of patients. Thus, a survival analysis was performed. The coefficient in a Cox regression model is related to the hazard: a positive value represents a worse prognosis and a negative value indicates a positive association with survival time³⁰. We therefore devise prognostic index scores for patients based on the coefficients in the Cox regression model of each module or subnetwork. The prognostic index score for a patient $i$ with a response or predictor module $k$ is defined as:

$$\begin{aligned} PI_{ki} = \beta ^{cox}_k E_{ki}, \end{aligned}$$

(23)

where $\beta ^{cox}_k$ is the Cox regression coefficient for module $k$ and $E_{ki}$ is the value of eigengene of module $k$ for patient $i$.

For a subnetwork $k$, the multivariate Cox regression is performed and the prognostic index score for the patient $i$ is defined as:

$$\begin{aligned} PI_{ki} = \sum _{c\in k}{\beta ^{cox}_{c}E_{ci}}, \end{aligned}$$

(24)

where $\beta ^{cox}_c$ is the Cox regression coefficient for a module variable $c$ in the subnetwork $k$ and $E_{ci}$ is the value of the eigengene of module $c$ for the patient $i$.

Then we divide patients into two groups based on the prognostic index scores: low-risk (PI score below the $30{\text {th}}$ percentile of all PIs) and high-risk (PI score above the $70{\text {th}}$ percentile of all PIs). The Kaplan-Meier estimator is used to generate the survival curves for the two groups, followed by the log-rank test to assess the significance of the difference in survival time between the two groups.
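The scoring and grouping steps can be illustrated with a short, dependency-free sketch. It computes the prognostic index of Eqs. (23) and (24), splits patients at the 30th and 70th percentiles, and evaluates a basic Kaplan-Meier curve. The percentile interpolation rule, the function names and the toy inputs are assumptions; the Cox coefficients are taken as given, and the log-rank test is not implemented here.

```python
def prognostic_index(cox_coefs, eigengenes):
    """PI for one patient: sum over modules of coefficient * eigengene value."""
    return sum(b * e for b, e in zip(cox_coefs, eigengenes))


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def split_by_risk(pi_scores, low_q=30, high_q=70):
    """Return index lists (low_risk, high_risk); the middle band is dropped."""
    lo_cut = percentile(pi_scores, low_q)
    hi_cut = percentile(pi_scores, high_q)
    low = [i for i, s in enumerate(pi_scores) if s < lo_cut]
    high = [i for i, s in enumerate(pi_scores) if s > hi_cut]
    return low, high


def kaplan_meier(times, events):
    """Kaplan-Meier curve as [(time, S(time))] at the observed event times."""
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)   # events and censored
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


# Hypothetical single-module example: ten patients with eigengene values 0..9.
pi_scores = [prognostic_index([1.0], [float(v)]) for v in range(10)]
low, high = split_by_risk(pi_scores)
curve = kaplan_meier([5, 8, 8, 12], [1, 1, 0, 1])   # 0 marks a censored patient
print(low, high, curve)
```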

Results

Simulation study

To test whether the proposed Bayesian regression model can identify true relationships between predictors and responses, we first applied the method to simulated datasets. We simulated predefined epigenetic subnetworks consisting of a set of predictors and a response. The results of the simulation study characterize the ability of our method to detect true epigenetic subnetworks.

Simulation dataset

We generated three sets of studies corresponding to different strengths of association $c_r=(0.3, 0.5, 0.7)$ within response modules. In each study, we generated four predictor modules, and simulated different levels of associations from $\Phi = (0.03, 0.05, 0.1, 0.2, 0.3)$ between predictors and responses to detect the true relationships with respect to different strengths of associations. Since the structure of epigenetic subnetworks was known, the performance of the model can be evaluated by comparing the detected structure to the predefined structure.

First, we simulated four DNA methylation predictors $x_1, x_2, x_3, x_4$ corresponding to different correlation signals $c_p = (0.3, 0.5, 0.3, 0.5)$, with the same size $n \times p$, where $n$ and $p$ indicate the number of samples and variables, respectively. In practice, we set $n = 200$ and $p = 25$. Let $x_i^m$ denote the methylation level of the variable $m$ in the $i_{th}$ predictor module, which is generated as:

$$\begin{aligned} x_i^m\sim N(0, \Sigma _i^m), \end{aligned}$$

(25)

where $\Sigma _i^m \sim Inverse{-}Wishart(60, (1-c_i)I+c_iJ )$, $c_i$ is the association signal taken from $c_p$, $I_{p\times p}$ is the identity matrix and $J_{p\times p}$ is a matrix with all entries as 1.

We generated three response modules $y_1, y_2, y_3$ with size $n \times q$, corresponding to different levels of correlations, in order to test if different levels of correlation within responses would affect the outcome of our method. In practice, we set $n = 200$ and $q = 30$. Thus, three response modules with correlation signals $c_r = (0.3, 0.5, 0.7)$ were generated in a similar way as the predictor modules.

We further simulated subnetwork structures by assuming that specific predictors and responses contribute to the non-random associations. Let $x_j$ and $y_i$ denote the profile of the $j_{th}$ module in $x$ and the $i_{th}$ module in $y$, respectively. The dependency between response $y_i$ and a specific set of predictors is added. Thus the new profile of the $i_{th}$ response $y^S_{i}$ was obtained as follows:

$$\begin{aligned} y^S_{i}=y_{i}+\sum _{j\in S_{i}}{x_jA}+E, \end{aligned}$$

(26)

where $S_{i}$ indicates the set of predictors having associations with $y_i$, $A$ is a matrix of size $p\times q$ with elements carrying the association signal from $\Phi = (0.03, 0.05, 0.1, 0.2, 0.3)$, and $E$ is the random noise matrix of size $n\times q$ generated from the independent normal distribution with mean 0 and variance 1. In practice, we set $S_1$, $S_2$ and $S_3$ as $\{x_1\}$, $\{x_2\}$, $\{x_1, x_2\}$, respectively, which means that the response modules $y_1$ and $y_2$ are regulated by predictors $x_1$ and $x_2$ respectively, while $y_3$ is regulated by both $x_1$ and $x_2$. No predictor is specified for response module $y_4$.
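A minimal sketch of this generation procedure for one predictor module and one response module is given below. It draws the covariance from an Inverse-Wishart distribution by inverting a Wishart draw, samples the module profiles as in Eq. (25), and then adds the dependency of Eq. (26). Filling $A$ with a single constant association level is an assumption, since the text does not specify how the entries of $A$ are arranged; the function names and random seed are also hypothetical.

```python
import numpy as np


def compound_symmetry(p, c):
    """Scale matrix (1 - c) * I + c * J: unit diagonal, off-diagonal c."""
    return (1.0 - c) * np.eye(p) + c * np.ones((p, p))


def sample_inverse_wishart(df, scale, rng):
    """Draw Sigma ~ Inverse-Wishart(df, scale) by inverting a Wishart draw."""
    p = scale.shape[0]
    # Sigma^{-1} ~ Wishart(df, scale^{-1}), built from df Gaussian rows.
    z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(scale), size=df)
    sigma = np.linalg.inv(z.T @ z)
    return (sigma + sigma.T) / 2.0            # remove rounding asymmetry


def simulate_module(n, p, c, rng, df=60):
    """n samples of a p-variable module with a random covariance (Eq. 25)."""
    sigma = sample_inverse_wishart(df, compound_symmetry(p, c), rng)
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)


rng = np.random.default_rng(1)
n, p, q = 200, 25, 30
x1 = simulate_module(n, p, 0.3, rng)          # predictor module x_1
y1 = simulate_module(n, q, 0.3, rng)          # response module before coupling

phi = 0.2                                     # one association level from Phi
A = np.full((p, q), phi)                      # assumed constant-entry matrix
E = rng.standard_normal((n, q))               # noise with mean 0, variance 1
y1_s = y1 + x1 @ A + E                        # Eq. 26 with S_1 = {x_1}
print(x1.shape, y1_s.shape)
```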

In addition, we generated a gene-interaction network $G$ to simulate the biological relatedness between responses and predictors by using two parameters: $p_c$ is the probability of a connection between a predictor and a response that belong to the same epigenetic subnetwork, and $p_{cn}$ is the probability of a connection between a predictor and a response that are not in the same epigenetic subnetwork. We set $p_c = 0.1$ and $p_{cn}=0.05$ such that predictor modules and response modules in the same subnetwork are relatively densely connected, whereas there are fewer links between modules from different epigenetic subnetworks.

Results and discussion

Three datasets were simulated, corresponding to different levels of correlation within response modules. We applied our method to the three datasets, starting with the construction of module eigengenes. The simulated methylation profiles of two predictor modules across 200 samples with correlation signals 0.3 and 0.5 are presented in Supplementary Fig. 1. An intuitive illustration of the eigengene is shown by the black line in the figure; it is highly correlated with the methylation profiles in the module.

Table 1 Simulation results on three datasets by two methods.


Table 2 Breast cancer genes in detected epigenetic subnetworks.


For comparison, we applied the standard regression model without incorporation of prior knowledge to the simulated dataset. Results on three datasets $y_1$, $y_2$ and $y_3$ by two methods are shown in Table 1a, b and c, respectively.

Table 1a shows the result of identification on the first dataset $y_1$, where ‘g-prior’ indicates the result of our method with the incorporation of g-prior and ‘no prior’ indicates the result of the standard regression without prior. A wide range of association strengths between responses and predictors was specified, from 0.03 to 0.3. When $a = 0.03$, a very weak association was specified between the response and predictor. In the case where a single predictor is specified for the response, our method can detect almost all true relationships. When stronger associations $a = 0.05, 0.1, 0.2, 0.3$ are specified, our method identified all the true relationships on $y_1$, $y_2$ and $y_3$. However, the method without the g-prior produced several false positives and did not identify true relationships as accurately as our method. For example, the false positive $x_2$, which was not specified in the relationship, was detected by the standard model for the responses $y_2^2$ and $y_2^3$. Supplementary Fig. 2 shows two examples of fitted regression models constructed by our method.

The simulation analysis demonstrated that our model can identify subnetworks correctly even in the case where a very weak association is specified, while the standard regression model without g-prior resulted in multiple false positives. The g-prior in our method worked as a modifier on the shrinkage incurred on each predictor parameter. A larger value of g-prior corresponds to a smaller shrinkage incurred on the corresponding regression coefficient, making the corresponding variable less likely to be shrunk out of the model. It is worth noting that it only modified the degree of shrinkage of a predictor, but not the correlation between responses or the order in which the predictors are selected by the model.

Case study

Dataset

We collected sample-matched level-3 Illumina 450k methylation data and HiSeq RSEM gene-normalized RNA-seq data of breast cancer from TCGA⁷. We followed the strategy used by Jiao et al.¹⁵ to assign the methylation value to a given gene, which is described in detail in the “Construction of differential networks” section. After data preprocessing, we generated the sample-matched gene expression and DNA methylation profiles in 786 invasive ductal carcinoma tumor samples as well as 84 normal samples.

In addition, TCGA provides the corresponding clinical information including the patient status (alive or dead), the survival days (days to last follow-up or days to death). Such information was also collected to perform the survival analysis.

Protein–protein interaction (PPI) information was used in the inference procedure. A PPI refers to the physical contact of high specificity between two proteins, and it has been studied from multiple perspectives such as molecular dynamics and signal transduction³¹. We downloaded the PPI network from the Protein Interaction Network Analysis (PINA) platform²⁰, which integrates and annotates the data from six public PPI databases (MINT, IntAct, DIP, BioGRID, HPRD, and MIPS/MPact). The network consists of 166,776 edges and 16,182 nodes.

Discovery of predictor and response modules

Differential gene expression and DNA methylation networks were constructed by evaluating the differential co-expression and co-methylation in the PPI network. Two respective similarity matrices were generated by mapping the edge weights in the differential networks to matrix elements, where an element indicates the probability that two genes are involved in the same regulatory pattern, i.e., the same module. Next, SymNMF was performed on these two similarity matrices to discover predictor and response modules. A wide range of candidate values from 5 to 70 for the number of modules $K$ was explored. We expected that an appropriate value of $K$ would yield the largest number of modules with significantly high density. Given a candidate value for $K$, density scores were calculated for the detected modules, and the statistical significance of the module density was evaluated by a significance test. Figure 2 shows the number of predictor and response modules with significant density with respect to the parameter $K$. We observed that as $K$ increases, the number of significant predictor and response modules rises to a maximum and then decreases. The maximum numbers of significant predictor and response modules were detected when $k_p$ and $k_r$ were set to 50 and 49, respectively. When $K$ exceeded the optimal value, the number of significant modules no longer grew with $K$. This indicated that when $K$ is greater than the optimal value, dissimilar genes were grouped into more non-correlated modules. Finally, 21 significant predictor modules and 39 significant response modules were detected with adjusted p-values less than 0.05.

Module quality measures

Density-based measure

As discussed, we employed the module density to select the significant modules that remain densely connected in the differential networks. Significance levels of density statistics were measured by a permutation test. The result of the permutation test is shown in Fig. 3, which presents the density of the observed modules as well as the distribution of the densities of 1000 random modules. From Fig. 3, we can see that the density scores for detected modules are significantly higher than random scores.
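The permutation test can be sketched as follows, under the assumption that the density of a module is the mean edge weight over all gene pairs in the module and that random modules are gene sets of the same size drawn without replacement. Both assumptions, the function names and the toy network are illustrative and do not reproduce the exact density statistic used in the analysis.

```python
import itertools
import random


def module_density(weights, members):
    """Mean weight over all unordered pairs of genes in the module."""
    pairs = list(itertools.combinations(members, 2))
    return sum(weights[a][b] for a, b in pairs) / len(pairs)


def density_p_value(weights, members, n_perm=1000, seed=0):
    """Share of random same-size gene sets at least as dense as the module."""
    rng = random.Random(seed)
    genes = range(len(weights))
    observed = module_density(weights, members)
    hits = sum(
        module_density(weights, rng.sample(genes, len(members))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)          # add-one correction


# Toy weighted network: 30 genes, the first five form a dense module.
n_genes, module = 30, [0, 1, 2, 3, 4]
weights = [
    [0.9 if (i in module and j in module) else 0.1 for j in range(n_genes)]
    for i in range(n_genes)
]

p_value = density_p_value(weights, module)
print(module_density(weights, module), p_value)
```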

Separability-based measure

Next we evaluated the separability of the identified modules to test whether modules remain distinct from one another. Separability scores and the corresponding p-values were calculated to evaluate the significance of the separability for each pair of identified modules. With a p-value threshold of 0.05, we observed that all pairs of predictors and responses are significantly separable. The p-values of the separability scores for both predictor and response modules are provided in Supplementary Appendix A. Two heatmaps (Supplementary Fig. 3, generated by ggplot2³² with R³³) show the separability and the density scores between each pair of modules, where the off-diagonal blocks represent the separability scores and the diagonal blocks represent module density. Evaluations of the density and separability revealed that the modules are well defined and that genes within a module remain densely connected as well as distinct from other modules.

Other measures

We calculated $varExplained$, the proportion of the variance explained by module eigengenes, to check whether the module profile is well represented by the eigengene. Supplementary Fig. 4 shows the boxplots of $varExplained$ for predictor and response modules. The median values of $varExplained$ for predictors and responses were 0.82 and 0.80, respectively, which indicated that the eigengene can represent a large proportion of the variance of the module profile.
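As an illustration of this measure, the sketch below takes the eigengene to be the first principal component of the standardized module profile and reports the share of total variance carried by that component. This definition, the function name and the simulated module are assumptions made for the example.

```python
import numpy as np


def eigengene_and_var_explained(profile):
    """First principal component of a samples x genes matrix and its variance share."""
    z = (profile - profile.mean(axis=0)) / profile.std(axis=0)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    eigengene = u[:, 0] * s[0]                 # sample scores on the first component
    return eigengene, float(s[0] ** 2 / np.sum(s ** 2))


rng = np.random.default_rng(2)
shared = rng.standard_normal((200, 1))                   # common module signal
profile = shared + 0.5 * rng.standard_normal((200, 25))  # 25 hypothetical genes

eigengene, var_explained = eigengene_and_var_explained(profile)
print(round(var_explained, 2))
```

The sign of a principal component is arbitrary, so the eigengene may need to be flipped to align with the average module profile.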

In addition, we evaluated whether the detected predictor and response modules are correlated with patient survival time. We selected the uncensored tumor samples, i.e., patients with known death time, to measure the correlation between the module eigengene and the survival time of patients. The Pearson correlation coefficients and the corresponding significance levels by the permutation with z-test were calculated. The modules with p-values less than 0.05 are considered to be associated with patient survival time. We found that 13 out of 39 response modules are significantly correlated with patient survival time, while no significant correlations between predictors and survival time were found. Supplementary Fig. 5 shows scatterplots between eigengenes and patient survival time for the 13 response modules.

Discovery of epigenetic subnetworks

The Pearson correlation between the profiles of DNA methylation and gene expression within each detected subnetwork was calculated to evaluate the performance. The p-value of the correlation coefficient, adjusted by a permutation test, was obtained for each detected subnetwork. To evaluate the overall performance, the p-values of all subnetworks were combined into one meta p-value using Fisher’s combined probability test. The weight parameter $\mu$ on the g-prior was set to control the relative influence of prior biological relatedness on the discovery of epigenetic subnetworks. We measured the sensitivity of our method to the weight $\mu$ over a wide range of candidates $[0, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10]$. Different sets of epigenetic subnetworks were detected and Fisher’s meta-analyzed p-values were obtained for the different values of the parameter $\mu$.
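Fisher’s combined probability test can be written in a few lines. The sketch below combines $k$ independent p-values through the statistic $-2\sum \ln p$, which follows a chi-square distribution with $2k$ degrees of freedom under the null hypothesis, and evaluates the tail probability with the closed form available for even degrees of freedom. The function name and input values are hypothetical, and all p-values are assumed to be strictly positive.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k d.o.f."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function for even d.o.f. 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


stat, meta_p = fisher_combined_p([0.05, 0.05])   # hypothetical subnetwork p-values
print(stat, meta_p)
```

The negative logarithm of the resulting meta p-value is the quantity that would be plotted against each candidate weight $\mu$.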

Figure 4 shows the negative logarithm of Fisher’s meta-analyzed p-value for each $\mu$. When $\mu =0$, no prior information was incorporated. As the value of $\mu$ increases, the performance increases to a certain point and then decreases. The best performance was obtained when $\mu =1$; therefore, we selected the value of 1, which leads to the most significant correlation within subnetworks, as the optimal weight on the prior. We noticed that when $\mu =0$, not all detected subnetworks show a significant correlation, which indicated that the incorporation of the g-prior contributed to the discovery of significant epigenetic subnetworks.

Table 3 Regression results.


Table 4 The result of pathway enrichment tests in detected epigenetic subnetworks.


For each response module, the best subset of predictors was selected based on BIC. We assessed the significance level of the regression coefficients in each detected subnetwork to evaluate whether the slope of the regression line differs significantly from zero. Table 3 shows the detailed regression coefficients and the corresponding significance level of each model. Except for the response modules y18 and y21, all models show a significant relationship between the predictor and the response. Thus, we removed these two subnetworks, and 37 epigenetic subnetworks were finally kept.

We calculated the confidence score (Table 3) of each selected predictor $x_i$ for response $y_j$, which measures the proportion of variance explained by $x_i$ and the confidence in being a true regulation.

Follow up analysis

Pathway enrichment test and network analysis

To determine the biological functional relevance of the detected epigenetic subnetworks, we performed the pathway enrichment test using reference pathways downloaded from MSigDB³⁴, including KEGG³⁵, Reactome³⁶, Biocarta³⁷, GO³⁸ and Canonical pathways (CP). A subnetwork is considered to be enriched in a reference pathway if a p-value < 0.05 is obtained by the hypergeometric test after correction. First, we examined the functional homogeneity of the detected subnetworks. A set of genes is defined as functionally homogeneous if the genes are enriched in at least one GO category³⁸. We found that all detected subnetworks exhibit significant functional homogeneity, since they are all enriched in at least one reference set in GO. Table 4 shows the ratio of enriched subnetworks in each database. All detected subnetworks are enriched in at least one reference pathway from Biocarta and CP, and 35 out of 39 subnetworks (90%) and 28 out of 39 subnetworks (72%) were enriched in KEGG and Reactome pathways, respectively. In addition, we evaluated the proportion of reference sets enriched for epigenetic subnetworks (Table 4) and found that 42.3%, 41.9%, 50.9%, 43.5% and 49.8% of reference sets in GO, KEGG, CP, Reactome and Biocarta were enriched for detected subnetworks, respectively. The results revealed that the detected epigenetic subnetworks are of great biological relevance.
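The hypergeometric enrichment test for a single subnetwork and a single reference pathway can be sketched as follows. The p-value is the probability of observing at least the given overlap when the subnetwork genes are drawn at random from the gene universe. The function name and the gene counts are hypothetical, and the multiple-testing correction mentioned above is not included.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_subnetwork, n_overlap):
    """P(overlap >= n_overlap) when n_subnetwork genes are drawn without replacement."""
    upper = min(n_pathway, n_subnetwork)
    total = comb(n_universe, n_subnetwork)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_subnetwork - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, a 100-gene pathway,
# a 50-gene subnetwork and 5 genes shared between them.
p_enrich = hypergeom_enrichment_p(20000, 100, 50, 5)
print(p_enrich)
```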

Next we asked whether the detected subnetworks were related to cancer, especially breast cancer. We examined whether the genes in the detected epigenetic subnetworks are cancer-related biomarkers. We collected 2027 cancer genes from the allOnco database (http://www.bushmanlab.org/links/genelists), and 738 breast cancer driver genes from intogen³⁹ and OncoSearch⁴⁰. On average, 20% of genes in the detected subnetworks were cancer genes and 9% were breast cancer genes. Table 2 shows the breast cancer genes in the detected epigenetic subnetworks, where the third column ‘ratio’ represents the ratio between the number of cancer genes and the module size. We found that, except for subnetwork 4, there is at least one breast cancer gene in each detected subnetwork, which reveals that the epigenetic subnetworks are related to breast cancer. In addition, multiple important breast cancer genes were detected in the epigenetic subnetworks, such as ERBB2 in subnetwork 19, a known proto-oncogene that encodes HER2, a member of the human epidermal growth factor receptor family. Genes TP53BP1 and TP53BP2 were also detected and encode a member of the ASPP (apoptosis-stimulating protein of p53) family of tumor suppressor p53 interacting proteins.

We took epigenetic subnetwork 16 as an example and analyzed it in detail. Subnetwork 16 contained 20 cancer genes and 9 breast cancer genes (CDK2, PRLR, CDH1, ERBB3, TP53BP2, SRC, MBIP, KDM1A and SERPINE1), and it was enriched in 12 KEGG pathways, including two pathways that are specific to breast cancer: the KEGG cell cycle and KEGG P53 signalling pathways. Figure 5 shows the network representation of subnetwork 16, including genes involved in KEGG pathways and genes showing correlations larger than 0.3. Genes acting as predictors are drawn as circles and responses as squares. Multiple epigenetic mechanisms were detected between predictors and responses. The mechanism between SFN and CDK2 in subnetwork 16 is supported by observations that SFN is a frequently hypermethylated gene⁴¹,⁴² emerging as a new inhibitor of CDK2 in breast cancer cells⁴³. In addition, SFN has an important function in preventing breast tumor cell growth⁴³, which suggests that SFN may have therapeutic potential in cancer prevention by targeting epigenetic machinery. We also observed that CCNA1 was detected as an epigenetic regulator in Fig. 5. Evidence in the literature showed that the differential methylation pattern of CCNA1 was associated with the treatment response in breast cancer and could potentially be a predictive marker of anthracycline/mitomycine sensitivity⁴⁴. Moreover, multiple studies demonstrated that UHRF1, by interacting with various proteins in multiple pathways, results in the silencing of key tumor suppressor genes in breast cancer³⁴,⁴⁵. In Fig. 5, we observed that the methylation pattern of UHRF1 was highly correlated with the expression of multiple breast cancer genes including ERBB3, TP53BP2 and PRLR. In addition, genes like PLCG1 and PTPN6 in subnetwork 16 were also likely to be epigenetic regulators, which is supported by several studies⁴⁶,⁴⁷. Overall, these findings support the idea that our method successfully detects epigenetic subnetworks containing verified epigenetic mechanisms, and the detected subnetworks could be a starting point to uncover the underlying epigenetic mechanisms.

Survival analysis

We hypothesized that the profiles of gene expression or DNA methylation in detected modules and subnetworks might be effective prognostic parameters associated with survival time. As introduced in Methods, we derived the prognostic index score for each patient based on the module profiles. The patients were divided into high-risk and low-risk groups, and we performed the log-rank test to assess whether the survival times in the two groups are significantly different. First, the survival analysis was performed on predictor and response modules. The results showed that 8 of 39 response modules (Fig. 6) can divide patients into two groups in which the survival times of high-risk and low-risk patients are significantly different. However, no predictor module yielded groups with significantly different survival times. Next, the multivariate Cox proportional hazards regression was performed on epigenetic subnetworks, and we detected that 11 of 37 subnetworks (Fig. 7) were significantly associated with survival time. In addition to the 8 significant response modules, 3 more responses in the subnetworks with the incorporation of DNA methylation predictors (subnetworks 1, 6, 36) showed a significant association with survival time, which indicated that the combinations of DNA methylation predictors and responses in these 3 subnetworks improve the classification of patients. This suggests that the predictors and responses in these 3 subnetworks jointly affect survival time.

Performance comparison

Ma et al.¹⁶ detected 26 epigenetic modules by EMDN using TCGA breast cancer data and calculated the ratio of enriched modules as well as the ratio of enriched reference pathways. EMDN was compared with two other methods, EpiMod and FEM, and the results detected by EMDN were shown to be more enriched than those achieved by EpiMod¹⁴ and FEM¹⁵. Since the breast cancer samples used for EMDN, FEM and EpiMod¹⁶ were identical to the data in our paper, we can directly compare our results with the epigenetic subnetworks detected by EMDN. About 40% to 50% of the subnetworks detected by EMDN, EpiMod and FEM were enriched in at least one reference set in GO, KEGG, CP, Reactome and Biocarta, which is much lower than the ratios achieved by our method. In our method, all detected subnetworks were enriched in GO, CP and Biocarta, and 89.7% and 79.8% of subnetworks were enriched in KEGG and Reactome, respectively. However, one should note that EMDN did not take protein interactions into account, while EpiMod and FEM employed the PPI network; thus we conclude that incorporating the biological interaction network may contribute to the discovery of biologically relevant epigenetic subnetworks. The comparison revealed that our framework, which incorporates PPI networks, can detect more enriched subnetworks than EMDN, EpiMod and FEM.

Discussion

Recent technology developments have enabled simultaneous genomic profiling of biological samples on multiple platforms, resulting in genome-wide DNA methylation and gene expression data. However, a systematic analysis of the two types of data for discovering biologically relevant combinatorial patterns is currently lacking. In this paper, we present a method to evaluate the association between gene expression and DNA methylation at the module level by Bayesian regression with the incorporation of prior gene interaction knowledge. We first identified gene expression responses and DNA methylation predictors on weighted differential expression and methylation networks, respectively. Through a significance test, modules passing a p-value threshold were considered as predictors or responses. Density-based and separability-based measures in the significance test were used to validate whether the detected modules are densely connected and well separated from others. The results showed that the detected modules are well defined and that genes within a module show homogeneity and separability. Then we used the eigengene as the representative of each module profile, since it accounts for a large proportion of the variance of the module profile. With the incorporation of prior gene interaction networks as g-prior, we performed Bayesian regression to discover the dependency between predictors and responses, i.e., the best subset of predictors was selected for each response. The application to breast cancer data demonstrated the superior performance of our method in detecting biologically relevant epigenetic subnetworks.

Overall, our contributions lie in the following aspects:

(1)
We proposed a novel method to detect epigenetic subnetworks by considering a set of highly correlated genes showing a pattern of differential co-expression/methylation instead of considering a single gene as a predictor or response. Compared with EMDN, EpiMod and FEM, which measure the association between gene expression and DNA methylation at the individual gene level, our detected epigenetic subnetworks were much more enriched in biological processes and signalling pathways, which indicates that evaluating the association between gene expression and DNA methylation at the module level would increase the biological association and shed light on the underlying mechanism. Furthermore, our method achieved a larger ratio of enriched subnetworks than EMDN. This higher enrichment ratio is partially due to the construction of significant differential networks with the incorporation of gene interaction information to reduce false positives. The incorporation of biological interaction networks may contribute to the discovery of enriched epigenetic subnetworks; however, it could filter out important cancer genes that are not included in the prior network. Therefore, there remains a trade-off between filtering out false positives and discovering novel cancer mechanisms, which could be a direction for future research.
(2)
By incorporating prior biological knowledge as g-prior in a Bayesian regression model, our method detected more significantly correlated epigenetic subnetworks than the alternative model without g-prior, which showed that encoding biological network information as g-prior effectively guided the selection of epigenetic subnetworks. It is possible to introduce other sources of prior information, such as regulatory interactions derived from the literature.
(3)
The network analysis of the detected epigenetic subnetworks revealed direct causal mechanisms verified in other scientific papers, which indicated the ability of our method to detect true epigenetic mechanisms and that the detected epigenetic subnetworks could be a good starting point to uncover underlying epigenetic mechanisms. Moreover, the survival analysis of the detected modules and epigenetic subnetworks indicated that the derived modules might be effective prognostic factors associated with patients’ survival time.

References

Suzuki, M. M. & Bird, A. Dna methylation landscapes: Provocative insights from epigenomics. Nat. Rev. Genet. 9, 465 (2008).
Article CAS PubMed Google Scholar
Lahtz, C. & Pfeifer, G. P. Epigenetic changes of dna repair genes in cancer. J. Mol. Cell Biol. 3, 51–58 (2011).
Article CAS PubMed PubMed Central Google Scholar
Bernstein, C., Nfonsam, V., Prasad, A. R. & Bernstein, H. Epigenetic field defects in progression to cancer. World J. Gastrointest. Oncol. 5, 43 (2013).
Article PubMed PubMed Central Google Scholar
Heerboth, S. et al. Use of epigenetic drugs in disease: An overview. Genet. Epigenet. 6, S12270 (2014).
Article Google Scholar
Hu, M. et al. Distinct epigenetic changes in the stromal cells of breast cancers. Nat. Genet. 37, 899 (2005).
Article CAS PubMed Google Scholar
Ushijima, T. Detection and interpretation of altered methylation patterns in cancer cells. Nat. Rev. Cancer 5, 223 (2005).
Article CAS PubMed Google Scholar
Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113 (2013).
Article PubMed PubMed Central Google Scholar
Hinoue, T. et al. Genome-scale analysis of aberrant dna methylation in colorectal cancer. Genome Res. 22, 271–282 (2012).
Article CAS PubMed PubMed Central Google Scholar
Varley, K. E. et al. Dynamic dna methylation across diverse human cell lines and tissues. Genome Res. 23, 555–567 (2013).
Article CAS PubMed PubMed Central Google Scholar
Gevaert, O., Tibshirani, R. & Plevritis, S. K. Pancancer analysis of dna methylation-driven genes using methylmix. Genome Biol. 16, 17 (2015).
Article PubMed PubMed Central Google Scholar
Konno, M. et al. Computational trans-omics approach characterised methylomic and transcriptomic involvements and identified novel therapeutic targets for chemoresistance in gastrointestinal cancer stem cells. Sci. Rep. 8, 899 (2018).
Article ADS PubMed PubMed Central Google Scholar
Costa, R. L., Boroni, M. & Soares, M. A. Distinct co-expression networks using multi-omic data reveal novel interventional targets in hpv-positive and negative head-and-neck squamous cell cancer. Sci. Rep. 8, 15254 (2018).
Article ADS PubMed PubMed Central Google Scholar
Dabrowski, M. J. et al. Unveiling new interdependencies between significant dna methylation sites, gene expression profiles and glioma patients survival. Sci. Rep. 8, 4390 (2018).
Article ADS PubMed PubMed Central Google Scholar
West, J., Beck, S., Wang, X. & Teschendorff, A. E. An integrative network algorithm identifies age-associated differential methylation interactome hotspots targeting stem-cell differentiation pathways. Sci. Rep. 3, 1630 (2013).
Article ADS PubMed PubMed Central Google Scholar
Jiao, Y., Widschwendter, M. & Teschendorff, A. E. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics 30, 2360–2366 (2014).
Article CAS PubMed Google Scholar
Ma, X., Liu, Z., Zhang, Z., Huang, X. & Tang, W. Multiple network algorithm for epigenetic modules via the integration of genome-wide dna methylation and gene expression data. BMC Bioinform. 18, 72 (2017).
Article Google Scholar
Wang, Z., Curry, E. & Montana, G. Network-guided regression for detecting associations between DNA methylation and gene expression. Bioinformatics 30, 2693–2701 (2014).
Wang, X., Dalkic, E., Wu, M. & Chan, C. Gene module level analysis: Identification to networks and dynamics. Curr. Opin. Biotechnol. 19, 482–491 (2008).
Lareau, C. A., White, B. C., Oberg, A. L. & McKinney, B. A. Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure. BioData Mining 8, 5 (2015).
Wu, J. et al. Integrated network analysis platform for protein–protein interactions. Nat. Methods 6, 75 (2009).
Kuang, D., Ding, C. & Park, H. Symmetric nonnegative matrix factorization for graph clustering. In Proc. 2012 SIAM International Conference on Data Mining, 106–117 (SIAM, 2012).
Zhang, S. et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 40, 9379–9391 (2012).
Liu, G., Wong, L. & Chua, H. N. Complex discovery from weighted PPI networks. Bioinformatics 25, 1891–1897 (2009).
Langfelder, P., Luo, R., Oldham, M. C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011).
Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 9, 559 (2008).
Foroushani, A. et al. Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: An introduction to the Pigengene package and its applications. BMC Med. Genomics 10, 16 (2017).
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. 97, 10101–10106 (2000).
Horvath, S. & Dong, J. Geometric interpretation of gene coexpression network analysis. PLoS Comput. Biol. 4, e1000117 (2008).
Zellner, A. On Assessing Prior Distributions and Bayesian Regression Analysis with g-prior Distributions (Bayesian Inference and Decision Techniques) (1986).
Breslow, N. E. Analysis of survival data under the proportional hazards model. Int. Stat. Rev. 43, 45–57 (1975).
De Las Rivas, J. & Fontanillo, C. Protein–protein interactions essentials: Key concepts to building and analyzing interactome networks. PLoS Comput. Biol. 6, e1000807 (2010).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, 2017).
Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102, 15545–15550 (2005).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2016).
Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2013).
Nishimura, D. BioCarta. Biotechnol. Softw. Internet Rep. 2, 117–120 (2001).
Ashburner, M. et al. Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25 (2000).
Rubio-Perez, C. et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell 27, 382–396 (2015).
Lee, H., Dang, T., Lee, H. & Park, J. C. OncoSearch: Cancer gene search engine with literature evidence. Nucleic Acids Res. 42, W416–W421 (2014).
Jovanovic, J., Rønneberg, J. A., Tost, J. & Kristensen, V. The epigenetics of breast cancer. Mol. Oncol. 4, 242–254 (2010).
Kar, S. et al. Expression profiling of DNA methylation-mediated epigenetic gene-silencing factors in breast cancer. Clin. Epigenet. 6, 20 (2014).
Laronga, C., Yang, H.-Y., Neal, C. & Lee, M.-H. Association of the cyclin-dependent kinases and 14-3-3 sigma negatively regulates cell cycle progression. J. Biol. Chem. 275, 23106–23112 (2000).
Klajic, J. et al. DNA methylation status of key cell-cycle regulators such as CDKNA2/p16 and CCNA1 correlates with treatment response to doxorubicin and 5-fluorouracil in locally advanced breast tumors. Clin. Cancer Res. 20, 6357–6366 (2014).
Sidhu, H. & Capalash, N. UHRF1: The key regulator of epigenetics and molecular target for cancer therapeutics. Tumor Biol. 39, 1010428317692205 (2017).
Liu, C. et al. Novel sorafenib analogues induce apoptosis through SHP-1 dependent STAT3 inactivation in human breast cancer cells. Breast Cancer Res. 15, 3254 (2013).
Medina-Aguilar, R. et al. DNA methylation data for identification of epigenetic targets of resveratrol in triple negative breast cancer cells. Data Brief 11, 169–182 (2017).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Syst. 1695, 1–9 (2006).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).

Funding

The funding was provided by Natural Sciences and Engineering Research Council of Canada (Project No. RES0048688).

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, T6G 1H9, Canada
Anqi Jing & Han Jie

Authors

Anqi Jing
Han Jie

Contributions

A.J. designed the framework and implemented the method, collected the results, and wrote the manuscript. J.H. revised the manuscript. Both authors read and approved the final manuscript.

Corresponding authors

Correspondence to Anqi Jing or Han Jie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Supplementary Figures.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jing, A., Han, J. Inference of epigenetic subnetworks by Bayesian regression with the incorporation of prior information. Sci Rep 12, 20224 (2022). https://doi.org/10.1038/s41598-022-19879-x

Received: 17 November 2020
Accepted: 06 September 2022
Published: 23 November 2022
Version of record: 23 November 2022
DOI: https://doi.org/10.1038/s41598-022-19879-x
