Introduction

It is estimated that only about 18% of individual yeast genes are essential in normal growth conditions in rich media, meaning that a deletion mutant with that gene deleted or suppressed is not viable1. For genes that are not essential, the S. cerevisiae genetic interaction network comes from high-throughput epistasis experiments, where edge weights represent the surprise in growth rates from double deletion mutants, compared to their associated single deletion mutants. In particular, a negative weight edge indicates that there is a growth defect (or in worst case synthetic lethality, meaning the double deletion mutant is not viable), and a positive weight edge indicates that the double deletion mutant is not sicker than the constituent single deletion mutants (or in best case, synthetic rescue, where although single deletion mutants display reduced growth, the double deletion mutant behaves like wild type)2. The pattern and organization of this signed, weighted genetic interaction network has been shown3 to contain interesting motifs that can indicate redundancy and the presence of compensatory pathways. These alternative pathways can be mechanisms of global resilience, where the cell can still accomplish an essential function through a different pathway if one is not functioning.

The genetic interaction network can be clustered into groups that broadly recapture gene function, searched for genes that are globally highly pleiotropic, or, as is the main focus of the current work, mined for subgraphs that witness mechanisms of compensation and redundant pathways. Such subgraphs are generalizations of the so-called Between Pathway Model (BPM), first defined for the unweighted synthetic lethality data by Kelley and Ideker3.

Kelley and Ideker’s BPM is a network motif found in the superimposition of synthetic lethality genetic interaction (GI) edges, and physical protein-protein interaction (PI) edges. Consider a model consisting of a pair of protein pathways where each pathway serves as a redundant backup for the other. Within each pathway, there will be many physical interactions between nodes (protein-protein binding, direct transcriptional regulation, etc.), reflecting each pathway’s existence as a coherent functional unit. Synthetic-lethality interactions, on the other hand, will be few or nonexistent within each pathway, since the other pathway provides a failsafe mechanism for its partner. Between the two pathways, there will be more observed synthetic-lethality interactions: if corresponding components are deleted or suppressed in both pathways at once, the fault-tolerance of the system is defeated, and the strain dies. A network motif corresponding to this situation, in which two groups of genes, each group found to be edge-dense within the PI network, are connected by many synthetic-lethality edges in the GI network, defines the BPM (see Fig. 1: panel A). In Brady et al.4, we observed that the dense bipartite subgraphs that comprised the synthetic lethality edges in BPMs were analogous to 2-edge cuts, and used the theory of max cut problems to design algorithms to find putative BPM subgraphs in this setting; using the location of the known physical interaction edges as validation4. Thus our method required only the GI network as input, and found BPMs without considering the PI network. Other early papers also considered the task of finding BPMs in this unweighted setting5,6.

Advances in the technology led to high-throughput experiments using E-MAP7 and SGA8 technologies that were able to give a scalar weight to the genetic interaction edges (instead of just the binary viable/non-viable). The weights on genetic interaction edges were calculated from a null model of the expected sickness of the double mutant as compared with its constituent single deletion mutants. The earliest high-throughput experiments used a null model based on the fitness of the sicker of the single deletion mutants8. We created the LocalCut algorithm9,10 that generalized the notion of BPM to this setting, in the natural way (see the definition of gBPM below).

However, there is still an open question as to how best to calculate the weights on the GI edges, based on the measured sickness and wellness of the double deletion mutant as compared to its constituent single deletion mutants. A study by Mani et al.11 showed that a log or multiplicative model behaved in a more informative manner than a model that considered the minimum sickness, particularly for uncovering epistatic relationships including genes whose single deletion mutant was highly deleterious11 (whereas for low-magnitude single deletion mutants, the models were largely equivalent). Thus most subsequent experiments12,13 moved to a multiplicative model for determining epistasis weights. However, the objective of the work of Mani et al.11 was to determine, in isolation, which pairs of genes showed epistasis, rather than to search for sets of genes that comprised generalized BPMs. There has not been a systematic study of how these weighting schemes influence the computational search for generalized BPMs. This is the subject of the present paper. We find that regardless of whether a minimum, multiplicative or logarithmic epistasis base model is employed, squaring the edge weights (to suppress weak weights relative to strong ones and so denoise the signal, as recommended in Gallant et al.10) improves discovery of generalized BPMs. We also find that some interesting generalized BPMs are uniquely discovered by each weighting scheme.

Fig. 1

(A) Example BPM graph motif in the original Kelley and Ideker (unweighted) setting. Here nodes a-e form compensatory pathway 1 and f-h compensatory pathway 2; completing either pathway 1 or 2 is necessary for the cell to be viable. Solid lines are physical interaction edges, and dashed lines are synthetic lethality edges. Note that pathway 1 also has internal redundancy (only one of paralogs b and c is necessary to complete pathway 1) so that interactions between b or c and the nodes of pathway 2 are not lethal. We have colored the true synthetic lethality relation from g to e red to represent that perhaps this pair wasn’t tested, so that edge is missing from our data. (B) In the weighted setting, we consider genetic interaction edges only. Edges have both positive (purple) and negative (black) genetic interaction weights, with the strength of the line indicating the relative magnitude of the edge weight. Within each pathway we have a near clique of low-weight positive interaction edges. The red edge indicates that the edge from e to c is erroneously reported as missing because it was not tested or fell below the noise threshold. In this setting, the pattern of negative interaction edges shows that deleting a, d, or e together with a pathway 2 gene gives a synthetic growth defect (negative weight of smaller magnitude than the lethality edges in A), but deleting b or c and a pathway 2 gene gives growth similar to wildtype.

Related work

The LocalCut algorithm of Leiserson et al.9 was shown to outperform a similar algorithm of Kelley and Kingsford14. Most recently, a new ILP method of Liany et al.15 was shown to outperform LocalCut when searching for weighted BPMs in two human cancer synthetic lethality networks. However, since their method uses a definition of a high scoring weighted BPM similar to that of LocalCut, improving only the search heuristic, any improvement in edge weighting for LocalCut should also transfer to their method. Other recent related work (reviewed in Wang et al.16) is not directly comparable, because it primarily focuses on predicting individual synthetic lethality interactions by integrating other sources of biological data such as physical protein-protein interaction networks, co-expression data, mutual exclusivity in cancer, or metabolomics data17,18,19. It is worth mentioning in particular the work of Amar et al.20, which uses combined physical PPI and genetic interaction data to search for network modules, a more general concept than BPMs.

Background

gBPM collections

In the unweighted case, the GI BPMs corresponded to a set of genes, partitioned into two pathways (A, B), where the majority of GI (in this case synthetic lethality) edges fell between A and B, with one endpoint in each (hence the name “Between Pathway Model”). The natural generalization to weighted genetic interaction networks instead searches for pairs of sets of genes, or pathways, in which there are predominantly large negative edge weights on the inter-pathway edges, and predominantly positive edge weights on the intra-pathway edges (Fig. 1: panel B). We call these pairs of gene sets gBPMs, for generalized between pathway models. More formally, we introduce the following definition:

Definition 1

Let G be a graph with positive and negative real edge weights. A pair of disjoint subsets \(A, B \subseteq V(G)\) is a (k, W)-gBPM if \(3 \le |A|, |B| \le k\), and \(M = \sum _{x \in A, y \in B} w(x,y) - \sum _{x \in A, y \in A} w(x,y) - \sum _{x \in B, y \in B} w(x,y) < -W.\) We refer to the positive number \(-M\) as the score of the gBPM.

In this definition, A and B are the putative compensatory pathways, and edges between genes in opposite pathways should have large negative weights, while edges in the same pathway should have positive weights. As noted in past studies, there is an asymmetry between the positive and negative edge weights, where the negative edge weights are typically of greater magnitude than the positive edge weights. This does not affect the definition, of course, but it does mean that the between pathway edges tend to be more important than the within pathway edges when a pair (A, B) qualifies to be labeled a gBPM.
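As a rough illustration of how such a score can be computed, the sketch below evaluates a candidate pair of pathways on a toy network. It takes the score to be the total within-pathway weight minus the total between-pathway weight, so that positive within-pathway edges and negative between-pathway edges both raise it, and it counts each unordered gene pair once. Both conventions, the treatment of untested pairs as weight 0, and the gene names are assumptions made for this example, not details of LocalCut.

```python
from itertools import combinations


def gbpm_score(weights, A, B):
    """Within-pathway weight minus between-pathway weight for a candidate (A, B).

    `weights` maps frozenset({x, y}) to a signed interaction weight. Pairs that
    are absent from the map count as 0 (an assumption of this sketch).
    """
    def w(x, y):
        return weights.get(frozenset((x, y)), 0.0)

    within = sum(w(x, y) for x, y in combinations(sorted(A), 2))
    within += sum(w(x, y) for x, y in combinations(sorted(B), 2))
    between = sum(w(x, y) for x in A for y in B)
    return within - between


def is_gbpm(weights, A, B, k, W):
    """True if A and B are disjoint, each has 3..k genes, and the score exceeds W."""
    A, B = set(A), set(B)
    sizes_ok = 3 <= len(A) <= k and 3 <= len(B) <= k
    return sizes_ok and not (A & B) and gbpm_score(weights, A, B) > W


# Toy network with hypothetical gene names: weak positive weights inside each
# pathway, strong negative weights between them.
A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}
toy = {frozenset(p): 0.5 for S in (A, B) for p in combinations(sorted(S), 2)}
toy.update({frozenset((x, y)): -2.0 for x in A for y in B})

print(gbpm_score(toy, A, B))  # 6 * 0.5 - 9 * (-2.0) = 21.0
```

In this toy case the nine between-pathway edges contribute 18 of the 21 points, which mirrors the asymmetry noted above: the larger-magnitude negative edges dominate the score.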

Note, however, that whenever (A, B) is a gBPM, many pairs of gene sets that overlap heavily with (A, B) will also be gBPMs. For each specific gene g, we can ask if there exists a (k, W)-gBPM that includes it. It is also interesting to construct collections of “unique” gBPMs, filtered by a threshold on how similar the pairs (A, B) and (C, D) can be for them both to be placed in the same collection (we follow past work, where the Jaccard index21 was used to set this threshold).

Definition 2

A collection \({\mathcal {C}}\) of gBPMs is j-filtered if, whenever \((A_s,B_s)\) and \((A_t,B_t)\) are in \({\mathcal {C}}\) with \(s \ne t\), we have \(\mathrm{Jaccard}(A_s \cup B_s, A_t \cup B_t) \le j\).
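One simple way to obtain a j-filtered collection from a list of candidates is a greedy pass, sketched below. Visiting candidates from highest to lowest score is an assumption of this example and is not claimed to be the order LocalCut uses; the gene names and scores are hypothetical.

```python
def jaccard(s, t):
    """Jaccard index |s ∩ t| / |s ∪ t| of two gene sets."""
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def j_filter(gbpms, j=0.66):
    """Greedily build a j-filtered collection from (A, B, score) tuples.

    Candidates are visited from highest to lowest score; one is kept only if
    its gene set A ∪ B has Jaccard index at most j with every kept gBPM.
    """
    kept = []
    for A, B, score in sorted(gbpms, key=lambda g: g[2], reverse=True):
        genes = set(A) | set(B)
        if all(jaccard(genes, set(A2) | set(B2)) <= j for A2, B2, _ in kept):
            kept.append((A, B, score))
    return kept


g1 = ({"a", "b", "c"}, {"d", "e", "f"}, 10.0)
g2 = ({"a", "b", "c"}, {"d", "e", "g"}, 8.0)   # shares 5 of 7 genes with g1
g3 = ({"a", "b", "c"}, {"x", "y", "z"}, 5.0)   # shares 3 of 9 genes with g1
filtered = j_filter([g3, g2, g1], j=0.66)
print([score for _, _, score in filtered])  # [10.0, 5.0]
```

Here g2 is dropped because its Jaccard index with g1 is 5/7 (about 0.71), above the threshold, while g3 is kept at 3/9.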

Definition 3

For collection \({\mathcal {C}}\) of gBPMs, \((A_i, B_i)\), with scores \(M_i\), Score \(({\mathcal {C}}) = \sum _{i \in {\mathcal {C}}} M_i\).

We note that we can also search for j-filtered collections of gBPMs in the unweighted setting by simply including edges of weight −1 for each synthetic lethal interaction.

The LocalCut method searches for j-filtered gBPMs9,10 in a weighted GI network, and was shown to find many meaningful compensatory pathways, including some involved in DNA repair, using the data of Tong et al.8. However, since that work, much more comprehensive collections of yeast epistasis data have been generated. In this work, we look at how the schemes for generating edge weights (log, multiplicative or min) affect the number and quality of gBPMs found by LocalCut in this new data.

Methods

Weighting schemes

We test the following alternative weighting schemes. Let \(S_a\) and \(S_b\) denote the fitness scores for single deletion mutants of genes a and b respectively. Let \(D_{a, b}\) denote the fitness score for the double deletion mutant of genes a and b. Let \(W(a, b)\) be the weight given to the edge between genes a and b.

Definition 4

Minimum weighting. \(W(a, b) = D_{a, b} - \min (S_a, S_b)\)

Definition 5

Multiplicative weighting. \(W(a, b) = D_{a, b} - S_a \cdot S_b\)

Definition 6

Logarithmic weighting. \(W(a, b) = \log _2(D_{a, b}) - (\log _2(S_a) + \log _2(S_b))\)

In the remainder of the paper, we abbreviate these weighting schemes as “min”, “mult” and “log” weighting, respectively.
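The three schemes can be written out directly, as in the sketch below. The function names and the fitness values are hypothetical, chosen only to show how the schemes can disagree; the log scheme additionally assumes strictly positive fitness values.

```python
import math


def min_weight(d_ab, s_a, s_b):
    """Double mutant fitness relative to the sicker single mutant."""
    return d_ab - min(s_a, s_b)


def mult_weight(d_ab, s_a, s_b):
    """Double mutant fitness relative to the product of single mutant fitnesses."""
    return d_ab - s_a * s_b


def log_weight(d_ab, s_a, s_b):
    """Same comparison as mult, on a log2 scale (requires positive fitness)."""
    return math.log2(d_ab) - (math.log2(s_a) + math.log2(s_b))


# Hypothetical pair: each single mutant grows at half the wild-type rate and
# the double mutant at one quarter of it.
s_a, s_b, d_ab = 0.5, 0.5, 0.25
print(min_weight(d_ab, s_a, s_b))   # -0.25
print(mult_weight(d_ab, s_a, s_b))  # 0.0
print(log_weight(d_ab, s_a, s_b))   # 0.0
```

For this pair, min reports a negative interaction while mult and log report none, since the double mutant is exactly as sick as two independent defects would predict. This is the kind of case, with deleterious single mutants, where the choice of null model matters.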

In addition, because this was suggested in the original papers, we also considered a version where the edge weight magnitudes were squared (retaining the positive or negative sign). Squaring the weights is a way of denoising the networks, as weights of magnitude less than 1 will be pulled toward 0, and weights of magnitude greater than 1 will be amplified. Other weighting schemes were considered, such as a cubed version of minimum weighting, but were ultimately discarded due to poor performance or high levels of noise.
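A sign-preserving square is a one-line transformation; the helper name below is hypothetical and the input values are illustrative.

```python
import math


def signed_square(w):
    """Square the magnitude of a weight while keeping its sign."""
    return math.copysign(w * w, w)


for w in (-2.0, -0.5, 0.5, 2.0):
    print(w, "->", signed_square(w))
# -2.0 -> -4.0, -0.5 -> -0.25, 0.5 -> 0.25, 2.0 -> 4.0
```

Weights with magnitude below 1 shrink and those above 1 grow, so the ratio between a strong and a weak edge is itself squared (here from 4 to 16).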

LocalCut parameters

Once a method for choosing edge weights is fixed, LocalCut still has parameters that need to be set. These are m, the number of partitions LocalCut’s max cut algorithm runs, c, the percentage threshold of matching partitions needed to be added to a gBPM, j, the threshold for Jaccard index filtering, and minimumsize and maximumsize, the size requirements for a gBPM’s inclusion in the final output.

We chose to maintain LocalCut’s default values for m, j, minimumsize, and maximumsize (250, 0.66, 3, and 25 respectively), because the changes in scoring methods did not affect what these thresholds were designed to accomplish. More specifically, the minimum and maximum size of a gBPM, as well as j, the amount by which two different gBPMs returned by the algorithm are allowed to overlap, are general parameters of the collection of gBPMs we are trying to find, so it was important to hold them fixed so that the collections of gBPMs returned under different weighting schemes were comparable. The parameter m is a robustness parameter, which ensures that the stochastic choices of the randomized algorithm find consistent gene sets. However, the value of c correlates roughly with the sensitivity of the algorithm, and changes in scoring methods necessitated a closer look at this threshold. We did test different values of m for our different weighting schemes and different values of c, but setting \(m=250\) (the LocalCut default) remained easily large enough to produce stable results in every case (results not shown).

Increasing the value of c leads to fewer genes being incorporated into our gBPMs, and fewer gBPMs overall, but a higher proportion of true positives. Decreasing c can lead to more genes being incorporated into our gBPMs but also more false positives. A scoring metric that pushes most scores close to zero benefits from a greater c value, since there tends to be less low scoring noise. Thus, a greater c value increases the stringency of the algorithm, reducing noise from the inherent randomness in the algorithm. Meanwhile, a scoring metric with a greater standard deviation of scores benefits from a lower c value, since many of the meaningful genes in a gBPM may not pass a greater threshold of c.
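The effect of c can be illustrated with a simplified consistency vote over repeated partitions. This is a sketch under assumptions, not LocalCut’s implementation: it supposes each of the m runs assigns every gene to side 0 or 1, and that a gene joins a seed gene’s pathway (or the opposite pathway) only if it falls on the same (or opposite) side in at least c percent of runs. The function name, gene names and run outcomes are hypothetical.

```python
def consistent_partners(partitions, seed_gene, c=80):
    """Split genes into those consistently with, or opposite to, `seed_gene`.

    `partitions` is a list of dicts mapping gene -> side (0 or 1), one dict per
    randomized run. A gene is kept only if it agrees (or disagrees) with the
    seed gene in at least c percent of the runs.
    """
    m = len(partitions)
    same, opposite = set(), set()
    for g in set(partitions[0]) - {seed_gene}:
        agree = sum(1 for p in partitions if p[g] == p[seed_gene])
        if 100 * agree >= c * m:
            same.add(g)
        elif 100 * (m - agree) >= c * m:
            opposite.add(g)
    return same, opposite


# Ten hypothetical runs: x sides with the seed 9 times, y opposes it 8 times,
# and z sides with it only 6 times.
runs = [
    {"seed": 0, "x": 0 if i < 9 else 1, "y": 1 if i < 8 else 0, "z": 0 if i < 6 else 1}
    for i in range(10)
]
for c in (60, 80, 90):
    print(c, consistent_partners(runs, "seed", c))
```

With c = 60 all three genes are assigned; with c = 80 the weakly consistent gene z drops out; with c = 90 only x remains. Stricter thresholds therefore yield smaller gene sets.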

Enrichment validation

In order to validate that our gBPMs represent biologically meaningful pathways, we first examine the Gene Ontology (GO) functional enrichment of our pathways. Though each theoretical gBPM has two distinct pathways, because they represent compensatory pathways, we expect them to be enriched for the same or similar terms. Enrichment analysis was performed using the g:Profiler API22. We used the 2,727 genes found in our input as the background, and corrected for multiple testing using the Benjamini-Hochberg correction23, selecting for \(p < 0.05\). A pathway was considered enriched if g:Profiler found the pathway enriched in a GO term with fewer than 500 total genes. BPMs can then be sorted into four categories: No Pathways Enriched, meaning neither pathway is enriched under our definition; One Pathway Enriched, meaning one of the two pathways is enriched; Enriched for Different, meaning both pathways are enriched but do not share any commonly enriched terms (with fewer than 500 total genes); and Enriched for Same, meaning both pathways are enriched for at least one identical term.
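The four-way categorization can be sketched as follows, assuming the enrichment tool has already returned, for each pathway, the significantly enriched terms together with the number of genes annotated to each term. The function name, input format and term labels are hypothetical; the sketch does not call g:Profiler.

```python
def categorize_gbpm(terms_a, terms_b, max_term_size=500):
    """Assign a gBPM to one of the four enrichment categories.

    `terms_a` and `terms_b` map each significantly enriched term of a pathway
    to the total number of genes annotated with that term. Terms annotating
    `max_term_size` or more genes are ignored as too general.
    """
    keep_a = {t for t, size in terms_a.items() if size < max_term_size}
    keep_b = {t for t, size in terms_b.items() if size < max_term_size}
    if not keep_a and not keep_b:
        return "No Pathways Enriched"
    if not keep_a or not keep_b:
        return "One Pathway Enriched"
    if keep_a & keep_b:
        return "Enriched for Same"
    return "Enriched for Different"


print(categorize_gbpm({"term_x": 120}, {"term_x": 120, "term_y": 40}))  # Enriched for Same
print(categorize_gbpm({"term_x": 120}, {"term_z": 900}))                # One Pathway Enriched
```

In the second call, the only term enriched in the second pathway annotates 900 genes, so it is discarded and the pathway counts as not enriched.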

Gene expression validation

Finally, our pathways should contain genes that are generally correlated with each other when looking at gene expression. The SPELL search engine24 version 2.0.3 is integrated with the Saccharomyces Genome Database (SGD)25, and searches for yeast microarray expression datasets that contain a set of genes, to look for expression correlation. We scored each pathway (gBPM module) by taking the average correlation between each gene and every other gene in the pathway across each study. To create a null distribution, we randomized the genes in each gBPM module, drawing from the total pool of genes, and calculated the same scores. This removed any bias from the different number of genes in each pathway. To account for potential bias in the genes chosen in a module, we also tested a null distribution where we shuffled the genes in each module (ensuring that no genes were repeated in any given module). We preferred edge weighting schemes where the gene expression correlation for the genes in an individual gBPM pathway is well separated from random.
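A minimal version of this scoring for a single expression dataset might look like the sketch below. It uses Pearson correlation and a size-matched random null; the choice of Pearson, the function names and the toy expression profiles are assumptions of this example, and the sketch does not query SPELL. Profiles are assumed to be non-constant so that the correlation is defined.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def module_coexpression(expr, genes):
    """Average pairwise correlation among the genes of one module."""
    return mean(pearson(expr[g], expr[h]) for g, h in combinations(sorted(genes), 2))


def random_null(expr, size, n_draws=1000, seed=0):
    """Scores of random gene sets of the same size drawn from all genes in `expr`."""
    rng = random.Random(seed)
    pool = sorted(expr)
    return [module_coexpression(expr, rng.sample(pool, size)) for _ in range(n_draws)]


# Toy expression profiles over four conditions (hypothetical values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
observed = module_coexpression(expr, ["g1", "g2"])
null = random_null(expr, size=2, n_draws=200)
frac = sum(s >= observed for s in null) / len(null)
print(observed, frac)
```

Comparing the observed score with the null scores, for example through the fraction of random sets that score at least as high, indicates how well a module separates from random gene sets of the same size.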

Experimental setup

We used the raw scores for sickness and health of single and double mutant yeast strains from Costanzo et al.12. Because of computational constraints in running LocalCut, we did not include all yeast genes, but first used a PPI network (as downloaded from STRING26, version 11.5) to cluster sets of yeast genes as in Kolawole and Cowen27. Using physical interaction edges only and the “combined” confidence weight provided by STRING, unchanged, we ran cDSD28 plus spectral clustering29 to cluster the genes in the PPI network, as implemented in the glidetools30 package plus the scikit-learn31 implementation of spectral clustering. We asked spectral clustering for four clusters, and merged the largest and smallest clusters (since the smallest cluster contained only 2 genes). This gave us a subset of 2,727 genes out of the 5,848 total yeast genes in the dataset of Costanzo et al.12, from which we constructed a complete genetic interaction network, weighting edges as described above, in which we search for gBPMs. The list of these 2,727 genes appears in the supplement.

Results

gBPM collections

We first explore the number of different gBPMs (where LocalCut only includes new gBPMs if their Jaccard similarity to an existing gBPM is no more than \(j = 0.66\), matching previous studies), the pathway size, and GO enrichment for each of the six different edge weighting schemes we test. We find that each of these values depends strongly on how the LocalCut consistency threshold c is set, and varies in turn by weighting scheme.

Our first result is that squaring the edge weights improves performance for all three ways of calculating the weights (mult, min and log) and almost all settings of c. There is no consistent trend for the number of modules produced by the squared or unsquared versions of the weights (see Fig. 2), except for the min weighting scheme, where the unsquared weights result in the discovery of so few modules that we ignore min with unsquared weights for the rest of this discussion. When we examine the percent of BPMs with both modules functionally enriched (sum of blue and orange bars in Fig. 3), as well as the proportion of enriched modules across all the returned BPMs, squaring the weights increases these values, usually substantially (see Fig. 4). On the other hand, the average number of genes in the modules goes down somewhat in almost all settings when the weights are squared. This is what would be expected, as some of the extra genes are most likely noise, particularly as c is relaxed. The average module size across all weighting schemes lies between 4 and 12 genes (see Fig. 5), with smaller average module size at stricter c values. As c is decreased, it becomes more likely that genes that do not belong are included in a gBPM by chance. However, if gBPMs are being post-checked for enrichment, a more permissive c value may be preferred. For the rest of the discussion, we consider the squared versions of the edge weights.

In general, with squared weights, the 48 modules found by the min weighting scheme with \(c=70\) give the best values for the proportion of enriched modules as well as the largest proportion of gBPMs with both pathways enriched for the same GO function (Fig. 3). On the other hand, mult with \(c=70\) and log with \(c=70\) gave, by an order of magnitude, the largest number of candidate gBPMs (681 and 559, respectively). However, over one third of these candidate gBPMs display no known enrichment in either pathway, indicating that perhaps more than a third of them are just noise. Even when a gBPM is enriched, we expect that this setting of c is too permissive and will result in many spurious extra genes being included in the gBPM sets returned with this parameter.

Fig. 2

When the weights are squared, the number of gBPM modules increases as the consistency parameter c is reduced from 90 to 70. Min produces the fewest gBPM modules, and almost none when not squared. When the weights are not squared, there are a larger number of gBPM modules with \(c=90\) and 80 for both mult and log, but many fewer gBPM modules than result from squaring the weights when \(c = 70\).

Fig. 3

The proportion of enriched modules is consistently improved with squared weights for each of the mult, min and log weighting schemes. Discounting min not squared, which produces at most 4 modules, min (squared) has the largest proportion of gBPMs with both pathways enriched for some known function (blue plus orange bars). On the other hand, mult (squared) with \(c=90\) has the smallest proportion of gBPMs with neither module enriched for a known function (red bar): over 90% of the gBPMs returned by this method have at least one component pathway functionally enriched.

Fig. 4

Discounting min (not squared), which only produces 4 modules total, only mult (squared) at \(c=90\) and min (squared) at \(c=90\) and \(c=80\) produce module collections from their gBPMs that are over 90% enriched for known function. All weights and c values we tested produced module collections that were at least 50% enriched for known function. Squaring the weights for mult and log at the same c value improves the proportion of enriched modules across the board.

Fig. 5

Average module size across all choices of weighting scheme, squaring and c parameter lies between 4 and 12 genes. Unsurprisingly, \(c=90\) gives the smallest average module size for all weighting schemes (again ignoring min (not squared), which produces very few gBPMs total).

Co-expression validation

We next compared the co-expression of the component pathways in our gBPMs with random gene sets of the same size. Results for mult (squared) with \(c=80\) and \(c=90\) appear in Figs. 6 and 7, respectively. Full results for all weighting schemes appear in the supplement, against both random sets of genes (Figures S1-S12) and randomly shuffled sets of genes (Figures S13-S27). As can be seen in Figs. 6 and 7, there are a substantial number of gBPM modules whose expression correlation behaves differently from random gene collections when weights are squared. On the other hand, with unsquared weights, while the mult weighting scheme still produces gBPM modules that are more correlated than random genesets, it is nearly indistinguishable from random sets of genes with \(c=80\) (see Fig. 8). In general, the log and min weighting schemes also produce gBPM modules that are skewed from random genesets, especially with squared weights (see the graphs in the supplement).

Fig. 6

The gBPM component pathways produced by mult with squared weights and \(c=80\) correlate significantly more often with gene expression than random genesets.

Fig. 7

The gBPM component pathways produced by mult with squared weights and \(c=90\) correlate significantly more often with gene expression than random genesets.

Fig. 8

The gBPM component pathways produced by mult with unsquared weights and \(c=80\) look very close to random genesets in terms of their coexpression correlation.

Some example interesting gBPMs and pathways

gBPMs found across multiple weighting schemes

Some of the gBPMs and pathways we find are recapitulated across multiple weighting schemes. These gBPMs are very robust to the weighting method, and include genes that are part of redundant pathways that have been identified in previous work. Because the Jaccard threshold above which we discard overlapping gBPMs is only set to 2/3, in some of the weighting schemes we discover different parts of these same redundant pathways multiple times. One of the most robust gBPMs that we find consistently across weighting schemes includes 4 components of the COG complex as one of the pathways. The COG complex consists of 8 genes (COG1-COG8), but deleting COG5, COG6, COG7 and COG8 simultaneously yields yeast strains that look like wildtype in laboratory conditions32. These 4 genes from the COG complex were also highlighted as a compensatory pathway in previous work9. The compensatory pathway involves genes such as VPS29 and VPS35, involved in retromer transport33, as well as the cargo receptor protein ERV1434. Other proteins appearing in the compensatory pathway opposite the COG genes include IRS4, SYS1 and RAV1, identified by Bonangelino et al.35 as vacuolar protein sorting genes. Since the COG complex is well-known to be involved in Golgi trafficking36, it is possible that some of the retromer transport, cargo receptor, and vacuolar protein sorting genes identified can partially compensate in routing.

Another fault-tolerance mechanism that is recapitulated across multiple weighting schemes involves the well-known parallel pathways for DNA double-strand break repair. This was also highlighted in Kolawole and Cowen27. The error-free RAD52 pathway and the error-prone zeta DNA polymerase complex (containing genes REV1, REV3, and REV7) pathway for rescuing replication fork arrest are components of two alternative pathways for repairing DNA breaks37. Portions of these pathways are recovered under most of the gBPM weighting schemes we examined. Another gBPM that showed up across multiple weighting schemes involved two sets of genes involved in mitotic spindle positioning: for example, Rizk et al.38 demonstrate that KIP3 plays a major role in regulating spindle length during anaphase, and it is found opposite genes critical to the dynein pathway39,40.

Interestingly, different weighting schemes also found multiple gBPMs containing one pathway including the peroxisomal genes PEX1, PEX2, PEX3, PEX4, PEX8, PEX10, PEX13, and PEX14, whose functions in peroxisomal matrix protein import in S. cerevisiae are reviewed in Akşit and van der Klei41. The genes in the other pathway are variable, but always include the IML1 gene. There is no supporting literature that explains any role for IML1 in the peroxisomal matrix, so the consistency of this gene in the opposite pathway is unexpected and interesting.

gBPMs for particular weighting schemes

Looking at the multiplicative weighting scheme (squared) with \(c=80\), here is an example of an interesting gBPM that is not found using the min weighting scheme, and only overlaps pathways of the log weighting scheme when setting \(c=70\): Pathway 1 consists of the genes UBC7, EMP24, LAS21, SPC2 and OST3, whereas Pathway 2 consists of the genes HAC1, OST6, and IRE1. The synthetic lethality relation of the paralog pair OST3 and OST6 was already studied42. A relation between HAC1 and IRE1 was also known, where the accumulation of unfolded ER proteins activates the transmembrane kinase/nuclease Ire1p, triggering alternative splicing of HAC1 mRNA, initiating the unfolded protein response pathway43. As noted by Jonikas et al.44, deletions of either UBC7 or SPC2 will also initiate the unfolded protein response pathway. We thus hypothesize that both pathways in this gBPM are redundant mechanisms to trigger the unfolded protein response pathway.

Conversely, with the logarithmic weighting scheme (squared) and \(c=90\), a set of genes that is functionally enriched for “anaerobic respiration” in one gBPM module is found opposite multiple different sets of genes with no clear known functional coherence. The involved genes include BTS1, known to be essential for anaerobic growth45, and MSN4, known to be involved in regulating key genes in the anaerobic response46. Additional genes in the module include DIA2, KIN1, NUP120, GUT2, MDM31, YFH7, and DOA1, where deletion mutants of DIA2 and YFH7 show up in large scale surveys as having decreased anaerobic growth.

A list of all gBPMs organized into their 2 component pathways that are found under all 18 parameter choices we test (3 values of c, weighting schemes mult, min and log, weights squared or unsquared) is provided as supplementary files.

Discussion

We revisited the question of how best to compute pairwise genetic interaction weights to find putative compensatory gene sets in yeast. Looking at pairs of genes in isolation, Mani et al.11 showed convincingly that a multiplicative weighting scheme was more sensitive in signaling which pairs of genes show any epistatic effect at all. However, we find that when using a method such as LocalCut9 that searches for consistent epistatic patterns across multiple gene pairs, emphasizing the stronger epistasis signals (by squaring weights) uncovers more meaningful genesets.

We also found c to be an important hyperparameter whose optimal value can differ based on the weighting scheme chosen. For mult and log, we recommend setting c above 70 to reduce noise, while for min, setting c to 70 allows it to find a greater range of gBPMs.

Our results also demonstrated that different schemes can find unique BPMs not found by any other schemes, indicating that a full search for redundant pathways may benefit from running LocalCut multiple times with different weighting schemes. We suggest looking at multiple different weighting schemes to uncover interesting redundant pathways in future work.