PMC:1481596 / 25506-25698 JSONTXT

Decomposition of overlapping protein complexes: A graph theoretical method for analyzing static and dynamic protein associations Abstract Background Most cellular processes are carried out by multi-protein complexes, groups of proteins that bind together to perform a specific task. Some proteins form stable complexes, while other proteins form transient associations and are part of several complexes at different stages of a cellular process. A better understanding of this higher-order organization of proteins into overlapping complexes is an important step towards unveiling functional and evolutionary mechanisms behind biological networks. Results We propose a new method for identifying and representing overlapping protein complexes (or larger units called functional groups) within a protein interaction network. We develop a graph-theoretical framework that enables automatic construction of such representation. We illustrate the effectiveness of our method by applying it to TNFα/NF-κB and pheromone signaling pathways. Conclusion The proposed representation helps in understanding the transitions between functional groups and allows for tracking a protein's path through a cascade of functional groups. Therefore, depending on the nature of the network, our representation is capable of elucidating temporal relations between functional groups. Our results show that the proposed method opens a new avenue for the analysis of protein interaction networks. Background A major challenge in systems biology is to understand the intricate network of interacting molecules. The complexity in biological systems arises not only from various individual protein molecules but also from their organization into systems with numerous interacting partners. In fact, most cellular processes are carried out by multi-protein complexes, groups of proteins that bind together to perform a specific task. Some proteins form stable complexes, such as the ribosomal complex that consists of more than 50 proteins and three RNA molecules, while other proteins form transient associations and are part of several complexes at different stages of a cellular process. A better understanding of this higher-order organization of proteins into overlapping complexes is an important step towards unveiling functional and evolutionary mechanisms behind biological networks. Data on protein complexes are collected from the study of individual systems, and more recently through high-throughput experiments, such as yeast two-hybrid (Y2H) [1,2] and tandem affinity purification followed by mass spectrometry (TAP/MS) [3,4]. The TAP/MS approach helps pinpoint proteins that interact with a tagged bait protein, either directly or indirectly, and are thus suited to identify multi-protein complexes. In fact, several research groups have systematically applied TAP/MS technology to study protein complexes involved in different signaling pathways [5]. Protein interactions are routinely represented as graphs, with proteins as nodes and interactions as edges (links). Therefore, it is not surprising that analysis of protein interaction networks reach out for a variety of graph-theoretical tools. Following the observation that protein interaction networks display a characteristic power-law like node degree distribution [6], a substantial body of research focused on statistical properties of protein interaction networks [7,8]. In 1999, Hartwell et al. [9] introduced a notion of a functional module, a group of cellular components and their interaction that can be attributed a specific biological function. The authors also suggested the modular organization of molecular interaction networks, where each functional module involves a small number of cellular components and is autonomous, i.e., its interaction with other modules is limited to a few cellular components. Subsequently, this assumption was used in several computational methods to identify protein complexes and functional modules in high-throughput protein interaction networks [10-15]. Some methods [10-13] look for densely connected subgraphs within a protein interaction network, either cliques or "cliquish" components. For example, Spirin et al. [13] use the term functional module to denote groups of proteins which are densely connected within themselves but sparsely connected with the rest of the network. Other methods [14,15] combine protein interaction with other information to identify functional modules, such as signal transduction pathways, that do not necessarily correspond to densely connected regions of the network. In a recent paper, Gagneur et al. applied modular decomposition to elucidate the organization of protein complexes [16]. The basic principle behind modular decomposition is to iteratively identify and contract nodes that are in a certain sense equivalent, until no more equivalent nodes can be found in the graph. A graph is called prime if it cannot be decomposed any further. Only graphs that belong to a very special graph family called cographs can be completely decomposed (that is, the iterative reduction process does not halt with a non-trivial prime graph). While the modular decomposition provides an excellent description of combinatorial variants within a family of complexes, it does not impose any order on the complexes within the family. As such it lacks the description power to represent the dynamics of complex formation, i.e., the manner in which proteins form transient interactions to participate in the complexes within the family. The order imposed on protein complexes within the family is particularly interesting if the family corresponds to a functional module where biological function is achieved through a dynamic formation of protein complexes and the order reflects this formation. In this work, we model a functional module as a union of overlapping dense subnetworks called here functional groups. A functional group is either a maximal clique (typically representing a protein complex) or a set of alternative variants of such complexes/cliques. As components of a larger functional module, functional groups are not assumed to be well separated and can have significant overlaps. Intuitively, if a functional module performs a function that requires a sequence of steps (like in the case of a signaling pathway) then we would like functional groups to be snapshots of protein associations at these steps. We propose a new method for identifying and representing overlapping functional groups in a functional module. Furthermore, if the module corresponds to a dynamic process that requires certain complexes (or more generally functional groups) come into contact in a specific order, our method attempts to discover this order. Our method is motivated by a fundamental result for chordal graphs [17], which states that every chordal graph has the so called clique tree representation. However, not every protein interaction network is chordal and not every functional group is a clique. Therefore, we developed a graph-theoretical framework that enables automatic construction of a tree-like representation, analogous to the clique tree representation, for much broader family of graphs. We call this representation the Tree of Complexes representation. The nodes in the tree are functional groups, and for every protein, the set of functional groups that contain this protein forms a single subtree. The "single subtree" requirement restricts significantly the way in which the nodes of the tree can be interconnected. As a consequence, this representation shows a smooth transition between functional groups and allows for tracking a protein's path through a cascade of functional groups. Therefore, depending on the nature of the network, the representation may be capable of elucidating temporal relations between functional groups. We developed a new method, Complex Overlap Decomposition (COD), that given a protein interaction network identifies its functional groups and constructs the Tree of Complexes representation. Our method requires that the network satisfies certain mathematical properties. We applied the COD method to several protein interaction networks, such as the TNFα/NF-κB signaling pathway and the pheromone signaling pathway. The corresponding subnetworks for all interaction networks are extracted from high throughput experimental data. Our results show that the COD method opens a new avenue for the analysis of protein interaction networks. Results and discussion One way to represent a set of overlapping functional groups is to construct a graph with nodes representing functional groups and edges representing overlaps, i.e., there exists an edge between two functional groups if and only if they share at least one protein. This approach has two shortcomings. First, it is not obvious how to correctly identify functional groups, and second, such a representation does not provide any information about the dynamics of proteins in the network. We propose a graph-theoretical approach, which, under the assumption that the protein interaction network satisfies certain mathematical properties, identifies functional groups and provides a representation of overlaps between functional groups in the form of the Tree of Complexes. Here, we first describe the COD method. Then, we demonstrate the utility of our approach by applying the COD method to several examples, derived from high-throughput experiments, TNFα/NF-κB and pheromone signaling pathway interaction networks. Complex overlap decomposition Our method of representing overlapping functional groups, which is depicted in Figure 1, builds on chordal and cograph graph theories. Chordal graphs constitute an important and well studied graph family [18,19]. A chord in a graph is any edge that connects two non-consecutive nodes of a cycle. A chordal graph is a graph which does not contain chordless cycles of length greater than three. Figure 1 Complex Overlap Decomposition. A simplified illustration of the Complex Overlap Decomposition (COD) method. An edge, (3, 4), connecting a pair of weak siblings is added to the graph. A fill-in edge between proteins 5 and 8 is added to eliminate all five 4-cycles in the graph: {5, 6, 8, 7}, {1, 5, 7, 8}, {2, 5, 7, 8}, {1, 5, 6, 8}, and {2, 5, 6, 8}. If the modified graph is chordal, all clique tree representations are computed (cf. Methods). Each clique tree representation results in a Tree of Complexes representation, where the Tree of Complexes is constructed by projecting each maximal clique in the modified graph, G*, to a functional group in the original graph G. For example, a four node maximal clique, {1, 2, 5, 8}, in G* is projected to a four node functional group in G, by removing a fill-in edge (5, 8). Each functional group is represented by a Boolean expression, such as (1 ∧ 2) ∧ (5 ∨ 8), which means that the functional group contains two variants of a complex, {1, 2, 5} and {1, 2, 8}. An important property of chordal graphs, which is explored directly in this paper, is that every chordal graph has a corresponding clique tree representation or clique tree [17]. The nodes in the tree are maximal cliques. Moreover, for every node in the graph, a set of maximal cliques that contain this node form a connected subgraph of the clique tree. Thus, there is a mapping between the nodes in the graph and subtrees in the clique tree. The "connected subgraph" requirement puts constraints on the topology of the clique tree. In fact, the topology of the tree is determined by the structure of overlaps between the maximal cliques in the graph. Thus, the clique tree captures information about the structure of the overlaps, which is lost in a simple clique intersection graph as shown in the example below. Example Consider a hypothetical protein interaction network in Figure 2(a). This network is chordal and its maximal cliques are listed in Figure 2(b). We want to contrast the clique tree representation in Figure 2(d) to a naive representation in Figure 2(c), where every pair of maximal cliques that contain a protein in common is connected by an edge. Figure 2 A Hypothetical Protein Interaction Network. (a) A hypothetical protein interaction network. (b) A list of all maximal cliques in the network. (c) A naive representation of overlaps between maximal cliques. Each maximal clique is a node and there is an edge between two maximal cliques if and only if they share a protein. (d) The clique tree representation. Once again, every maximal clique is a node, but the cliques are connected in such a way that the resulting graph is a tree. Moreover, cliques that contain a given protein form a connected subgraph. (e) This color scheme is used to show the subtree of every protein. For example, protein 3 is contained in maximal cliques A, B, and C, which is shown by placing yellow dots above the maximal cliques. While both representations show the overlap between maximal cliques, the interconnection pattern of cliques in the naive representation carries little additional information about the structure of this overlap. On the other hand, a very specific tree-like interconnection pattern in the clique tree representation can expose a special structure of such overlap. For example, consider maximal cliques B through F. In the naive representation, the overlap between these maximal cliques is collapsed to a clique. Thus, the representation treats the maximal cliques and overlaps between them equally. In particular, there is no way to tell that, for example, D occupies a more central position in the network than B. In the clique tree representation this information can be extracted from the relative position of cliques in the tree. For example, B is connected to F by a path that passes through C and D, which means that any protein shared by B and F is also contained in C and D. In other words, the overlap between B and F is entirely contained in the overlap between B and D, which in turn is entirely contained in the overlap between B and C. Thus, there is a correlation between the amount of overlap between maximal cliques and their distance in the clique tree. Nice properties of the clique tree mentioned above make it a good choice for representation of overlaps between functional groups. However, not every protein interaction network is chordal and maximal cliques may not always be the best way to represent functional groups. For example, in Figure 1, cliques {1, 2, 3} and {1, 2, 4} may correspond to two variants of one complex, where proteins 3 and 4 replace each other, forming one rather than two functional groups. Therefore, in the COD decomposition we relax the assumption that every functional group is a single protein complex (maximal clique) and allow it to contain several protein complexes (maximal cliques). In doing so we have to ensure that a functional group is not just any collection of protein complexes but rather a set of closely related protein complexes which represent possible variants of one complex (such as complexes {1, 2, 3} and {1, 2, 4} in the above example). To capture this systematically we model a functional group with a graph from a family of graphs known as cographs. Cographs are another well-studied graph family [20]. A cograph can be characterized by an absence of an induced subgraph which is a path of length four (P4), where the length is the number of nodes in the path. Thus, the diameter of a connected cograph is at most two. Subsequently, connected cographs are dense and cliquish, consistently with the assumption made by algorithms that delineate protein complexes. What makes cographs even more attractive is that for every cograph there exists a Boolean expression which describes all the maximal cliques in the graph. (In terms of modular decomposition used in [16] it means that a cograph can be decomposed by modular decomposition without leaving non-trivial non-decomposable prime module.) This Booolean expression describes in a compact and hierarchical way all the possible variants of protein complex within a functional group. The main idea behind COD method is to provide a representation of a functional module, which is analogous to the clique tree, in which nodes are cographs (representing variants of protein complex within a functional group) rather than maximal cliques. If we knew in advance all the functional groups in the module, we could simply connect the proteins within each functional group turning it into a clique and, under the assumption that the resulting graph is chordal, apply clique tree construction algorithm to the graph. Since we do not have predefined functional groups, our algorithm identifies them by adding edges to the graph in such a way that each added edge connects a pair of nodes that putatively belong to the same functional group. The COD method's edge addition strategy and its biological motivation builds on a concept of weak siblings. We call a pair of nodes weak siblings if and only if they are connected to the exactly the same set of neighbors, but are not connected to each other. In terms of protein interaction networks, weak siblings are proteins which interact with the same set of proteins but do not interact with each other. In particular, proteins that can substitute each other in a protein interaction network may have this property. Similarly, a pair of proteins that belong to the same complex but are not connected due to missing data or an experimental error will be represented as weak siblings. Since the weak siblings relationship suggests functional similarity, the COD method takes a first step towards delineation of functional groups by connecting every pair of weak siblings by an edge. As this modification may also eliminate some of the chordless cycles of length four (squares) in the graph, functional group delineation happens simultaneously with transformation of the protein interaction graph into a chordal graph. If, after connecting all pairs of weak siblings, the resulting graph is not chordal, the COD method attempts to transform it to chordal by adding some additional edges. Consistently with our assumption that we connect only nodes corresponding to proteins that could be put in the same functional group, we impose restrictions on this "fill-in" process. Namely, we require that, each introduced edge connects a pair of nodes which are close to being weak siblings. In such case the new edge is a diagonal of one or more squares in the graph. We emphasize that adding edges between nodes of longer cycles has no such justification. To summarize our edge addition procedure, our method attempts to eliminate all the squares in the protein interaction network by adding a limited set of diagonals that satisfies following conditions (i) connects potentially functionally equivalent proteins, as measured by the overlap in neighborhoods or distance from being a pair of weak siblings; (ii) ensures that functional groups correspond to cographs; we argue that this condition is guaranteed if the set of added edges does not form a P4 in a maximal clique of the modified graph (cf. Methods); If the modification step succeeds, i.e., the modified graph is chordal, all the clique tree representations of the modified graph are computed and then each clique tree is extended to a Tree of Complexes representation of the original graph. The COD algorithm keeps track of all the edge additions and uses this information to delineate functional groups by projecting each maximal clique onto original network and removing all introduced edges contained in the clique. For example, in the modified graph of Figure 1 a maximal clique with four nodes, {1, 2, 5, 8}, is projected to a functional group by removing an edge connecting proteins 5 and 8. This functional group contains two variants of a protein complex, {1, 2, 5} and {1, 2, 8}, which are compactly represented by a (1 ∧ 2) ∧ (5 ∨ 8) Boolean expression. If, on the other hand, the modified graph is not chordal, the COD method stops without producing the representation. Since the clique tree representation for a chordal graph is not unique, the Tree of Complexes representation that derives from it is not unique either. As all clique trees of a chordal graph have the same set of nodes (the nodes are the maximal cliques in the graph), the difference between clique trees comes from the topology of the tree. The clique tree topology is determined by the "connected subgraph" constraints and restriction power of these constraints depends on the structure of the underlying graph, i.e., there are graphs with a unique clique tree representation and there are graphs for which almost any tree that spans all the maximal cliques in the graph is a valid clique tree. As a result a protein interaction network may have several Tree of Complexes representations; all such representations will have the same functional groups but will differ in the way these functional groups are interconnected. For every protein interaction network analyzed bellow we explicitly state all the possible Tree of Complexes representations. TNFα/NF-κB signaling pathway To illustrate the power of COD in elucidating the dynamics behind protein complexes, we consider the TNFα/NF-κB signaling pathway. The Nuclear Factor κB (NF-κB) family of transcription factors is activated in response to a diverse set of stress stimuli, which includes pro-inflammatory cytokines, e.g., TNFα. In vertebrates, this family includes p50, p52, Rel A, c-Rel, and Rel B, which bind to the DNA in a homo or heterodimeric fashion. The NF-κB activity is regulated by the IκB family of proteins via inhibitory ankyrin repeat domains. This family includes IκBα, IκBβ, and IκB∈. The precursors of p50 (p105) and p52 (pl00) also possess ankyrin repeat domains and thus act as inhibitors. These precursors can also form dimers with other members of the NF-κB family. The activation with the pro-inflammatory cytokine tumor necrosis factor TNFα triggers a signaling cascade, which, in particular, stimulates the activation of the IKKα, IKKβ, and IKKγ functional groups. The IKKs initiate a signal induced degradation of the inhibitors (IκBs), and subsequent nuclear translocation of the transcription factor. Recent TAP experiments [5] provide a wealth of new information regarding this important signaling pathway. Bouwmeester et al. identified 221 molecular associations, out of which only 80 where previously known. Gagneur et al. [16] applied modular decomposition to the network of these associations but the decomposition halted quickly at large non-decomposable modules. We used the COD method to analyze the subnetwork spanning all the paths from NIK (NFκB-inducing kinase phosphorylating IKKα and IKKβ) with at most three edges. For the purpose of the analysis, we contracted all five members of NF-κB family into one node. As the resulting protein interaction network, shown in Figure 3(a), is chordal without weak siblings, functional groups correspond to maximal cliques in the network. Figure 3 TNFα/NF-κB Signaling Pathway. The TNFα/NF-κB signaling pathway. (a) The network. (b) The Tree of Complexes representation. The flow of action is visually represented by background colors: green for activators (IKKs) and yellow for inhibitors (IκBs, and p100). The NIK kinase is in the first functional group (A), together with all three members of the IKK complex and p100. Functional group B includes, in addition to p100, the IKKs and two inhibitors IκBα and IκBβ. This group is the beginning of interaction between IKKs and IκBs. Functional group C loses some of the IKKs, continues to show IκB and begins to show interaction between IκBs and NF-κB factors. Finally, in group E we see the entrance of NIK-independent Col-Tpl2 kinase. For this network, there are two alternative Tree of Complexes representations: functional group E can be connected to either D or C. The representation that maximizes the number of leaves is shown in Figure 3(b). One can clearly see the interplay between the activators and inhibitors. Proteins p105 and NF-κB participate in the same functional groups and thus follow the same path in the tree. The same is true for the pair of proteins IkBα and IkBβ. The Tree of Complexes captures this by grouping p105 and NF-κB, and IkBα and IkBβ. Pheromone signaling pathway The yeast Saccharomyces cerevisiae may be present in one of two haploid cell types, which are able to mate. Pheromones released by one type of cell bind to a specific receptor of the other type. This triggers the activation of a scaffold protein-bound mitogen-activated protein kinase (MAPK) cascade and subsequent activation of nuclear proteins that control subsequent cellular events. In a recent paper, Spirin et al. [13] identified a subnetwork of proteins involved in this process within a yeast protein interaction network [21]. We analyzed this subnetwork using the COD to see if our method can extract elements of temporal ordering. The subnetwork identified by Spirin et al. and its Tree of Complexes representation is given in Figure 4. In this case, the protein network is not chordal. First, the COD method identifies and connects a pair of weak siblings, MKKl and MKK2 .Then, to transform the network to a chordal graph, three additional edges are added: (SPH1, SPA2), (FUS3, KSS1), and (STE11, STE7). In this case, some functional groups will contain more than one protein complex. Figure 4 Pheromone Signaling Pathway. The pheromone signaling pathway. (a) The network. (b) The Tree of Complexes representation. For the description of the elements of the tree see the text. This network admits six different Tree of Complexes representations: (i) functional group H can be connected to either B or C; (ii) any interconnection pattern that spans groups E, F, and G can be chosen. If we ask for a tree with maximum number of leaves, the number of tree variants is reduced to two (option (i)). The MAPK cascade module consists of three sequentially acting protein kinases: MAP kinase kinase kinase (STE11) MAP kinase kinase (STE7) and MAP kinase (KSS1, FUS3) [22]. MKKl and MKK2 are two redundant protein kinase kinases (most similar to STE7) [23]. Their redundancy is properly captured by the ∨ (OR) in their functional group (H). The MAP kinases KSS1 and FUS3 are two separate kinases both activated by STE7 each of which is essential for a different program: FUS3 – for mating; KSS1 – for the filamentous growth [24]. Once again this is correctly captured by ∨ (OR) in groups F and G. STE5 is a scaffold protein of the MAPK module. It recruits MAPK module kinases (STE11, STE7, FUS3). This is consistent with the central position of a functional group containing STE5 in the tree and relative to the paths of STE7, STE11 and FUS3. Finally, nuclear proteins DIG1 and DIG2 (necessary for transcription inhibition, which are regulated by both FUS3 and KSS1) enter at the endpoint (node F) in the tree. Conclusion Recent advances in experimental techniques resulted in the accumulation of a vast amount of protein interaction information, which is routinely represented by protein interaction networks. Therefore it is not surprising that increasingly more complex graph-theoretical tools are deployed to analyze protein interaction graphs and extract biologically meaningful patterns. In general, graphs are not required to have any type of regularity. This makes them a very flexible tool which is able to represent complex relationships. However, this often also makes them computationally hard to deal with, for many problems in graph theory are NP-complete. Frequently graph theoretical problems can be simplified if some restrictions are imposed on the graph. Various restrictions give rise to various graph families. Given a graph family, it is usually very useful to be able to represent it using some kind of a tree. Such tree representation exposes a hierarchical organization that a graph may have, allowing for a structured analysis of it. In this work we proposed a tree representation for protein interaction graphs called Tree of Complexes representation. Nodes in the Tree of Complexes are functional groups and the tree satisfies the additional condition that functional groups that contain any fixed protein form a connected subgraph. In this way, our representation captures not only the overlap between functional groups but, potentially, also the manner in which proteins enter and leave their enclosing functional groups. We developed a method (together with the corresponding graph-theoretical theory) for efficient identification of such overlapping functional groups and construction of the corresponding Tree of Complexes. In particular, our method differs from other approaches in that it does not attempt to enumerate disjoint complexes but instead identifies and represents relations between overlapping functional groups. Even though the Tree of Complexes representation is not unique, the protein interaction networks that we analyzed admit very few alternative tree topologies. If we ask for a tree topology with a maximum number of leaves, as not to impose an artificial order between functional groups, the number of tree topologies is reduced even further. Thus, in the TNFα/NF-κB signalling pathway this results in a unique Tree of Complexes representation and in the pheromone signalling pathway in two very closely related possible Tree of Complexes representations. The nature of high-throughput protein interaction data does not directly imply that this data encodes temporal relations. We demonstrated that our method is frequently capable of discovering such temporal relations. Interestingly, temporal associations can also be implicated in the absence of actual interaction in the data. For example, in the case of the pheromone signaling pathway, our method correctly included KSS1 and FUS3 in the same functional group (treated here as temporal associations), despite the fact that there is no link between KSS1 and FUS3 in the input protein interaction network. Obviously, there are limitations to deciphering such temporal relations. For example, we cannot provide temporal ordering between different tree branches. Furthermore, if a functional group is not a clique but is represented as a Boolean expression indicating various possibilities for such group, then one can not be sure if these variants are mutually exclusive or if they represent partial information capturing incomplete data. Even in the case when the functional unit forms a clique it is still possible that it contains interactions that are not simultaneous. For example, interactions between pairs (A, B), (B, C) and (C, A) are represent as a three-vertex clique with nodes A, B, C and thus cannot be distinguished from a trimer (A, B, C). Such coincidences are less likely for larger cliques. Although our algorithm is not guaranteed to produce the Tree of Complexes representation for every possible protein interaction network, the algorithm will succeed for a broad family of graphs, which includes chordal graphs (and thus interval graphs) and cographs. Currently, our method can be applied to protein interaction networks that do not contain long (longer than four node) chordless cycles. As a consequence, it is more appropriate for analyzing dedicated subnetworks or modules than large protein interaction networks, which are expected to contain such long cycles. We distinguish between two different types of problematic networks for our method. First type includes networks for which imposing a temporal order that encompasses all functional groups in the network is meaningless. Second type includes networks for which such order is meaningful, but the assumption that the overlap between functional groups has a tree-like structure is not valid. We plan to extend our approach to deal with networks of the second type by utilizing graph-theoretical tools developed for other specialized graph families, such as arc-intersection graphs. Another issue that requires further investigation is the presence of noise in high-throughput protein interaction networks and its effect on the Tree of Complexes representation. While our method deals to some extent with false negatives, through its edge addition procedure, the issue of false positives is not addressed. We plan to explore alternative graph modification procedures that will incorporate both false negative and false positive interactions. Methods Computing clique trees We use an elegant and efficient strategy for chordal graph recognition outlined in [19]. We use an algorithm from [25] to compute all clique trees for a chordal graph. A compact boolean representation of functional groups Recall that a functional group corresponds to a maximal clique in the modified protein interaction network, with modifications being addition of edges between every pair of weak siblings and then addition of edges that eliminate all the squares in the graph. The edge addition is such that no maximal clique in the modified graph contains an induced P4 formed entirely by the added edges. Following lemma guarantees that every functional group is a cograph and therefore admits a compact Boolean representation. Lemma For every functional group, a subgraph of the original graph induced by the members of the group contains an induced P4 if and only if the set of edges added by our algorithm contains an induced P4. Proof The argument follows from Figure 5. Indeed, (v1, v2, v3, v4) is a P4 in the original graph if and only if (v3, v1, v4, v2) is a P4 formed by the added edges. Figure 5 P4. A P4 in the subgraph induced by the members of a functional group corresponds to a P4 in the set of added edges. Solid lines correspond to the original edges and dashed lines correspond to the added edges. Edge addition We use a reduction to the Minimum Vertex Cover problem to find all minimal sets of up to k edges that eliminate all the squares in the graph. Each eliminating set of edges, S = {e1,..., er}, is assigned a cost: where sim(ei) takes values between 1.0 and 0.0, and measures our confidence in adding ei to the graph. Since the addition of ei = (ui, vi) implies an interaction or functional equivalence between proteins ui and vi, we chose sim(ei) to be the amount of overlap between the neighborhoods of ui and vi, i.e., sim(ei) = , where (vi) denotes a set of neighbors of node vi in the graph. Intuitively, sim(ei) measures how close ui and vi are to being a pair of weak siblings. If ui and vi have the same neighborhoods then sim(ei) = 1.0; as the overlap between the neighborhoods decreases, sim(ei) goes to 0.0. Then, we pick an edge set with the minimum cost from all the edge sets that do not form a P4, which is entirely contained in one of the maximal cliques of the modified graph. The last requirement is necessary to ensure that each functional group is a cograph. Reduction to the minimum vertex cover A square in a graph can be eliminated by adding one or both of its diagonals (chords) to the graph. For example, a graph in Figure 6 has two squares: (A, B, C, D) and (A, B, E, D). Note that (B, C, D, E) is not a square as one of its diagonals, (C, E), is an edge in the graph. The square (A, B, C, D) can be eliminated if either edge (A, C) or (B, D) is added to the graph. Furthermore, a single diagonal can eliminate more than one square. For example, the diagonal (B, D)eliminates both squares. We are interested in finding all minimal sets of diagonals of size up to k that eliminate all the squares in the graph. Figure 6 VC Reduction. A graph and a corresponding "square coverage graph". We reduce the above problem to the Minimum Vertex Cover problem. The squares in the original graph become edges and diagonals become vertices in the new graph. Thus the original graph is transformed to a "square coverage" graph, which in turn serves as an input to the Minimum Vertex Cover problem. In the Minimum Vertex Cover problem we are given a graph and are asked to find the smallest set of vertices that cover all the edges in the graph. An edge is covered if at least one of its end points is selected. Coming back to our example, it can be easily seen that {(B, D)} is the minimum vertex cover (Figure 6). Although the Minimum Vertex Cover problem is an NP-hard problem, if the size of the optimum solution is small an efficient algorithm can be obtained. In other words the Minimum Vertex Cover problem is fixed parameter tractable. We use an O(2k(n + m)) algorithm [26] to identify all minimal sets of edges of size up to k that eliminate all the squares in the graph. Graphs that have tree of complexes representation The COD method is not guaranteed to produce the Tree of Complexes representation for every possible protein interaction network. How can a family of graphs that admit Tree of Complexes representation be characterized? First, we argue that chordal graphs belong to this family. It can be shown that addition of edges that connect weak siblings does not introduce chordless cycles to the graph. Therefore, after all weak siblings are connected the graph is still chordal and thus admits Tree of Complexes representation. Next, we argue that cographs admit Tree of Complexes representation. Our methods eliminates all the squares in the graph, unless every possible set of eliminating edges forms a P4, which is entirely contained in one of the maximal cliques of the modified graph. It can be shown that the latter case is possible only when the original graph contains a P4, and thus is not a cograph. We conjecture that graphs that admit Tree of Complexes representation are exactly those graphs that admit a clique tree representation, with the nodes being maximal cographs rather than maximal cliques. Competing interests The author(s) declare that they have no competing interests. Authors' contributions EZ implemented the algorithm implementation, and participated in design of experiments, analysis of results, and drafting of the manuscript. KG and RJ participated in design of experiments, analysis of results, and revising of the manuscript. TP conceived the project and participated in the design of experiments, analysis of results, and drafting of the manuscript. Acknowledgements This work was supported by the intramural research program of the National Institutes of Health, the National Library of Medicine. Visualization of protein interaction networks was performed using the Pajek program [27]. The authors would like to thank Dianne P. O'Leary for constructive comments.

Document structure show

Annnotations

blinded