Background
Since the first complete genome was sequenced in 1995, more than 300 prokaryotic genomes and more than 20 eukaryotic genomes have been sequenced [1]. Discovering the functional roles of gene products after the completion of sequencing the Saccharomyces Cerevisiae genome has been in the spotlight of post-genomic era. High-throughput techniques [2-5] for protein-protein interactions (PPI) detection have attracted researchers' attention since interacting proteins are likely to serve together as a group in cellular functions [6]. In recent years, high-throughput techniques in a genomic scale such as yeast-two-hybrid, mass spectrometry, and protein chip technologies have multiplied the volume of protein interaction datasets exponentially and also have provided us a genomic level view of molecular interactions. The cumulative PPI dataset of, for example, S. Cerevisiae in DIP (Database of Interacting Proteins) [7] now lists over 4900 proteins and 18,000 interactions from over 22000 experiments; however, nearly half of the proteins remain unannotated. Effective computational systems for storage, management, visualization and analysis are necessary to cope with these fast growing complex datasets.
PPI data provide us the good opportunity to systematically analyze the structure of a large living system and also allow us to use it to understand essential principles like essentiality, genetic interactions, functions, functional modules, protein complexes and cellular pathways. Cellular functions and biochemical events are coordinately carried out by groups of proteins interacting each other in functional modules, and the modular structure of complex networks is critical to function [6,8,9]. Identifying such functional modules in PPI networks is very important for understanding the structure and function of these fundamental cellular networks. Therefore, developing an effective computational approach to identify functional modules should be highly challenging but indispensable. Clustering analysis helps us understand the topological structure of the PPI networks and relationships among its components better. And, the principal function of each cluster can be inferred from the functions of its member. Functions for unannotated members of a cluster can be predicted by the functions of other annotated members [10].
PPI adjacency matrices can be represented as graphs whose nodes represent proteins and edges represent interactions. The clustering of a PPI dataset can be thereby reduced to graph theoretical problems. But, the binary nature of the current PPI data sets imposes challenges in clustering using conventional approaches. In the maximal clique approach, clustering is reduced to identifying fully connected subgraphs in the graph [11]. To overcome the relatively high stringency imposed by the maximal clique method, the Quasi Clique [7], Molecular Complex Detection (MCODE) [12], Spirin and Mirny [11] algorithms identify densely connected subgraphs rather than fully connected ones by either optimizing an objective density function or using a density threshold. The Restricted Neighborhood Search Clustering Algorithm (RNSC) [13] and Highly Connected Subgraphs (HCS) algorithms [14] harness minimum cost edge cuts for cluster identification. The Markov Cluster Algorithm (MCL) algorithm finds clusters using iterative rounds of expansion and inflation that promote the strongly connected regions and weaken the sparsely connected regions, respectively [15]. The line graph generation approach [9] transforms the network of proteins connected by interactions into a network of connected interactions and then uses the MCL algorithm to cluster the interaction network. Samantha and Liang [16] employed a statistical approach to clustering of proteins based on the premise that a pair of proteins sharing a significantly larger number of common neighbors will have high functional similarity.
However, currently used approaches encounter challenges because the relationship between protein function and PPI is characterized by weak connectivity and unexpected topological phenomena, such as low intraconnectivity and longish shapes of actual topological shapes of MIPS functional categories [17]. In our experimental analysis, subgraphs of each functional categories in MIPS database [17] are extracted from the Yeast PPI network, and the density of each subgraph is measured by Equation 7. The density of those subgraphs is averaged about 0.0023 which is fairly lower than we expected. Most functional categories have low connectivity within them in the PPI network and the majority of members in functional categories do not have direct physical interaction with other members of the functional category they belong to. Furthermore, it is not difficult to find singletons in the extracted subgraphs of functional categories which means that some proteins do not have any interaction with other proteins in the functional category they belong to. Let the diameter of a graph be the length of the longest path among all pair shortest paths in the graph. The average diameter of the subgraphs of all MIPS functional categories is approximately 4 which is close to the average shortest paths length, 5.47, of the whole PPI network. In other words, the subgraphs of actual MIPS functional categories in the PPI network generally are not closely congregated as we expected, they have longish shapes. Due to such low density within the modules, the existing approaches have produced a number of clusters with small size and singletons and mercilessly discarded many weakly connected nodes since they can only handle the strongly connected regions. Such incompleteness of clustering is a serious drawback of the existing algorithms. Discarding the sparsely connected nodes could be a hazardous loss of important information of the PPI network.
Biological networks, including PPI networks, illustrate the biochemical relationships of components in biochemical processes. Clustering of biological networks should be able to identify clusters of any arbitrary shapes and any density if the members of a cluster share important biochemical properties from the point of view of biochemical processes. To cope with this necessity and overcome those drawbacks of existing approaches, we propose a novel strategy to statistically analyze the degree of biological and topological influence of each protein to other proteins in a PPI network. We model PPI networks as a dynamic signal transduction system (STM) and demonstrate the signal transduction behavior of the perturbation by each protein on PPI networks statistically. This behavior should also reflect the topological properties of the network. The overall signal transduction behavior function between any two proteins will be formulated to evaluate the perturbation caused by a protein on other proteins biologically and topologically in the network. STM successfully identified the clusters with bigger size, arbitrary shape, low density, and biologically more enriched than other existing approaches did.