2.3 Clustering performance analysis

Experimentally, we ran the STM algorithm on the yeast PPI data set using various merge threshold values to find the best threshold value for each data set. Experiments using 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 as the merge threshold were performed on each data set. The results show that when the merge threshold is less than 1.0, clusters without substantial similarity are merged, and when the merge threshold is greater than 1.5, merging seldom occurs. Performance differs little for values between 1.0 and 1.5. A merge threshold of 1.0 gave the best performance.

2.3.1 Cluster analysis

A total of 555 preliminary clusters were obtained from the yeast PPI network and merged using 1.0 as the merge threshold. Table 3 lists all 60 clusters that have more than 4 proteins, together with their topological characteristics and their assigned molecular functions from the MIPS functional categories. To facilitate critical assessment, the percentages of proteins that are concordant with the major assigned function (hits), discordant with it (misses), and unknown are also indicated. Among these 60 clusters, the largest contains 214 proteins and the smallest contains 5. On average, a cluster contains 40.1 proteins, and the average density of the cluster subgraphs extracted from the PPI network is 0.2145. The -log p value of the major function identified in each cluster is also shown; these values measure the relative enrichment of a cluster for a given functional category, with higher values of -log p indicating greater enrichment. The results demonstrate that the STM method can detect large but sparsely connected clusters as well as small densely connected clusters.
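For reference, the density reported for each cluster can be sketched as follows. This is a minimal illustration that assumes the usual definition for an undirected graph, 2|E| / (n(n - 1)), where n is the number of proteins in the cluster and |E| the number of interactions among them; the definition is not restated in this section, and the function name and toy edge list are hypothetical.

```python
def subgraph_density(members, edges):
    """Density of the subgraph induced by `members` (assumed: 2|E| / (n(n-1)))."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Keep each undirected interaction once; drop self-loops and
    # interactions that leave the cluster.
    induced = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in members and v in members
    }
    return 2 * len(induced) / (n * (n - 1))


# Hypothetical toy network; the protein names are placeholders,
# not entries from the yeast PPI data.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]

dense = subgraph_density(["A", "B", "C"], edges)             # fully connected triangle
sparse = subgraph_density(["A", "B", "C", "D", "E"], edges)  # triangle plus a tail
print(dense, sparse)
```

Under this definition, a large cluster needs far more interactions than a small one to reach the same density, which is one reason large modules tend to have low density values.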
The high values of -log p (values greater than 2.0 indicate statistical significance at α < 0.01) indicate that the clusters are significantly enriched for biological function and can be considered functional modules. As the size, density, and p-values of the clusters in Table 3 show, our method can clearly identify larger modules that have low density but are still biologically enriched.

Table 3 STM clustering result on the yeast PPI dataset

                         Distribution (%)
Cluster  Size  Density   H      D      U      -log p  Function
1        214   0.019     24.7   69.6   5.6    43.9    Nuclear transport
2        188   0.015     69.1   25.0   5.8    36.4    Cell cycle and DNA processing
3        181   0.022     22.0   72.3   5.5    17.2    Cytoplasmic and nuclear protein degradation
4        170   0.028     46.4   42.9   10.5   31.6    Transported compounds (substrates)
5        131   0.028     37.4   55.7   6.8    28.6    Vesicular transport (Golgi network, etc.)
6        125   0.030     60.8   33.6   5.6    32.2    tRNA synthesis
7        113   0.027     19.4   71.6   8.8    11.8    Actin cytoskeleton
8        79    0.045     17.7   73.4   8.8    12.3    Homeostasis of protons
9        78    0.033     26.9   62.8   10.2   12.5    Ribosome biogenesis
10       76    0.041     38.1   59.2   2.6    20.2    rRNA processing
11       72    0.030     5.6    84.7   9.7    6.2     Calcium binding
12       68    0.064     66.1   25.0   8.8    44.5    mRNA processing
13       61    0.041     40.9   52.4   6.5    11.5    Cytoskeleton
14       58    0.064     72.4   27.6   0.0    37.4    General transcription activities
15       53    0.048     15.0   71.6   13.2   7.9     MAPKKK cascade
16       50    0.064     66.0   32.0   2.0    33.5    rRNA processing
17       45    0.055     24.4   73.3   2.2    11.1    Metabolism of energy reserves
18       44    0.058     59.0   36.3   4.5    5.1     Metabolism
19       39    0.072     10.2   89.7   0.0    7.3     Cell-cell adhesion
20       36    0.125     58.3   36.1   5.5    16.9    Vesicular transport
21       29    0.091     55.1   44.8   0.0    8.3     Phosphate metabolism
22       28    0.074     14.2   78.5   7.1    4.5     Lysosomal and vacuolar protein degradation
23       27    0.119     29.6   66.6   3.7    7.3     Cytokinesis (cell division)/septum formation
24       26    0.153     53.8   46.1   0.0    28.6    Peroxisomal transport
25       25    0.090     28.0   68.0   4.0    4.6     Regulation of C-compound and carbohydrate utilization
26       25    0.116     68.0   28.0   4.0    12.9    Cell fate
27       22    0.151     59.0   36.3   4.5    11.4    DNA conformation modification
28       21    0.147     76.1   19.0   4.7    23.9    Mitochondrial transport
29       20    0.200     75.0   20.0   5.0    24.0    rRNA synthesis
30       19    0.228     78.9   15.7   5.2    17.9    Splicing
31       17    0.220     70.5   29.4   0.0    19.7    Microtubule cytoskeleton
32       17    0.183     23.5   76.4   0.0    8.2     Regulation of nitrogen utilization
33       15    0.304     86.6   13.3   0.0    31.3    Energy generation
34       14    0.142     50.0   42.8   7.1    9.0     Small GTPase mediated signal transduction
35       13    0.564     76.9   23.0   0.0    15.9    Mitosis
36       13    0.358     84.6   15.4   0.0    12.4    DNA conformation modification
37       13    0.410     69.2   23.0   7.6    17.6    3'-end processing
38       13    0.179     61.5   30.7   7.6    6.7     DNA recombination and DNA repair
39       12    0.196     16.6   75.0   8.3    3.9     Unspecified signal transduction
40       12    0.363     58.3   41.6   0.0    14.7    Posttranslational modification of amino acids
41       12    0.166     16.6   75.0   8.3    2.4     Autoproteolytic processing
42       11    0.218     54.5   45.4   0.0    2.9     Transcriptional control
43       11    0.200     72.7   27.2   0.0    8.2     Enzymatic activity regulation/enzyme regulator
44       10    0.466     80.0   20.0   0.0    14.8    Translation initiation
45       9     0.361     77.7   22.2   0.0    12.8    Translation initiation
46       8     0.321     50.0   37.5   12.5   5.6     Metabolism of energy reserves
47       8     0.321     75.0   25.0   0.0    9.0     Modification by ubiquitination, deubiquitination
48       8     0.321     37.5   62.5   0.0    3.7     Mitosis
49       7     0.333     42.8   57.1   0.0    3.5     DNA damage response
50       7     0.333     57.1   28.5   14.2   4.1     Vacuolar transport
51       7     0.285     28.5   71.4   0.0    4.4     Biosynthesis of serine
52       6     0.333     50.0   33.3   16.6   2.38    Modification by phosphorylation, dephosphorylation, etc.
53       5     0.400     100.0  0.0    0.0    7.0     Meiosis
54       5     0.600     100.0  0.0    0.0    7.0     Vacuolar transport
55       5     0.400     100.0  0.0    0.0    8.5     ER to Golgi transport
56       5     0.400     20.0   40.0   40.0   1.8     cAMP mediated signal transduction
57       5     0.500     40.0   40.0   20.0   3.1     Oxidative stress response
58       5     0.500     80.0   20.0   0.0    4.4     Intracellular signalling
59       5     0.600     40.0   60.0   0.0    4.2     Tetracyclic and pentacyclic triterpenes
60       5     0.400     60.0   40.0   0.0    4.1     Mitochondrial transport

The first column is a cluster identifier; the Size column indicates the number of proteins in each cluster; the Density column indicates the density of the cluster; the H column indicates the percentage of proteins concordant with the major function given in the last column; the D column indicates the percentage of proteins discordant with the major function; and the U column indicates the percentage of proteins not assigned to any function.

Figure 4 shows, for each cluster in Table 3, the distribution of member proteins that hit the assigned function, miss it, or are unknown. In most clusters, the majority of proteins share the function assigned as the main function of the cluster.

Figure 4 Distribution of the three classes of 60 clusters: the hit percentage with the assigned function, the discordant percentage from the assigned function, and the unknown percentage.

2.3.2 Comparative analysis

The results in Tables 4 and 5 for the yeast PPI dataset show that STM generates larger clusters; on biological function, the clusters identified had p-values 2.2 orders of magnitude (approximately 158-fold) lower than those of Quasi clique, the best-performing alternative clustering method. The p-values for cellular localization are shown in the last column of Tables 4 and 5. The clusters identified by STM have low p-values despite being larger.
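Because enrichment is reported as -log p, a difference between two methods converts directly into a ratio of p-values. The sketch below assumes base-10 logarithms (consistent with -log p = 2.0 corresponding to p = 0.01) and uses the Function values for STM and Quasi clique from Table 4; the helper name is hypothetical.

```python
def fold_ratio(neg_log_p_a, neg_log_p_b):
    """How many times smaller p_a is than p_b, given -log10 p for each."""
    return 10 ** (neg_log_p_a - neg_log_p_b)


# Average -log p on biological function, taken from Table 4.
stm_function = 13.7
quasi_clique_function = 11.5

orders_of_magnitude = stm_function - quasi_clique_function
ratio = fold_ratio(stm_function, quasi_clique_function)
print(f"{orders_of_magnitude:.1f} orders of magnitude, ratio of about {ratio:.0f}")
```

The same conversion applies to any pair of entries in the Function or Location columns.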
Although p-values generally decrease with increasing cluster size, these decreases can occur only when the null hypothesis is false. The p-values reflect the confidence that the differences, if present, are not due to chance alone. Confidence in any given result increases when it is obtained in a larger sample, so the dependence of p-values on sample size is intuitive. The p-values express the strength of evidence against the null hypothesis, accounting for both the sample size and the amount of noise in the measurements. Therefore, the STM clusters have low p-values because they are enriched for function and not simply because they are larger.

Table 4 Comparison of STM to competing clustering methods for clusters with 5 or more members

Method           Number  Size   Discard (%)  Function  Location
STM              60      40.1   7.8          13.7      7.42
Maximal clique   120     5.65   98.4         10.6      7.93
Quasi clique     103     11.2   80.8         11.5      6.58
Samantha         64      7.9    79.9         9.16      4.89
Minimum cut      114     13.5   35.0         8.36      4.75
Betweenness cut  180     10.26  21.0         8.19      4.18
MCL              163     9.79   36.7         8.18      3.97

Results are for the yeast protein-protein interaction data set. The Number column indicates the number of clusters identified by each method; the Size column indicates the average number of proteins in each cluster; the Discard (%) column indicates the percentage of proteins not assigned to any cluster. The Function and Location columns give the -log p values for biological function and cellular location.
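The sample-size argument made before Table 4 can be illustrated numerically. The sketch below assumes that enrichment p-values come from a one-sided hypergeometric test, a common choice for functional enrichment that is not specified in this section; the population size, annotation count, and cluster compositions are hypothetical and are not taken from the yeast data.

```python
from math import comb, log10


def enrichment_neg_log_p(population, annotated, cluster_size, hits):
    """-log10 P(X >= hits) for X ~ Hypergeometric(population, annotated, cluster_size)."""
    if hits <= 0:
        return 0.0  # P(X >= 0) = 1
    upper = min(annotated, cluster_size)
    # Number of ways to draw a cluster containing at least `hits` annotated proteins.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    if tail == 0:
        raise ValueError("the requested number of hits is impossible")
    total = comb(population, cluster_size)
    # Work with logarithms of exact integers to avoid floating-point underflow.
    return log10(total) - log10(tail)


# Hypothetical background: 4000 proteins, 400 of them (10%) carry the function.
POPULATION, ANNOTATED = 4000, 400
sizes = (10, 50, 200)

# Clusters whose hit fraction is 50% (enriched) versus 10% (background rate).
enriched = [enrichment_neg_log_p(POPULATION, ANNOTATED, n, n // 2) for n in sizes]
null_like = [enrichment_neg_log_p(POPULATION, ANNOTATED, n, n // 10) for n in sizes]

for n, e, b in zip(sizes, enriched, null_like):
    print(f"size {n:3d}: enriched -log p = {e:6.1f}, background-rate -log p = {b:4.2f}")
```

In this toy setting, clusters whose hit fraction equals the background rate keep a small -log p (below 1) at every size, whereas clusters with a 50% hit fraction gain -log p as they grow.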
Table 5 Comparison of STM to competing clustering methods for clusters with 9 or more members

Method           Number  Size   Discard (%)  Function  Location
STM              45      52.4   11.5         16.8      9.01
Maximal clique   N/A     N/A    N/A          N/A       N/A
Quasi clique     46      16.7   86.7         15.3      9.34
Samantha         17      12.3   93.3         15.9      7.65
Minimum cut      44      24.3   55.0         14.8      8.78
Betweenness cut  78      14.4   50.5         11.3      6.05
MCL              55      16.7   69.4         11.5      5.42

Results are for the yeast protein-protein interaction data set. Maximal clique does not identify clusters with 9 or more members. The column definitions are the same as in Table 4.

Tables 4 and 5 demonstrate that STM outperforms the other existing approaches. We compared STM with six existing approaches: Maximal clique [11], Quasi clique [7], Samantha [23], Minimum cut [18], Betweenness cut [24], and MCL [15]. The comparison for clusters of size more than 4 is in Table 4, and for clusters of size more than 9 in Table 5. Both tables show that our signal transduction model based method generates considerably larger clusters. On biological function, the clusters identified by our method have lower p-values (higher -log p) than those of every other method, by about 2 orders of magnitude relative to the best alternative in Table 4; on localization, their p-values are comparable to those of the best alternatives. Quasi clique and Maximal clique discarded 80.8% and 98.4% of the nodes during clustering, even though they identified clusters with relatively high -log p values in Table 4. Likewise, Quasi clique and Samantha discarded 86.7% and 93.3% of the nodes, even though they identified clusters with relatively high -log p values among the clusters of size more than 9 in Table 5. Another important strength of STM is that only 7.8% of proteins are discarded to create clusters, which is much lower than for the other approaches, whose average discard percentage is 59%.
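The 59% figure can be checked directly from the Discard(%) column of Table 4. The short sketch below averages the six competing methods; only the variable names are introduced here.

```python
# Discard percentages for the six competing methods, taken from Table 4.
discard_pct = {
    "Maximal clique": 98.4,
    "Quasi clique": 80.8,
    "Samantha": 79.9,
    "Minimum cut": 35.0,
    "Betweenness cut": 21.0,
    "MCL": 36.7,
}
stm_discard = 7.8  # STM row of Table 4

mean_other = sum(discard_pct.values()) / len(discard_pct)
lowest_other = min(discard_pct.values())

print(f"mean discard of other methods: {mean_other:.1f}%")
print(f"lowest discard among other methods: {lowest_other:.1f}%")
print(f"STM discard: {stm_discard:.1f}%")
```

The mean rounds to 59%, and STM discards fewer proteins than even the lowest of the six alternatives.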
The yeast PPI dataset is relatively modular, and the bottom-up approaches (e.g., the maximal clique and quasi clique methods) generally outperformed the top-down approaches (exemplified by the minimum cut and betweenness cut methods) on functional enrichment as assessed by -log p. However, because bottom-up approaches rely on the connectivity of dense regions, their percentages of discarded nodes are also higher than those of STM and the top-down approaches. We have already shown that functional modules can have fairly low density and arbitrary shapes with long diameters, so discarding sparsely connected proteins could be a fatal decision that results in the loss of important biological information. Consequently, STM is versatile, and its performance on biological function and localization enrichment, cluster size, and discard rate is superior to the best of the other six methods on both data sets.