2.3 Clustering performance analysis
We ran the STM algorithm on the yeast PPI data set with a range of merge threshold values to find the best threshold for each data set. Merge thresholds of 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 were tested on each data set. The results show that when the merge threshold is less than 1.0, clusters without substantial similarity are merged, and when the merge threshold is greater than 1.5, merging seldom occurs. Performance differs little for values between 1.0 and 1.5. A merge threshold of 1.0 gave the best performance.

2.3.1 Cluster analysis
A total of 555 preliminary clusters were obtained from the yeast PPI network and merged using 1.0 as the merge threshold. Table 3 lists all 60 clusters that have more than 4 proteins, together with their topological characteristics and their assigned molecular functions from the MIPS functional categories. To facilitate critical assessment, the percentages of proteins that are in concordance with the major assigned function (hits), that are discordant with it (misses), and that are unknown are also indicated. Among these 60 clusters, the largest contains 214 proteins and the smallest contains 5. On average, a cluster has 40.1 proteins, and the average density of the cluster subgraphs extracted from the PPI network is 0.2145. The -log p value of the major function identified in each cluster is also shown; it provides a measure of the relative enrichment of a cluster for a given functional category, with higher values of -log p indicating greater enrichment. The results demonstrate that the STM method can detect large but sparsely connected clusters as well as small, densely connected clusters. The high values of -log p (values greater than 2.0 indicate statistical significance at α < 0.01) indicate that the clusters are significantly enriched for biological function and can be considered functional modules. Our method can therefore identify larger modules that have low density but are still biologically enriched, as the size, density, and -log p values of the clusters in Table 3 show.
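The -log p enrichment score can be illustrated with a hypergeometric tail probability, a common choice for functional enrichment. The sketch below assumes that test and a base-10 logarithm (consistent with 2.0 corresponding to α < 0.01). The function name `neg_log_p` and the example counts are hypothetical and are not taken from Table 3.

```python
from math import comb, log10


def neg_log_p(population, annotated, cluster_size, hits):
    """-log10 of the probability of drawing at least `hits` annotated proteins.

    Assumes a hypergeometric model: `cluster_size` proteins are drawn without
    replacement from `population` proteins, `annotated` of which carry the
    function of interest.
    """
    total = comb(population, cluster_size)
    upper = min(cluster_size, annotated)
    # Sum the upper tail P(X >= hits) using exact integer counts.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return -log10(tail / total)


# Hypothetical example: 4000 proteins, 200 annotated, a 20-protein cluster.
weak = neg_log_p(4000, 200, 20, 2)     # close to what chance would give
strong = neg_log_p(4000, 200, 20, 15)  # far more hits than chance predicts
print(f"weak={weak:.2f}  strong={strong:.2f}")
```

Under these assumptions, a cluster whose hit count barely exceeds the background expectation scores well below 2.0, while a cluster dominated by one function scores far above it.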
Table 3  STM clustering result on the yeast PPI dataset
Cluster  Size  Density  H (%)  D (%)  U (%)  -log p  Function
1  214  0.019  24.7  69.6  5.6  43.9  Nuclear transport
2  188  0.015  69.1  25.0  5.8  36.4  Cell cycle and DNA processing
3  181  0.022  22.0  72.3  5.5  17.2  Cytoplasmic and nuclear protein degradation
4  170  0.028  46.4  42.9  10.5  31.6  Transported compounds (substrates)
5  131  0.028  37.4  55.7  6.8  28.6  Vesicular transport (Golgi network, etc.)
6  125  0.030  60.8  33.6  5.6  32.2  tRNA synthesis
7  113  0.027  19.4  71.6  8.8  11.8  Actin cytoskeleton
8  79  0.045  17.7  73.4  8.8  12.3  Homeostasis of protons
9  78  0.033  26.9  62.8  10.2  12.5  Ribosome biogenesis
10  76  0.041  38.1  59.2  2.6  20.2  rRNA processing
11  72  0.030  5.6  84.7  9.7  6.2  Calcium binding
12  68  0.064  66.1  25.0  8.8  44.5  mRNA processing
13  61  0.041  40.9  52.4  6.5  11.5  Cytoskeleton
14  58  0.064  72.4  27.6  0.0  37.4  General transcription activities
15  53  0.048  15.0  71.6  13.2  7.9  MAPKKK cascade
16  50  0.064  66.0  32.0  2.0  33.5  rRNA processing
17  45  0.055  24.4  73.3  2.2  11.1  Metabolism of energy reserves
18  44  0.058  59.0  36.3  4.5  5.1  Metabolism
19  39  0.072  10.2  89.7  0.0  7.3  Cell-cell adhesion
20  36  0.125  58.3  36.1  5.5  16.9  Vesicular transport
21  29  0.091  55.1  44.8  0.0  8.3  Phosphate metabolism
22  28  0.074  14.2  78.5  7.1  4.5  Lysosomal and vacuolar protein degradation
23  27  0.119  29.6  66.6  3.7  7.3  Cytokinesis (cell division)/septum formation
24  26  0.153  53.8  46.1  0.0  28.6  Peroxisomal transport
25  25  0.090  28.0  68.0  4.0  4.6  Regulation of C-compound and carbohydrate utilization
26  25  0.116  68.0  28.0  4.0  12.9  Cell fate
27  22  0.151  59.0  36.3  4.5  11.4  DNA conformation modification
28  21  0.147  76.1  19.0  4.7  23.9  Mitochondrial transport
29  20  0.200  75.0  20.0  5.0  24.0  rRNA synthesis
30  19  0.228  78.9  15.7  5.2  17.9  Splicing
31  17  0.220  70.5  29.4  0.0  19.7  Microtubule cytoskeleton
32  17  0.183  23.5  76.4  0.0  8.2  Regulation of nitrogen utilization
33  15  0.304  86.6  13.3  0.0  31.3  Energy generation
34  14  0.142  50.0  42.8  7.1  9.0  Small GTPase mediated signal transduction
35  13  0.564  76.9  23.0  0.0  15.9  Mitosis
36  13  0.358  84.6  15.4  0.0  12.4  DNA conformation modification
37  13  0.410  69.2  23.0  7.6  17.6  3'-end processing
38  13  0.179  61.5  30.7  7.6  6.7  DNA recombination and DNA repair
39  12  0.196  16.6  75.0  8.3  3.9  Unspecified signal transduction
40  12  0.363  58.3  41.6  0.0  14.7  Posttranslational modification of amino acids
41  12  0.166  16.6  75.0  8.3  2.4  Autoproteolytic processing
42  11  0.218  54.5  45.4  0.0  2.9  Transcriptional control
43  11  0.200  72.7  27.2  0.0  8.2  Enzymatic activity regulation/enzyme regulator
44  10  0.466  80.0  20.0  0.0  14.8  Translation initiation
45  9  0.361  77.7  22.2  0.0  12.8  Translation initiation
46  8  0.321  50.0  37.5  12.5  5.6  Metabolism of energy reserves
47  8  0.321  75.0  25.0  0.0  9.0  Modification by ubiquitination, deubiquitination
48  8  0.321  37.5  62.5  0.0  3.7  Mitosis
49  7  0.333  42.8  57.1  0.0  3.5  DNA damage response
50  7  0.333  57.1  28.5  14.2  4.1  Vacuolar transport
51  7  0.285  28.5  71.4  0.0  4.4  Biosynthesis of serine
52  6  0.333  50.0  33.3  16.6  2.38  Modification by phosphorylation, dephosphorylation, etc.
53  5  0.400  100  0.0  0.0  7.0  Meiosis
54  5  0.600  100  0.0  0.0  7.0  Vacuolar transport
55  5  0.400  100  0.0  0.0  8.5  ER to Golgi transport
56  5  0.400  20.0  40.0  40.0  1.8  cAMP mediated signal transduction
57  5  0.500  40.0  40.0  20.0  3.1  Oxidative stress response
58  5  0.500  80.0  20.0  0.0  4.4  Intracellular signalling
59  5  0.600  40.0  60.0  0.0  4.2  Tetracyclic and pentacyclic triterpenes
60  5  0.400  60.0  40.0  0.0  4.1  Mitochondrial transport
The first column is the cluster identifier; the Size column gives the number of proteins in each cluster; the Density column gives the density of the cluster; the H column gives the percentage of proteins concordant with the major function listed in the last column; the D column gives the percentage of proteins discordant with the major function; and the U column gives the percentage of proteins not assigned to any function.
Figure 4 shows, for each cluster in Table 3, the distribution of the hit, miss, and unknown percentages of member proteins with respect to the assigned function. As Figure 4 shows, most of the proteins in a cluster share the function assigned as the main function of that cluster.
Figure 4  Distribution of the three classes of 60 clusters: the hit percentage with the assigned function, the discordant percentage from the assigned function, and the unknown percentage.
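As a rough illustration of how the Size, Density, H, D, and U values could be derived for one cluster, the sketch below assumes the standard undirected graph density 2e / (n(n - 1)) and a simple annotation lookup. The function name `cluster_summary`, the input layout, and the toy proteins are hypothetical; the actual pipeline may differ, for example in how proteins with several annotations are counted.

```python
def cluster_summary(members, edges, annotations, major_function):
    """Return size, density, and hit/miss/unknown percentages for one cluster."""
    members = set(members)
    n = len(members)
    if n == 0:
        raise ValueError("cluster must contain at least one protein")

    # Keep each undirected edge once, and only if both ends lie in the cluster.
    internal = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in members and v in members
    }
    # Density: observed edges divided by the n(n - 1) / 2 possible edges.
    density = 2 * len(internal) / (n * (n - 1)) if n > 1 else 0.0

    hits = misses = unknown = 0
    for protein in members:
        functions = annotations.get(protein)
        if not functions:
            unknown += 1          # no functional annotation at all
        elif major_function in functions:
            hits += 1             # concordant with the major function
        else:
            misses += 1           # annotated, but with other functions

    return {
        "size": n,
        "density": density,
        "H": 100.0 * hits / n,
        "D": 100.0 * misses / n,
        "U": 100.0 * unknown / n,
    }


# Toy cluster of five proteins connected as a chain (four internal edges).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "A"), ("E", "X")]
toy_annotations = {
    "A": {"Meiosis"},
    "B": {"Meiosis", "Mitosis"},
    "C": {"Meiosis"},
    "D": {"Mitosis"},
}
summary = cluster_summary("ABCDE", toy_edges, toy_annotations, "Meiosis")
print(summary)
```

In this toy case the duplicate edge and the edge leaving the cluster are ignored, so the five-protein chain has density 0.4, with three hits, one miss, and one unknown protein.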

2.3.2 Comparative analysis
The results in Tables 4 and 5 for the yeast PPI dataset show that STM generates larger clusters; on biological function, the clusters identified had p-values 2.2 orders of magnitude, or approximately 125-fold, lower than those of Quasi clique, the best performing alternative clustering method. The -log p values for cellular localization are shown in the last column of Tables 4 and 5. The clusters identified by STM have low p-values despite being larger. Although p-values generally decrease with increasing cluster size, such decreases can occur only when the null hypothesis is false. A p-value reflects the confidence that the differences, if present, are not due to chance alone, and confidence in any given result increases when it is obtained in a larger sample. In this context, the dependence of p-values on sample size is intuitive: a p-value expresses the strength of evidence against the null hypothesis, accounting for both the sample size and the amount of noise in the measurements. Therefore, the STM clusters have low p-values because they are enriched for function and not simply because they are larger.
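The argument that size alone does not produce low p-values can be checked numerically. The sketch below is a simplified illustration, not the test used in this study: it assumes a one-sided binomial tail, a hypothetical background annotation rate of 10%, and made-up cluster sizes. It compares clusters in which 60% of members carry the function with clusters whose members carry it only at the background rate.

```python
from math import comb, log10


def neg_log_p_binomial(size, hits, background):
    """-log10 of P(X >= hits) for X ~ Binomial(size, background)."""
    tail = sum(
        comb(size, k) * background**k * (1 - background) ** (size - k)
        for k in range(hits, size + 1)
    )
    return -log10(tail)


BACKGROUND = 0.10      # hypothetical share of proteins carrying the function
SIZES = [10, 50, 200]  # made-up cluster sizes

# Enriched clusters: 60% of members carry the function at every size.
enriched = [neg_log_p_binomial(n, n * 6 // 10, BACKGROUND) for n in SIZES]
# Null clusters: members carry the function at the background rate only.
null = [neg_log_p_binomial(n, n // 10, BACKGROUND) for n in SIZES]

for n, e, z in zip(SIZES, enriched, null):
    print(f"size={n:3d}  enriched -log p={e:6.1f}  null -log p={z:4.2f}")
```

Under these assumptions, -log p grows with cluster size only for the enriched clusters; for the null clusters it stays well below 2.0 at every size.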
Table 4  Comparison of STM to competing clustering methods for clusters with 5 or more members
Method  Number  Size  Discard (%)  Function (-log p)  Location (-log p)
STM  60  40.1  7.8  13.7  7.42
Maximal clique  120  5.65  98.4  10.6  7.93
Quasi clique  103  11.2  80.8  11.5  6.58
Samantha  64  7.9  79.9  9.16  4.89
Minimum cut  114  13.5  35.0  8.36  4.75
Betweenness cut  180  10.26  21.0  8.19  4.18
MCL  163  9.79  36.7  8.18  3.97
Comparison of STM to competing clustering methods for the yeast protein-protein interaction data set for clusters with 5 or more members. The Number column indicates the number of clusters identified by each method, the Size column indicates the average number of proteins in each cluster; the Discard% indicates the percentage of proteins not assigned to any cluster. The -log p values for biological function and cellular location are shown.
Table 5  Comparison of STM to competing clustering methods for clusters with 9 or more members
Method  Number  Size  Discard (%)  Function (-log p)  Location (-log p)
STM  45  52.4  11.5  16.8  9.01
Maximal clique  N/A  N/A  N/A  N/A  N/A
Quasi clique  46  16.7  86.7  15.3  9.34
Samantha  17  12.3  93.3  15.9  7.65
Minimum cut  44  24.3  55.0  14.8  8.78
Betweenness cut  78  14.4  50.5  11.3  6.05
MCL  55  16.7  69.4  11.5  5.42
Comparison of STM to competing clustering methods for the yeast protein-protein interaction data set for clusters with 9 or more members. Maximal clique does not identify clusters with 9 or more members. The column definitions are the same as in Table 4.
Tables 4 and 5 demonstrate that STM outperforms the other existing approaches. We compared STM with six existing approaches: Maximal clique [11], Quasi clique [7], Samantha [23], Minimum cut [18], Betweenness cut [24], and MCL [15]. The comparison for clusters of size more than 4 is in Table 4, and for clusters of size more than 9 in Table 5. Both tables show that our signal transduction model based method generates considerably larger clusters, and that the clusters identified by our method have p-values at least 2 orders of magnitude lower than those of the other methods on both function and localization categories.
Quasi clique and Maximal clique discarded 80.8% and 98.4% of the nodes during the clustering process, even though they identified clusters with relatively high -log p values in Table 4. Quasi clique and Samantha discarded 86.7% and 93.3% of the nodes, even though they identified clusters with relatively high -log p values among the clusters of size more than 9 in Table 5. Another important strength of STM is that only 7.8% of the proteins are discarded when creating clusters, far fewer than with the other approaches, which discard 59% on average. The yeast PPI dataset is relatively modular, and the bottom-up approaches (e.g., the maximal clique and quasi clique methods) generally outperformed the top-down approaches (exemplified by the minimum cut and betweenness cut methods) on functional enrichment as assessed by -log p. However, because bottom-up approaches are based on the connectivity of dense regions, they also discard a higher percentage of nodes than STM and the top-down approaches. We have already shown that functional modules have fairly low density and arbitrary shapes with long diameters, so discarding sparsely connected proteins could be a fatal decision that results in the loss of important biological information. Consequently, STM is versatile, and its performance on biological function and localization enrichment, cluster size, and discard rate is superior to the best of the other six methods on both data sets.
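The summary figures quoted from Table 4 can be reproduced with a few lines of arithmetic. The sketch below transcribes the Discard (%) and Function (-log p) columns of Table 4; the variable names are illustrative only.

```python
# (discard %, -log p for biological function), transcribed from Table 4.
table4 = {
    "STM": (7.8, 13.7),
    "Maximal clique": (98.4, 10.6),
    "Quasi clique": (80.8, 11.5),
    "Samantha": (79.9, 9.16),
    "Minimum cut": (35.0, 8.36),
    "Betweenness cut": (21.0, 8.19),
    "MCL": (36.7, 8.18),
}

others = {method: row for method, row in table4.items() if method != "STM"}

# Average discard percentage of the six alternative methods.
mean_discard = sum(discard for discard, _ in others.values()) / len(others)

# Best alternative on biological function, and STM's margin over it in -log p.
best_other = max(others, key=lambda method: others[method][1])
gap = table4["STM"][1] - others[best_other][1]

print(f"mean discard of other methods: {mean_discard:.1f}%")  # 58.6%
print(f"best alternative: {best_other}, gap in -log p: {gap:.1f}")  # Quasi clique, 2.2
```

The average of 58.6% rounds to the 59% quoted above, and the 2.2 gap in -log p over Quasi clique matches the difference in orders of magnitude reported for biological function.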