PMC:1764415 / 31953-43942
Annotations
2_test
{"project":"2_test","denotations":[{"id":"17147822-14517352-1693556","span":{"begin":10024,"end":10026},"obj":"14517352"},{"id":"17147822-12711690-1693557","span":{"begin":10044,"end":10045},"obj":"12711690"},{"id":"17147822-14566057-1693558","span":{"begin":10058,"end":10060},"obj":"14566057"},{"id":"17147822-12060727-1693559","span":{"begin":10098,"end":10100},"obj":"12060727"}],"text":"2.3 Clustering performance analysis\nExperimentally, we performed STM algorithm on the yeast PPI data set using various merge threshold values to find the best threshold value for each data set. Experiments using 0.5,1.0,1.5, 2.0, 2.5, and 3.0 as the merge threshold were performed on each data set. The results show that when the merge threshold is less than 1.0, clusters that do not have substantial similarity are merged; and when the merge threshold is greater than 1.5, merging seldom occurred. There is minimal performance difference when the values between 1.0 and 1.5 are used. The experiment when 1.0 is used as the merge threshold showed the best performance.\n\n2.3.1 Cluster analysis\n555 preliminary clusters are obtained from the yeast PPI network and merged using 1.0 as the merge threshold. In Table 3, all 60 clusters that have more than 4 proteins are listed, and it also shows their topological characteristics and their assigned molecular functions from MIPS functional categories. To facilitate critical assessments, the percentage of proteins that are in concordance with the major assigned function (hits), the discordant proteins (misses) and unknowns are also indicated. Among these 60 clusters, the largest one contains 210 proteins and the smallest one contains 5 in them. On average, we have 40.1 proteins in a cluster, and the average density of the subgraphs of the clusters extracted from the PPI network is 0.2145. 
The -log p values of the major function identified in each cluster is also shown and these values provide a measure of the relative enrichment of a cluster for a given functional category: higher values of -log p indicate greater enrichment. The results demonstrate that the STM method can detect large but sparsely connected clusters as well as small densely connected clusters. The high values of -log p (values greater than 2.0 indicate statistical significance at α \u003c 0.01) indicate that clusters are significantly enriched for biological function and can be considered to be functional modules. As a result, our method can clearly identify larger modules that have low density but still biologically enriched as we can see from the size, the density, and the P-value of the clusters in Table 3.\nTable 3 STM clustering result on the yeast PPI dataset\nDistribution\nCluster Size Density H D U -Logp Function\n1 214 0.019 24.7 69.6 5.6 43.9 Nuclear transport\n2 188 0.015 69.1 25.0 5.8 36.4 Cell cycle and DNA processing\n3 181 0.022 22.0 72.3 5.5 17.2 Cytoplasmic and nuclear protein degradation\n4 170 0.028 46.4 42.9 10.5 31.6 Transported compounds (substrates)\n5 131 0.028 37.4 55.7 6.8 28.6 Vesicular transport (Golgi network, etc.)\n6 125 0.030 60.8 33.6 5.6 32.2 tRNA synthesis\n7 113 0.027 19.4 71.6 8.8 11.8 Actin cytoskeleton\n8 79 0.045 17.7 73.4 8.8 12.3 Homeostasis of protons\n9 78 0.033 26.9 62.8 10.2 12.5 Ribosome biogenesis\n10 76 0.041 38.1 59.2 2.6 20.2 rRNA processing\n11 72 0.030 5.6 84.7 9.7 6.2 Calcium binding\n12 68 0.064 66.1 25.0 8.8 44.5 mRNA processing\n13 61 0.041 40.9 52.4 6.5 11.5 Cytoskeleton\n14 58 0.064 72.4 27.6 0.0 37.4 General transcription activities\n15 53 0.048 15.0 71.6 13.2 7.9 MAPKKK cascade\n16 50 0.064 66.0 32.0 2.0 33.5 rRNA processing\n17 45 0.055 24.4 73.3 2.2 11.1 Metabolism of energy reserves\n18 44 0.058 59.0 36.3 4.5 5.1 Metabolism\n19 39 0.072 10.2 89.7 0.0 7.3 Cell-cell adhesion\n20 36 0.125 58.3 36.1 5.5 16.9 Vesicular 
transport\n21 29 0.091 55.1 44.8 0.0 8.3 Phosphate metabolism\n22 28 0.074 14.2 78.5 7.1 4.5 Lysosomal and vacuolar protein degradation\n23 27 0.119 29.6 66.6 3.7 7.3 Cytokinesis (cell division)/septum formation\n24 26 0.153 53.8 46.1 0.0 28.6 Peroxisomal transport\n25 25 0.090 28.0 68.0 4.0 4.6 Regulation of C-compound and carbohydrate utilization\n26 25 0.116 68.0 28 4.0 12.9 Cell fate\n27 22 0.151 59.0 36.3 4.5 11.4 DNA conformation modification\n28 21 0.147 76.1 19.0 4.7 23.9 Mitochondrial transport\n29 20 0.200 75.0 20.0 5.0 24.0 rRNA synthesis\n30 19 0.228 78.9 15.7 5.2 17.9 Splicing\n31 17 0.220 70.5 29.4 0.0 19.7 Microtubule cytoskeleton\n32 17 0.183 23.5 76.4 0.0 8.2 Regulation of nitrogen utilization\n33 15 0.304 86.6 13.3 0.0 31.3 Energy generation\n34 14 0.142 50.0 42.8 7.1 9.0 Small GTPase mediated signal transduction\n35 13 0.564 76.9 23.0 0.0 15.9 Mitosis\n36 13 0.358 84.6 15.4 0.0 12.4 DNA conformation modification\n37 13 0.410 69.2 23.0 7.6 17.6 3'-end processing\n38 13 0.179 61.5 30.7 7.6 6.7 DNA recombination and DNA repair\n39 12 0.196 16.6 75.0 8.3 3.9 Unspecified signal transduction\n40 12 0.363 58.3 41.6 0.0 14.7 Posttranslational modification of amino acids\n41 12 0.166 16.6 75.0 8.3 2.4 Autoproteolytic processing\n42 11 0.218 54.5 45.4 0.0 2.9 Transcriptional control\n43 11 0.200 72.7 27.2 0.0 8.2 Enzymatic activity regulation/enzyme regulator\n44 10 0.466 80.0 20.0 0.0 14.8 Translation initiation\n45 9 0.361 77.7 22.2 0.0 12.8 Translation initiation\n46 8 0.321 50.0 37.5 12.5 5.6 Metabolism of energy reserves\n47 8 0.321 75.0 25.0 0.0 9.0 Modification by ubiquitination, deubiquitination\n48 8 0.321 37.5 62.5 0.0 3.7 Mitosis\n49 7 0.333 42.8 57.1 0.0 3.5 DNA damage response\n50 7 0.333 57.1 28.5 14.2 4.1 Vacuolar transport\n51 7 0.285 28.5 71.4 0.0 4.4 Biosynthesis of serine\n52 6 0.333 50.0 33.3 16.6 2.38 Modification by phosphorylation, dephosphorylation, etc.\n53 5 0.400 100 0.0 0.0 7.0 Meiosis\n54 5 0.600 100 0.0 0.0 7.0 Vacuolar 
transport\n55 5 0.400 100 0.0 0.0 8.5 ER to Golgi transport\n56 5 0.400 20.0 40.0 40.0 1.8 cAMP mediated signal transduction\n57 5 0.500 40.0 40.0 20.0 3.1 Oxidative stress response\n58 5 0.500 80.0 20.0 0.0 4.4 Intracellular signalling\n59 5 0.600 40.0 60.0 0.0 4.2 Tetracyclic and pentacyclic triterpenes\n60 5 0.400 60.0 40.0 0.0 4.1 Mitochondrial transport\nThe first column is a cluster identifier; the Size column indicates the number of proteins in each cluster; the Density indicates the density of the cluster; the H column indicates the percentage of proteins concordant with the major function indicated in the last column; the D column indicates the percentage of proteins discordant with the major function and U column indicates percentage of proteins not assigned to any function. Figure 4 exhibits the distribution of the hit, miss, and unknown percentage of member proteins with the assigned function for each cluster in Table 3 for better understanding visually. We found that most of the proteins in a cluster have the same functions that are assigned as a main function for the cluster as shown in Figure 4.\nFigure 4 Distribution of the three classes of 60 clusters. Distribution of the three classes of 60 clusters: the hit percentage with the assigned function, discordant percentage from the assigned function, and unknown percentage.\n\n2.3.2 Comparative analysis\nThe results in Table 4 and 5 for the yeast PPI dataset show that STM generates larger clusters; the clusters identified had p-values that are 2.2 orders of magnitude or approximately 125-fold lower than Quasi clique, the best performing alternative clustering method, on biological function. The p-values for the cellular localization are also shown in the last column of Table 4 and 5. It is clear that the clusters identified by STM despite being larger have low p-values. 
Although p-values generally decrease with increasing cluster size, these decreases in p-values can occur only when the null hypothesis is false. The p-values reflect the confidence that the differences, if present, are not due to chance alone. The confidence in any given result increases when these are obtained in a larger sample and in this context. So, the dependence of p-values on sample size is intuitive. The p-values express the strength of evidence against the null hypothesis to account for both the sample size, the amount of noise in measurements. Therefore, the STM clusters have low p-values because they are enriched for function and not simply because they are larger.\nTable 4 Comparison of STM to competing clustering methods for clusters with 5 or more members\nMethod Number Size Discard(%) Function Location\nSTM 60 40.1 7.8 13.7 7.42\nMaximal clique 120 5.65 98.4 10.6 7.93\nQuasi clique 103 11.2 80.8 11.5 6.58\nSamantha 64 7.9 79.9 9.16 4.89\nMinimum cut 114 13.5 35.0 8.36 4.75\nBetweenness cut 180 10.26 21.0 8.19 4.18\nMCL 163 9.79 36.7 8.18 3.97\nComparison of STM to competing clustering methods for the yeast protein-protein interaction data set for clusters with 5 or more members. The Number column indicates the number of clusters identified by each method, the Size column indicates the average number of proteins in each cluster; the Discard% indicates the percentage of proteins not assigned to any cluster. 
The -log p values for biological function and cellular location are shown.\nTable 5 Comparison of STM to competing clustering methods for clusters with 9 or more members\nMethod Number Size Discard(%) Function Location\nSTM 45 52.4 11.5 16.8 9.01\nMaximal clique N/A N/A N/A N/A N/A\nQuasi clique 46 16.7 86.7 15.3 9.34\nSamantha 17 12.3 93.3 15.9 7.65\nMinimum cut 44 24.3 55.0 14.8 8.78\nBetweenness cut 78 14.4 50.5 11.3 6.05\nMCL 55 16.7 69.4 11.5 5.42\nComparison of STM to competing clustering methods for the yeast protein-protein interaction data set for clusters with 9 or more members. The Maximal clique does not identify clusters with 9 or more members. The footnote is the same as Table 4. Tables 4 and 5 demonstrate that STM outperforms the other existing approaches. We made a comparison with 6 other existing approaches, Maximal cliques [11], Quasi cliques [7], Samantha [23], Minimum cut [18], Betweenness cut [24], and MCL [15]. The comparison for clusters with 5 or more members is in Table 4, and for clusters with 9 or more members in Table 5. Both tables show that our signal transduction model based method generates considerably larger clusters, and that the clusters it identifies have p-values at least 2 orders of magnitude lower than those of the other methods on both function and localization categories.\nQuasi clique and Maximal clique discarded 80.8% and 98.4% of the nodes during the clustering process, even though they identified clusters with relatively high -log p values in Table 4. Quasi clique and Samantha discarded 86.7% and 93.3% of the nodes, even though they identified clusters with relatively high -log p values among clusters with 9 or more members in Table 5. Another important strength of STM is that only 7.8% of proteins are discarded when creating clusters, which is much lower than for the other approaches, which have an average discard percentage of 59%. 
The yeast PPI dataset is relatively modular, and the bottom-up approaches (e.g., maximal clique and quasi clique methods) generally outperformed the top-down approaches (exemplified by the minimum cut and betweenness cut methods) on functional enrichment as assessed by -log p. However, because bottom-up approaches are based on the connectivity of dense regions, the percentages of discarded nodes for the bottom-up methods are also higher than those of STM and the top-down approaches. We have already shown that the functional modules have fairly low density and arbitrary shapes with long diameter. Discarding those sparsely connected proteins could therefore be a costly decision that might result in the loss of important biological information. Consequently, STM is versatile and its performance on biological function and localization enrichment, cluster size, and discard rate is superior to the best of the other six methods on both data sets."}
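
The Density and -Logp columns discussed in the text above can be illustrated with a minimal Python sketch. It rests on two assumptions that this section does not state: that density is the fraction of possible protein pairs inside a cluster that interact, 2E/(n(n-1)), and that the p-value is a hypergeometric upper tail reported as -log10 p. The function names, parameter names, and toy data are hypothetical and are not taken from the STM implementation.

```python
from math import comb, log10


def subgraph_density(members, edges):
    """Fraction of possible protein pairs inside the cluster that interact."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Count each undirected edge once; skip self-loops and edges leaving the cluster.
    inside = {frozenset((u, v)) for u, v in edges
              if u != v and u in members and v in members}
    return 2 * len(inside) / (n * (n - 1))


def enrichment_neg_log_p(total, annotated, cluster_size, hits):
    """-log10 of the hypergeometric upper tail P(X >= hits).

    total        -- proteins in the whole network (assumed background)
    annotated    -- proteins in the network carrying the function
    cluster_size -- proteins in the cluster
    hits         -- proteins in the cluster carrying the function
    """
    upper = min(cluster_size, annotated)
    # Exact integer arithmetic for the tail, then a single division.
    tail = sum(comb(annotated, i) * comb(total - annotated, cluster_size - i)
               for i in range(hits, upper + 1))
    if tail == 0:
        return float("inf")  # more hits than is possible under the model
    return -log10(tail / comb(total, cluster_size))


# Toy data (hypothetical): a 5-protein cluster with 6 internal interactions.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
             ("e", "a"), ("a", "c"), ("a", "z")]
toy_density = subgraph_density("abcde", toy_edges)
toy_score = enrichment_neg_log_p(total=10, annotated=4, cluster_size=5, hits=3)
```

Under the density assumption, a 5-protein cluster with 6 internal interactions has density 0.6, which agrees with the size and density pairing reported for the 5-protein clusters listed at 0.600 in Table 3. A larger background population makes the same hit count more significant, which is why the enrichment score depends on the network size as well as on the cluster.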