PMC:4331678 / 26964-34812
Annotations
Source: https://pubannotation.org/docs/sourcedb/PMC/sourceid/4331678 (PMC: https://www.ncbi.nlm.nih.gov/pmc/4331678)

Protein function prediction

We use five-fold cross validation to investigate the performance of the algorithms in predicting protein function. More specifically, we divide each dataset into five disjoint folds. In each round, we take four folds as the training data and the remaining fold as the testing set, in which the proteins are treated as unlabeled and to be predicted. We record the results on the testing data to measure the performance. The parameters of the comparing methods are optimized via five-fold cross validation on the training data. Figure 1 gives the prediction performance of the comparing methods on the BP, CC, and MF functions of Yeast. The results on the other datasets are reported in Figures 1-3 of the Additional File 1.

Figure 1. Prediction of the Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) functions of the Yeast dataset. The groups from left to right give the prediction results with respect to the evaluation metrics MicroF1, MacroF1, Fmax, fAUC, and pAUC for the different algorithms.

From the figures, we have several important observations. MNet almost always performs better than the other algorithms (including MSkNN) across all the evaluation metrics and all three sub-ontologies (BP, CC, and MF) of GO, whereas the performance of the other methods fluctuates with respect to the different evaluation metrics. MNet also often outperforms MNet(λ = 0), which first uses kernel target alignment to obtain the composite network, and then applies classification on the composite network to predict protein functions. The difference between MNet and MNet(λ = 0) shows that it is important and beneficial to unify the composite network optimization with the prediction task on the composite network.
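The five-fold protocol described at the start of this section can be sketched as follows (a generic illustration with made-up protein IDs, not the authors' code):

```python
import random

def five_fold_splits(items, seed=0):
    """Partition items into five disjoint folds; each round uses one fold
    as the test set (treated as unlabeled) and the other four for training."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test

proteins = [f"protein_{i}" for i in range(20)]
for train, test in five_fold_splits(proteins):
    assert len(train) == 16 and len(test) == 4
    assert set(train) | set(test) == set(proteins)
```

Parameters would then be tuned by running the same splitting procedure again inside each training portion, as the text describes.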
MNet(λ = 0) performs better than SW in most cases, although both of them rely solely on kernel target alignment to compute the weights of the individual networks. The reason is that MNet(λ = 0) sets the weight of the edge between two proteins (such that one has the c-th function and the other currently doesn't) as n_c^+ n_c^- / l^2, whereas SW sets it as -n_c^+ n_c^- / n^2. For the evaluation metric fAUC, SW and MNet sometimes have comparable results, but SW often loses to MNet on the other evaluation metrics. The reason is three-fold: (i) SW optimizes the composite network using kernel target alignment in advance, and then performs binary classification on the composite network, whereas MNet unifies the optimization of the composite network and the network-based classifier for all the labels; (ii) SW specifies the label bias (often negative, since each label is annotated with a small number of proteins) for each binary label, and MNet also sets a label bias (inversely proportional to the number of member proteins) for each binary label; (iii) fAUC is a function-centric evaluation metric that equally averages the AUC scores of the different labels, whereas the other evaluation metrics (i.e., Fmax and pAUC) do not favor the binary predictor. In fact, most functional labels are annotated with a rather small number of proteins. For this reason, we observe that the true positive rate is close to 1 over a wide range of false positive rates for a large number of functional labels. This fact also accounts for the similar fAUC results of MNet and SW.

Another observation is that SW often loses to the other comparing methods on MacroF1 and MicroF1.
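The mixed-pair edge weights above can be made concrete. The sketch below assumes a GeneMANIA-style target kernel (an assumption for illustration, not a verbatim reproduction of SW) in which proteins annotated with function c receive bias n_c^-/n and unannotated ones receive -n_c^+/n, so a mixed pair's product is -n_c^+ n_c^-/n^2, matching SW's setting as described above; MNet(λ = 0) normalizes by l^2 instead:

```python
def mixed_pair_weight(labels):
    """Target-kernel value for a (positive, negative) protein pair under a
    GeneMANIA-style bias: positives get n_minus/n, negatives get -n_plus/n,
    so the product for a mixed pair is -(n_plus * n_minus) / n**2."""
    n = len(labels)
    n_plus = sum(1 for y in labels if y)
    n_minus = n - n_plus
    pos_bias = n_minus / n
    neg_bias = -n_plus / n
    return pos_bias * neg_bias  # == -(n_plus * n_minus) / n**2

# Toy label vector for one function c: 3 annotated, 7 unannotated proteins.
labels = [True] * 3 + [False] * 7
w = mixed_pair_weight(labels)
assert abs(w - (-(3 * 7) / 100)) < 1e-12
```

The negative value reflects that a mixed pair should be pushed apart in the composite network, and its magnitude shrinks as the label becomes more unbalanced.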
Two factors explain why SW underperforms on these two metrics: (i) SW applies binary classification on the composite network, while the other comparing algorithms perform network-based classification for all the labels; (ii) MicroF1 and MacroF1 are computed from the transformed binary indicative label vectors, where the binary indicative vector is derived from the largest elements of f_i for each protein (see the metric definitions in the Additional File 1 for more information); the other three metrics do not apply the binary transformation of f_i. MSkNN uses a classifier ensemble to integrate multiple networks, and sometimes gets results comparable to the other algorithms, which take advantage of the composite network to fuse multiple networks. These results show that classifier ensembles are another effective way to fuse multiple data sources for protein function prediction.

ProMK and OMG also integrate the optimization of the composite network and the classifier, but they only use the loss of the classifier on the individual networks to determine the weights. LIG first utilizes soft spectral clustering to partition each input network into several subnetworks, and then determines the weights of these subnetworks solely based on the loss of the classifier on them. SW constructs a composite network in advance, and then trains a classifier on the composite network to predict protein functions. Since it optimizes the composite network and the classifier on the composite network as two separate objectives, it often loses to the other comparing algorithms. These facts support our motivation to unify the composite network construction based on kernel target alignment with the optimization of the network-based predictor.

Each dataset has hundreds to thousands of labels. These labels are highly unbalanced, and each protein is annotated with a very small number of them (e.g., each protein in the Human dataset has on average 13.52 BP labels, out of a total of 3413 BP labels).
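Classifier-ensemble fusion of the kind MSkNN uses can be sketched minimally as follows; the uniform-averaging rule and the function names are illustrative assumptions, not MSkNN's exact combination scheme:

```python
def ensemble_scores(per_network_scores, weights=None):
    """Fuse predictions rather than networks: average the score vectors
    produced by classifiers trained on each individual network, optionally
    weighting networks by their estimated reliability."""
    m = len(per_network_scores)
    if weights is None:
        weights = [1.0 / m] * m  # uniform weights by default
    n_labels = len(per_network_scores[0])
    return [sum(w * scores[j] for w, scores in zip(weights, per_network_scores))
            for j in range(n_labels)]

# Scores for 3 functional labels from classifiers on 2 individual networks.
fused = ensemble_scores([[0.9, 0.2, 0.4], [0.5, 0.6, 0.0]])
assert all(abs(a - b) < 1e-9 for a, b in zip(fused, [0.7, 0.4, 0.2]))
```

The contrast with MNet is that fusion happens at the prediction level here, whereas composite-network methods fuse at the data level before classification.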
Since MacroF1 is more driven by the labels associated with fewer proteins, and MicroF1 is more affected by the labels associated with a larger number of proteins, the algorithms achieve larger values of MicroF1 than MacroF1. The difference between MNet and the other algorithms (including SW, which also considers the problem of unbalanced labels) on MacroF1 is more obvious than that on MicroF1. This observation indicates that MNet handles the unbalanced-label problem much better than the other methods.

The Benefit of Weighting Functional Labels

Some researchers [3,11,39] have suggested that protein function prediction should be addressed as an unbalanced classification problem. Additional experiments were conducted to investigate the benefit of using Y˜ (weighted) in place of Y (unweighted). Y˜ differentially weights the labels, putting more emphasis on the labels that have fewer member proteins; in contrast, Y weights all the labels equally. The definitions of Y and Y˜ are provided in the Methods section. We report the results of MNet using Y˜ (weighted) and Y (unweighted) in Table 2 of the Additional File 1.

Table 2. Runtime (in seconds).

Dataset  GO  MNet      SW      ProMK   MSkNN  LIG       OMG
Yeast    BP  2256.26   151.88  72.61   16.60  938.10    65.51
         CC  282.10    36.39   31.84   12.47  272.89    15.76
         MF  390.10    46.07   36.83   12.42  322.11    18.97
Human    BP  19923.15  120.09  628.30  42.15  10309.56  447.01
         CC  1003.46   17.57   350.92  31.69  1496.33   96.61
         MF  1633.55   25.42   369.92  32.62  2195.25   116.59

MNet based on Y˜ performs better than MNet based on Y, especially for the BP labels, which are more unbalanced than the CC and MF labels. MacroF1 is more affected by the labels that contain fewer proteins, and the performance difference between MNet based on Y˜ and MNet based on Y is more obvious for MacroF1 than for the other metrics.
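The MicroF1/MacroF1 contrast can be made concrete: micro-averaging pools true positives, false positives, and false negatives over all labels, so frequent labels dominate, while macro-averaging takes an unweighted mean of per-label F1 scores, so rare, poorly predicted labels pull it down. A self-contained sketch with made-up counts:

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per label."""
    micro = f1(sum(tp for tp, _, _ in per_label_counts),
               sum(fp for _, fp, _ in per_label_counts),
               sum(fn for _, _, fn in per_label_counts))
    macro = sum(f1(*c) for c in per_label_counts) / len(per_label_counts)
    return micro, macro

# One frequent, well-predicted label and two rare, badly predicted ones.
counts = [(90, 5, 5), (1, 4, 4), (0, 3, 3)]
micro, macro = micro_macro_f1(counts)
assert micro > macro  # the frequent label dominates the micro average
```

This is exactly why a method that predicts rare labels poorly can still post a respectable MicroF1 while its MacroF1 collapses.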
These results show that MNet based on Y˜ can more accurately predict labels with few member proteins than MNet based on Y, and that explicitly accounting for the unbalanced-label problem in data-integration-based protein function prediction can boost the prediction accuracy. They support our motivation to use Y˜ instead of Y. However, we point out that there is still room to handle the unbalanced label problem for protein function prediction more effectively, and how to achieve a better weighting scheme for the labels is an important future direction to pursue.
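The label weighting behind Y˜ can be sketched as follows; the inverse-frequency rule below is an illustrative assumption consistent with the stated idea of weights "inversely proportional to the number of member proteins", not the authors' exact formula:

```python
def weighted_label_matrix(Y):
    """Y: list of per-protein binary label rows. Scale each label column
    inversely to its number of member proteins, so rare labels receive
    larger weights than frequent ones (the idea behind Y-tilde)."""
    n_labels = len(Y[0])
    counts = [sum(row[c] for row in Y) for c in range(n_labels)]
    return [[row[c] / counts[c] if counts[c] else 0.0
             for c in range(n_labels)] for row in Y]

# 4 proteins, 2 labels: label 0 has 3 member proteins, label 1 has only 1.
Y = [[1, 0], [1, 0], [1, 1], [0, 0]]
Yt = weighted_label_matrix(Y)
assert Yt[2][1] > Yt[2][0]  # the rarer label is weighted more heavily
```

Under such a scheme, errors on rare labels cost more in the objective, which is consistent with the larger MacroF1 gains reported for the weighted variant.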