Results and discussion Yeast, human, mouse and fly datasets We evaluate our methodology on benchmark networks of four datasets obtained from the study by [16], which cover four species: yeast, human, mouse, and fly. The Yeast dataset includes 44 functional association networks, the Human dataset includes 8 networks, the Mouse dataset consists of 10 networks, and the Fly dataset has 38 networks. These datasets are publicly available at http://morrislab.med.utoronto.ca/~sara/SW/, and more information about them can be found in [16]. We annotated the proteins in each dataset using the recently updated GO term annotation (access date: 2014-05-13) in three sub-ontologies, namely biological process (BP) functions, molecular functions (MF), and cellular component (CC) functions, respectively. Each protein is also annotated with its ancestral function labels. As suggested by Pandey et al. [19], the functional labels which have too few member proteins are not likely to be testable in the wet lab, and thus of no interest to biologists. We retained the function labels which have at least 10 member proteins. In addition, we removed the functional labels that have more than 300 member proteins: these functional labels are too general and their prediction is not as critical as for the others [25,39]. The statistic of these datasets is given in Table 1. In the table, the BP labels are the biological process functions (or terms), the MF labels are the molecular functions, and the CC labels are the cellular component functions in the Gene Ontology. Table 1 Dataset statistics. Dataset #Proteins #Networks #BPs #MFs #CCs Yeast 3904 44 1089 307 224 Human 13281 8 3413 681 438 Mouse 21603 10 4123 818 511 Fly 13562 38 1883 481 315 '#Proteins' represents the number of proteins in a dataset, '#Networks' means the number of functional association networks, '#BPs' denotes the number of BP labels, '#MFs' denotes the number of MF labels, and '#CCs' denotes the number of CC labels. Comparing algorithms and evaluation metrics We compared our proposed MNet with other related algorithms: ProMK [25], SW [16], OMG [27], LIG [28], and MSkNN [7]. MSkNN first trains a weighted majority vote [31] classifier (similar to a weighted kNN) on each individual network, and then integrates these classifiers for protein function prediction; it achieves competent performance on the first large-scale community based critical assessment of protein function annotation [2]. The details of the other comparing methods were introduced in the section of Related Work, and their parameter setting is discussed in the Additional File 1. The quality of protein function prediction can be evaluated according to different criteria, and the choice of evaluation metrics differentially affects different prediction algorithms [2]. For a fair and comprehensive comparison, five evaluation metrics are used in this paper, namely MacroF1, MicroF1, Fmax, function-wise Area Under the Curve (fAUC ), and protein-wise AUC (pAUC ). These evaluation metrics are extensively applied to evaluate the performance of multilabel learning algorithms and protein function prediction [2,7,25,40]. More information about these evaluation metrics is provided in the Additional File 1. For an evaluation metric, since there are more than hundreds (or thousands) of labels for a dataset, a small performance difference between two comparing algorithms is also significant. Protein function prediction We use five-fold cross validation to investigate the performance of the algorithms in predicting protein function. More specifically, we divide each dataset into five disjoint folds. In each round, we take four folds as the training data and the remaining fold as the testing set, in which the proteins are considered as unlabeled and to be predicted. We record the results on the testing data to measure the performance. The parameters of the comparing methods are optimized via five-fold cross validation on the training data. Figure 1 gives the prediction performance of the comparing methods on the BP, CC, and MF functions of Yeast, respectively. The results on the other datasets are reported in Figures 1-3 of the Additional File 1. Figure 1 Prediction of the Biological Process (BP) functions, the Cellular Component (CC) functions, and the Molecule functions (MF) of the Yeast dataset. The groups from left to right give the prediction results with respect to the evaluation metrics MicroF1, MacroF1, Fmax, fAUC, and pAUC for the different algorithms. From the figures, we have several important observations. MNet almost always performs better than the other algorithms (including MSkNN) across all the evaluation metrics and all the three sub-ontologies (BP, CC, and MF) of GO, and the performance of the other methods fluctuate with respect to the different evaluation metrics. MNet also often outperforms MNet(λ = 0), which first uses kernel target alignment to obtain the composite network, and then applies classification on the composite network to predict protein functions. The difference between MNet and MNet(λ = 0) shows that it is important and beneficial to unify the composite network optimization with the prediction task on the composite network. MNet(λ = 0) performs better than SW in most cases, and both of them are solely based on the kernel target alignment to compute the weights on individual networks. The reason is that MNet (λ = 0) sets the weight of the edge between two proteins (such that one has the c-th function and the other currently does't) as nc+nc-l2, whereas SW sets it as -nc+nc-n2. For the evaluation metric fAUC, SW and MNet sometimes have comparable results, but SW often loses to MNet in other evaluation metrics. The reason is three-fold: (i) SW optimizes the composite network using kernel target alignment in advance, and then it performs binary classification on the composite network, whereas MNet unifies the optimization of the composite network and the network-based classifier for all the labels; (ii) SW specifies the label bias (often negative, since each label is annotated with a small number of proteins) for each binary label and MNet also sets the label bias (inversely proportional to the number of member proteins) to each binary label; (iii) fAUC is a function-centric evaluation metric and it equally averages the AUC scores of different labels, and the other evaluation metrics (i.e., Fmax and pAUC ) do not favor the binary predictor. In fact, most functional labels are only annotated with a rather small number of proteins. For this reason, we observe that the true positive rate is close to 1 in a wide range of false positive rates for a large number of functional labels. This fact also accounts for similar fAUC results of MNet and SW. Another observation is that SW often loses to other comparing methods on MacroF1 and MicroF1. There are two reasons for this behavior: (i) SW applies binary classification on the composite network, but the other comparing algorithms do network-based classification for all the labels; (ii) MicroF1 and MacroF1 are computed based on the transformed binary indicative label vectors, and the binary indicative vector is derived from the largest elements of fi for each protein (see the metric definition in the Additional File 1 for more information); the other three metrics do not apply the binary transformation of fi. MSkNN uses a classifier ensemble to integrate multiple networks, and sometimes gets comparable results to other algorithms, which take advantage of the composite network to fuse multiple networks. These results show that classifier ensembles are another effective way to fuse multiple data sources for protein function prediction. ProMK and OMG also integrate the optimization of composite network and the classifier, but they only use the loss of the classifier on the individual networks to determine the weights. LIG first utilizes soft spectral clustering to partition each input individual network into several subnetworks, and then determines the weights of these subnetworks solely based on the loss of the classifier on them. SW constructs a composite network in advance, and then train a classifier on the composite network to predict protein functions. Since it optimizes the composite network and the classifier on the composite network into two separate objectives, it often loses to other comparing algorithms. These facts support our motivation to unifying the composite network construction based on kernel target alignment and the network-based predictor optimization. Each dataset has more than thousands (or hundreds) of labels. These labels are highly unbalanced and each protein is annotated with a very small number of labels (i.e., each protein in the Human dataset on average has 13.52 BP labels and there are a total of 3413 BP labels). Since MacroF1 is more driven by the labels associated to fewer proteins, and MicroF1 is more affected by the labels associated to a larger number of proteins, the algorithms have larger values of MicroF1 than MacroF1. The difference between MNet and the other algorithms (including SW, which also considers the problem of unbalanced labels) on MacroF1 is more obvious than that on MicroF1. This observation indicates that MNet can handle the unbalanced problem much better than the other methods. The Benefit of Weighting Functional Labels Some researchers [3,11,39] suggested that protein function prediction should be addressed as an unbalanced classification algorithm. Additional experiments were conducted to investigate the benefit of using Y˜ (weighted) in place of Y (unweighted). Y˜ differentially weights the labels, and puts more emphasis on the labels that have fewer member proteins. In contrast, Y equally weights all the labels. The definition of Y and Y˜ are provided in the section of Method. We report the results of MNet using Y˜ (weighted) and Y (unweighted) in Table 2 of the Additional File 1. Table 2 Runtime (in seconds). Dataset GO MNet SW ProMK MSkNN LIG OMG Yeast BP 2256.26 151.88 72.61 16.60 938.10 65.51 CC 282.10 36.39 31.84 12.47 272.89 15.76 MF 390.10 46.07 36.83 12.42 322.11 18.97 Human BP 19923.15 120.09 628.30 42.15 10309.56 447.01 CC 1003.46 17.57 350.92 31.69 1496.33 96.61 MF 1633.55 25.42 369.92 32.62 2195.25 116.59 MNet based on Y˜ performs better than MNet based on Y , especially for the BP labels, which are more unbalanced than the CC and the MF labels. MacroF1 is more affected by the labels that contains fewer proteins, and the performance difference between MNet based on Y˜ and MNet based on Y is more obvious for MacroF1 than for the other metrics. This fact shows that MNet based on Y˜ can more accurately predict the labels with few member proteins than MNet based on Y , and explicitly considering the unbalanced problem in data integration based protein function prediction can boost the prediction accuracy. These results support our motivation to use Y˜ instead of Y. However, we point out that there is still room to handle the unbalanced label problem for protein function prediction more efficiently, and how to achieve a more efficient weighting scheme for the labels is an important future direction to pursue. Network relevance estimation Different networks present different levels of quality for protein function prediction. To investigate whether MNet can assign a large weight to a network that can produce accurate predictions, and assign a small weight to a network that poorly predicts protein functions, we recorded the results of MNet (see Eq. (1)) for individual networks and the corresponding weights (αm). We also recorded the results and the weights of SW and ProMK on individual networks. For a fair comparison and a better visualization, we scale these weights in the interval [0, 1] as follows: αm/∑i=1Mαm. Figure 2 gives the Fmax values on the eight individual networks of the Human dataset (annotated with the BP labels), and the optimized weights on these networks. The corresponding results with respect to MacroF1 and fAUC are reported in Figures 3 and 4 of the Additional File 1. We also provide the results on the Human dataset (annotated with the CC and the MF labels) in the Additional File 1 (see Figures 5 and 6). Figure 2 Network relevance estimation using MNet, SW, and ProMK on the Human dataset annotated with BP functions. For each group of bars, the left one shows the Fmax value on the individual network, and the right one gives the weight assigned to the same network. From the Figure, we can observe that all three algorithms achieve the largest Fmax value on the 6-th network, and the Fmax value on each individual network has a similar rank among the eight individual networks across the different methods, i.e., the Fmax value on the 1st network ranks second according to MNet, SW, and ProMK. MNet assigns a larger weight on the 6-th network as compared to the weights for the other networks. In contrast, neither SW nor ProMK assigns the largest weight to the 6-th network. MNet, SW, and ProMK give the smallest weight to the 8-th network, though these methods do not produce the lowest Fmax for the 8-th network. The reason is that the 8-th network produces rather large smoothness loss values as compared to those of the other networks. Since λ2 is given a large value, ProMK assigns nearly equal weights to the first 7 networks. Because the smoothness loss value on the 8-th network is much larger than for the others, ProMK assigns zero weight to the 8-th network. Note that for small λ2 values, ProMK can only use one network and produces deteriorated results (see our parameter analysis in the next subsection). The Fmax values on the first three networks progressively decrease, and the weights assigned by MNet and SW to these networks also decrease. In contrast, the weights assigned by ProMK do not follow this trend. ProMK assigns larger weights to the 2nd and 3rd networks. The Fmax values on the next three (4-th, 5-th, and 6-th) networks, as well as the weights assigned by MNet, progressively increase, but the weight assigned by SW to the 4-th network is larger than those assigned to the 5-th and 6-th networks, and the weights assigned by ProMK progressively decrease. All these three methods give a smaller Fmax value to the 7-th network than to the 6-th; both MNet and SW assign a smaller weight to the 7-th network than to the 6-th, but ProMK assigns a larger weight to the 7-th network than to the 6-th. ProMK, OMG and LIG use only the smoothness loss to assign weights to the individual networks. The smaller the value of the smoothness loss for a network is, the larger the weight assigned to it is. The value of the smoothness loss of ProMK on the 3rd network is smaller than the values of the other networks, thus ProMK assigns a weight to the 3rd network that is larger than the ones assigned to other networks. However, the value of Fmax of this network is the lowest. This conflictual scenario shows that assigning a weight to a network merely based on the smoothness loss is not always reasonable. This justifies our motivation to unifying the kernel target alignment with the loss of classifier in one objective function, and also provides evidence as for why MNet works better than the other algorithms. These observations also apply to the results provided in the Additional File 1. Another interesting observation for Figure 6 in the Additional File 1 is that MNet, SW, and ProMK give the highest Fmax value to the 1st network of the Human dataset (annotated with MF functions), instead of to the 6-th network. In the Human dataset, the 1st network is derived from protein domain composition and the 6-th is a PPI network. This observation supports the statement that different data sources have different correlation with the GO terms. Lan et al. [7] also observed that the prediction of MF functions using sequence similarity is more accurate than that based on PPI information, and the prediction of BP functions based on PPI networks is more reliable than that based on sequence similarity. Regardless of this difference for the proteins of Human annotated with MF functions, MNet shows similar trends for the weight and the Fmax values assigned to the individual networks. In contrast, neither SW nor ProMK manifests such behavior. If we take the Fmax value of an individual network as its quality, we can conclude that MNet can assign weights to the individual networks that are proportional to their quality, whereas SW and ProMK cannot. This observation also helps us understand why MNet achieves a performance that is better than that of SW and ProMK. Parameter sensitivity Some of the algorithms used for comparison need to tune several parameters, and the specification of these parameters affect the performance. These parameters and their suggested ranges are listed in Table 1 of the Additional File 1. The result of MNet depends on λ, whose purpose is to balance the kernel target alignment and the loss of the classifier on the composite network. ProMK relies on the specification of λ2 to determine the weights on individual networks. OMG needs to tune the power size r on the weights and LIG requires to set the number of subnetworks for each input network. To study the parameter sensitive of these algorithms, for MNet, we vary λ in {10-2, 10-1,⋯,105}; for ProMK, we vary λ2 in {100, 101,⋯,107}; for OMG, we vary r in {1.2, 1.5, 2, 3, 4, 5, 6}, and for LIG, we vary C in {1, 5, 10, 20, 30}. For each setting value of the parameter, we execute five-fold cross validation as in the previous experiment, and report the average result. The results of these methods on Yeast (annotated with BP functions) under different values of the parameters are reported in Figure 3. We also provide similar results (Yeast annotated with CC and BP functions, and Human annotated with BP functions) in Figures 7-9 of the Additional File 1. Figure 3 Parameter sensitivity of MNet, ProMK, OMG, and LIG on the Yeast dataset annotated with BP functions. When λ is set to a small value (i.e., λ = 10-2), a small emphasis is put on the classification task and a large stress on the kernel target alignment; as such, the results of MNet slightly deteriorate. These results also support our statement that optimizing the kernel target alignment (or composite network) does not necessarily result in the optimal composite network for classification. When λ = 1 or above, MNet has a stable performance and outperforms the other methods. This trend also justifies our motivation to unifying the kernel target alignment and the classifier on the composite network in a combined objective function. When λ2 is small, only one network can be chosen by ProMK, and therefore ProMK achieves a relatively poor result in this case. When λ2 increases to a value larger than 103, more kernels are used to construct the composite network, and the results of ProMK progressively improve and achieve stability when most of the networks are used to construct the composite one. The best setting of r for OMG is r = 1.5; for larger values, the results worsen and they become stable when r ≤ 3. As for LIG, the values C = 1 and C = 5 often give the best results, and LIG's performance sometimes fluctuates sharply for other settings of C. From these results, we can draw the conclusion that MNet can select an effective λ's value from a wide range, and MNet is less affected by the parameter selection problem than ProMK and the other competitive algorithms. Runtime analysis We also recorded the running times of MNet and the other comparing methods on the Yeast and Human datasets. The results are given in Table 2. All the methods are implemented in Matlab (R2011a 64-bit). The specification of the experiment platform is: CentOS 5.6, Intel Xeon X5650 and 32GB RAM. From Table 2, we can observe that MNet often takes more time than the other methods. As the number of functional labels reduces, the runtime cost of MNet decreases sharply. The reason is that MNet has to compute the trace norm not only for individual networks, but also for the pairwise networks (see Eq. (7)). In contrast, ProMK, OMG, and LIG only compute the trace norm for individual networks. The running time of MNet is often no more than M (the number of individual networks) times the cost of ProMK, which is consistent with our previous complexity analysis. MSkNN does not learn weights on individual networks; as such it always runs faster than the other methods. SW first applies kernel target alignment to fuse multiple networks into a composite one, and then predicts protein functions using the composite network; it often ranks second (from fastest to lowest) among the comparing methods. Both ProMK and OMG iteratively optimize the weights on individual networks; they have similar runtime costs and lose to SW and MSkNN. LIG takes more time than the other methods; sometimes it is also slower than MNet. The reason is that LIG applies time-consuming eigen-decomposition for soft spectral clustering to divide each individual network into several subnetworks, and then combines these subnetworks into a composite one for function prediction. Given the superior effectiveness of MNet, it is desirable to use MNet to integrate multiple networks for protein function prediction. However, seeking efficient and effective ways to utilize multiple networks for function prediction remains an important research direction to explore.