Method

Network-based prediction algorithm
Let Wm∈ℝn×n(m∈{1,2,…,M}) be a weight matrix corresponding to the m-th individual functional association network. Each node of a network corresponds to one of the n proteins, and the entry Wi,jm≥0 is the association (similarity, or reliability of interaction) between proteins i and j in the m-th data source. Among the n proteins, the first l proteins have confirmed annotation, and the functional annotation of the remaining u = n - l proteins needs to be predicted. These annotated proteins have C distinct functions, and each annotated protein has a subset of the C functions. Each of these C functions corresponds to a Gene Ontology (GO) term in one of the three sub-branches (Biological Process, Molecular Function, Cellular Component) of the GO [30]. The functions of the i-th protein is represented as a label vector yi ∈ {0|1}C, where yic = 1 if the i-th protein is confirmed to have the c-th function, otherwise, yic = 0. For an unlabeled protein j, yjc = 0 (l <j ≤ n, 1 ≤ c ≤ C). Here, we denote the predicted likelihood vector of the i-th protein as fi ∈ RC , and fic represents the likelihood that the i-th protein has the c-th function.
Let W=∑m=1MαmWm be the association matrix of the composite network, where αm ≥ 0 is the weight assigned to the m-th network, representing its relevance towards the prediction task. Many network-based algorithms [13] extend the guilt-by-association rule [31], or exploit the cluster structure of protein complex [4] to predict protein functions using the PPI networks. Inspired by these work, we use a linear neighborhood propagation algorithm [32] on the composite network W to predict protein functions:
(1)  f  = arg  min  f    ∑  i = 1  l    | | f   i   - y   i   | |  2  2   + ∑  i = 1  n    | | f   i   - ∑  j ∈ N ( i  )     W  i j   f   j   | |  2  2     s . t . ∑  j = 1  n    W  i j   = 1
where N(i) is the set of proteins that have connections with the i-th protein, 0 ≤ Wij ≤ 1 is the (i, j)-th entry of W , and I is an n × n identity matrix. The first term in Eq. (1) enforces the prediction to be close to the initial annotation of the l proteins, and it is often viewed as the empirical loss term. The minimization of the second term enforces that the functions assigned to an unlabeled protein j are determined by the functions of its connected proteins in W; as such the second term acts as a smoothness loss term [33]. Eq. (1) is motivated by the observation that interacting proteins are more likely to share similar functions [31], and two proteins with similar amino acids often have similar functions [17]. The above equation can be rewritten in matrix notation as:
(2)  F = arg  min  F   t r ( ( F - Y  )   T   H ( F - Y  )   )  + t r ( F  T   L F  )
Here, Y = [y1, y2,⋯,yn], F = [f1, f2,⋯,fn] ∈ Rn × C, H is an n × n diagonal matrix with Hii = 1 if i ≤ l, and Hii = 0 otherwise, L = (I - W )T (I - W) and I is an n × n identity matrix, tr(·) and T are the matrix trace and transpose operators, respectively. By taking the differentiation of Eq. (2) to F and setting the differentiation to zero, F can be computed as:
(3)  F = ( H + L  )   - 1   H Y
The functional labels are organized in a hierarchy (a directed acyclic graph for GO terms, and a tree for MIPS FunCat). The more specific the functional label is in the hierarchy, the fewer member proteins this label has. If a protein has a confirmed functional label, this protein is also annotated with its ancestor labels in the hierarchy [3,30,34]. Therefore, protein function prediction is an unbalanced classification problem and to achieve a good prediction it's important to take into account this issue [3,11]. Eq. (1) ignores the unbalanced problem and considers all functional labels as equal. To address this limitation, we modify yic into y˜ic=yiclogNnc+ (N=∑c=1Cnc+,nc+ proteins annotated with the c-th function). The added factor has the effect of putting more emphasis on functional labels that are more specific. This forces the optimizer to focus on the more specific functions, versus the general ones which have more member proteins. We set Y˜=[y˜1,⋯,y˜n], and F=(H+L)-1HY˜.

Kernel target alignment
Given Wij=∑m=1MαmWijm and the available functional association networks {Wm}m=1M, the accuracy of protein function prediction is determined by α = [α1, α2,⋯,αM]. [24] and [35] have shown that the target aligned kernel (network) can boost the performance of kernel-based classification and regression. To compute the weights to be assigned to the M individual networks, we resort to a form of kernel-target alignment algorithm [24] as follows:
(4)  α = arg min  α   t r ( ( K - W  )   T   ( K - W  )   )    s . t . W = ∑  m = 1  M    α  m   W  m   , α  m   ≥ 0
where K∈ℝn×n is the induced target network of functional labels, defined as K=∑c=1CKc, where Kc is the c-th induced target network computed as:
K  c   ( i , j  )  = n  c  -    l      2    if  y  i c   = y  j c   = 1   n  c  +   n  c  -    l  2      if  y  i c   y  j c   = 0 & y  i c   + y  j c   = 1 & i , j ≤ 1   0 ,  otherwise
where nc- is the number of proteins which are not annotated with the c-th function. Since a functional label often has a relatively small number of member proteins nc+<nc- and nc-l2>nc+nc-l2. From the definition, the more functions two proteins have in commom, the larger the entry (corresponding to the weight of the edge between them) in the target network is. This idea was adapted to define the target network [15,16] and to reconstruct the functional association network [36]. Mostafavi et al. [15,16] set the entry (corresponding to the edge between two proteins such that one has the c-th function and the other doesn't) in the target network as -nc+nc-n2. In contrast, we set the entry as nc+nc-l2. The reason is that the available GO term annotation of proteins is incomplete, is updated regularly and suffer from a large research bias [3,21,37]. As such, yic = 0 should not be simply interpreted as if the i-th protein does not have the c-th function. Furthermore, for a to be predicted protein j, if W(i; j) is large, from the guilty by association rule, protein j is likely to share some functions with the i-th protein. By minimizing Eq. (4), we aim at crediting larger weights to the networks which consider highly similar proteins which share more functions, and smaller weights to networks which consider highly similar proteins which share fewer or no functions. By doing so, we can assign larger weights to networks that are coherent with functional labels.
Based on the fact that tr(KW) = vec(K)T vec(W), where vec(K) is the vectorization operator that stacks the columns of K on top of each other, we can rewrite Eq. (4) as a non-negative linear regression problem:
(5)  α = arg min  α   t r ( ( V  K   - V  W   α  )   T   ( V  K   - V  W   α  )   )    s . t .   α  m   ≥ 0 , 1 ≤ m ≤ M
where VK = vec(K), VW=[vec(W1),⋯,vec(WM)]∈ℝ(n×n)×M.

The unified objective function
Mostafavi et al. [15,16] first utilized Eq. (4) to determine α for the individual networks, and then applied a Gaussian Random Field classifier [33] on the composite network W to predict protein functions. However, the composite network W optimized from the target network K is not necessarily optimal for the classifier, since the optimization of the composite network and of the classifier is divided into two separate objectives. To avoid this limitation, we attempt to integrate these two objectives into a unified function as follows:
(6)  α = arg min  α   t r ( ( V  K   - ∑  m = 1  M    α  m   v   W  m    )   T   ( ( V  K   - ∑  m = 1  M    α  m   v   W  m    )   )      + λ t r ( F  T   ( I - ∑  m = 1  M    α  m   W  m    )   T   ( I - ∑  m = 1  M    α  m   W  m    )  F  )    + λ t r ( ( F - Y ˜   )   T   H ( F - Y ˜   )   )    s . t .   α  m   ≥ 0 , 1 ≤ m ≤ M
where vWm is the m-th column vector of VW. Eq. (6) combines the objectives of network-based classification and of target network alignment. Therefore, we can enforce the composite network to be coherently optimal with respect to both objectives.
The above objective function is convex with respect to F and α, respectively. Here, we introduce an EM [38] style algorithm to iteratively optimize F (or α) while keeping α (or F) constant. For a fixed α, we can obtain F directly from Eq. (2). For a fixed F, using tr(αmW) = αmtr(W) and tr(K - W) = tr(K) - tr(W), Eq. (7) can be rewritten as:
(7)  α = arg min  α   - 2 α  T   V  W  T   V  K   + α  T   V  W  T   V  K   α   + λ ( - 2 α  T   μ + α  T   Θ α  )    s . t .   α  m   ≥ 0 , 1 ≤ m ≤ M
where Θ is an M × M matrix with Θ(m1,m2)=tr(FT(Wm1)TWm2F), μ is an M × 1 vector with μm = tr(FTWmF). The other terms in Eq. (6) are constant for a fixed F and irrelevant for α, thus they are not included in Eq. (7). Taking the partial derivatives with respect to α, we can obtain the following solution:
(8)  α = ( V  W  T   V  W   + λ Θ  )   - 1   ( V  W  T   V  K   + λ μ  )
It is easy to see that, if λ = 0, only the kernel target alignment criterion is used to optimize α, and MNet is similar to SW.
The MNet algorithm is presented in Algorithm 1. Ft and αt are the computed values for F and α at the t-th iteration, maxIter and θ are the maximum number of iterations and the convergence threshold, respectively.
Algorithm 1 MNet: Integrating Multiple Networks for Protein Function Prediction
Input:
   {Wm}m=1M functional association networks, Y˜, λ, maxIter, θ
Output:
Predicted likelihood score vectors {fj}j=l+1n
1: Initialize αm1=1(1≤m≤M), and t = 1
2: while t <maxIter and |δ| >θ do
3:   t = t + 1
4:   Compute Ft using Eq. (3)
5:   Compute αt using Eq. (8)
6: δ = |αt - αt-1|
7: end while

Complexity analysis
Several components in Eq. (8) (i.e., VWTVW∈RM×M, VWTVK∈RM×1 and (Wm1)TWm2∈Rn×n) can be computed before the iterative process. The time complexity of tr(FTWF ) is O(Cn2). Θ is an M × M symmetric matrix, in each iteration there are M (M + 1)/2 elements to be computed, so the time complexity of computing Θ is O(M (M + 1) × Cn2). The complexity of matrix inverse in Eq. (3) is O(n3). To avoid large matrix inverse, iterative approximation algorithms (i.e. Conjugate Gradient) can be applied. Since the computation complexity of matrix inverse in Eq. (8), and the complexity of μ is smaller than Θ, the overall time complexity of MNet is max{O(M2TCn2), O(Tn3)}. T is the number of iterations for convergency. In practice, T is no more than 10. In our study, the association matrices of the individual networks and the composite network are all sparse with O(n) non-zero elements. For this reason, the time complexity of the above operations can be greatly reduced. The main difference between MNet and ProMK is that ProMK just uses μ, so its time complexity is max{O(MTCn2), O(Tn3)}. Given that the number of individual networks is much smaller than the number of proteins and functional labels, MNet has similar complexity with ProMK.