Method

Network-based prediction algorithm

Let $W^m \in \mathbb{R}^{n \times n}$ ($m \in \{1, 2, \ldots, M\}$) be the weight matrix of the $m$-th individual functional association network. Each node of a network corresponds to one of the $n$ proteins, and the entry $W^m_{ij} \ge 0$ is the association (similarity, or reliability of interaction) between proteins $i$ and $j$ in the $m$-th data source. Among the $n$ proteins, the first $l$ proteins have confirmed annotations, and the functional annotations of the remaining $u = n - l$ proteins need to be predicted. The annotated proteins have $C$ distinct functions, and each annotated protein has a subset of the $C$ functions. Each of these $C$ functions corresponds to a Gene Ontology (GO) term in one of the three branches (Biological Process, Molecular Function, Cellular Component) of the GO [30]. The functions of the $i$-th protein are represented as a label vector $y_i \in \{0, 1\}^C$, where $y_{ic} = 1$ if the $i$-th protein is confirmed to have the $c$-th function, and $y_{ic} = 0$ otherwise. For an unlabeled protein $j$, $y_{jc} = 0$ ($l < j \le n$, $1 \le c \le C$).

The $M$ individual networks are integrated into a composite network:

(1) $W = \sum_{m=1}^M \alpha_m W^m$, with $\alpha_m \ge 0$ the weight of the $m$-th network.

Let $F \in \mathbb{R}^{n \times C}$ be the prediction matrix, whose $j$-th row $f_j$ stores the predicted likelihood scores of the $j$-th protein for the $C$ functions, let $\tilde{Y} = [y_1, y_2, \ldots, y_n]^T \in \{0, 1\}^{n \times C}$ be the label matrix, and let $H \in \mathbb{R}^{n \times n}$ be a diagonal matrix with $H_{ii} = 1$ if $i \le l$ and $H_{ii} = 0$ otherwise. Following the guilt-by-association rule, the predictions should be smooth with respect to the composite network and, at the same time, consistent with the known annotations:

(2) $F = \arg\min_F \operatorname{tr}\big(F^T (I - W)^T (I - W) F\big) + \operatorname{tr}\big((F - \tilde{Y})^T H (F - \tilde{Y})\big)$

The first term forces the prediction for each protein to stay close to the weighted average of the predictions of its network neighbors, and the second term penalizes deviations from the available annotations. Setting the derivative of Eq. (2) with respect to $F$ to zero gives the closed-form solution:

(3) $F = \big((I - W)^T (I - W) + H\big)^{-1} H \tilde{Y}$

Kernel target alignment

To determine the weights of the individual networks, we adopt the kernel target alignment criterion:

(4) $\alpha = \arg\min_{\alpha} \operatorname{tr}\Big(\big(K - \sum_{m=1}^M \alpha_m W^m\big)^T \big(K - \sum_{m=1}^M \alpha_m W^m\big)\Big)$, s.t. $\alpha_m \ge 0$, $1 \le m \le M$

where $K \in \mathbb{R}^{n \times n}$ is the target network derived from the available annotations: $K(i, j) = \sum_{c=1}^C k_c(i, j)$, with $k_c(i, j) = (n_c^-/l)^2$ if $y_{ic} = y_{jc} = 1$, $k_c(i, j) = (n_c^+/l)^2$ if $y_{ic} = y_{jc} = 0$, and otherwise $k_c(i, j) = n_c^+ n_c^-/l^2$. Here $n_c^+$ and $n_c^-$ denote the numbers of annotated proteins that do and do not have the $c$-th function, respectively ($n_c^+ + n_c^- = l$). From this definition, the more functions two proteins have in common, the larger the corresponding entry (the weight of the edge between them) in the target network. This idea was adapted to define the target network [15,16] and to reconstruct the functional association network [36]. Mostafavi et al. [15,16] set the entry corresponding to a pair of proteins in which one has the $c$-th function and the other does not to $-n_c^+ n_c^-/n^2$. In contrast, we set this entry to $n_c^+ n_c^-/l^2$. The reason is that the available GO annotations of proteins are incomplete, are updated regularly, and suffer from a large research bias [3,21,37]. As such, $y_{ic} = 0$ should not simply be interpreted as the $i$-th protein lacking the $c$-th function. Furthermore, for a protein $j$ whose functions are to be predicted, if $W(i, j)$ is large, then by the guilt-by-association rule protein $j$ is likely to share some functions with the $i$-th protein. By minimizing Eq. (4), we aim to assign larger weights to networks in which strongly associated proteins share many functions, and smaller weights to networks in which strongly associated proteins share few or no functions. By doing so, we assign larger weights to networks that are coherent with the functional labels.

Based on the fact that $\operatorname{tr}(K^T W) = \operatorname{vec}(K)^T \operatorname{vec}(W)$, where $\operatorname{vec}(\cdot)$ stacks the columns of a matrix on top of each other, we can rewrite Eq. (4) as a non-negative linear regression problem:

(5) $\alpha = \arg\min_{\alpha} \operatorname{tr}\big((V_K - V_W \alpha)^T (V_K - V_W \alpha)\big)$, s.t. $\alpha_m \ge 0$, $1 \le m \le M$

where $V_K = \operatorname{vec}(K)$ and $V_W = [\operatorname{vec}(W^1), \ldots, \operatorname{vec}(W^M)] \in \mathbb{R}^{n^2 \times M}$.
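To make the regression view in Eq. (5) concrete, the following sketch solves it with an off-the-shelf non-negative least squares routine. It is a minimal illustration under assumed conventions, not the authors' implementation: the function name is ours, the networks are taken as dense NumPy arrays for clarity, and the target network $K$ is assumed to be given.

```python
import numpy as np
from scipy.optimize import nnls

def align_network_weights(networks, K):
    """Eq. (5): alpha = arg min ||V_K - V_W alpha||^2  s.t. alpha >= 0.

    networks: list of M dense (n x n) association matrices W^m.
    K: dense (n x n) target network derived from the annotations.
    Returns alpha, the M non-negative combination weights.
    """
    # vec(.) stacks columns on top of each other -> Fortran ravel order.
    V_W = np.column_stack([W.ravel(order="F") for W in networks])  # (n^2, M)
    V_K = K.ravel(order="F")                                       # (n^2,)
    alpha, _residual = nnls(V_W, V_K)  # non-negative least squares
    return alpha
```

Since $V_W$ has $n^2$ rows, a dense solve like this is only practical for small $n$; with the sparse networks used in this study, one would instead precompute the $M \times M$ Gram matrix $V_W^T V_W$ and the vector $V_W^T V_K$ from sparse matrix products.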
The unified objective function

Mostafavi et al. [15,16] first utilized Eq. (4) to determine $\alpha$ for the individual networks, and then applied a Gaussian Random Field classifier [33] to the composite network $W$ to predict protein functions. However, the composite network $W$ optimized with respect to the target network $K$ is not necessarily optimal for the classifier, since the optimization of the composite network and that of the classifier are carried out as two separate objectives. To avoid this limitation, we integrate these two objectives into a unified function as follows:

(6) $(\alpha, F) = \arg\min_{\alpha, F} \operatorname{tr}\Big(\big(V_K - \sum_{m=1}^M \alpha_m v_{W^m}\big)^T \big(V_K - \sum_{m=1}^M \alpha_m v_{W^m}\big)\Big) + \lambda \operatorname{tr}\Big(F^T \big(I - \sum_{m=1}^M \alpha_m W^m\big)^T \big(I - \sum_{m=1}^M \alpha_m W^m\big) F\Big) + \lambda \operatorname{tr}\big((F - \tilde{Y})^T H (F - \tilde{Y})\big)$, s.t. $\alpha_m \ge 0$, $1 \le m \le M$

where $v_{W^m}$ is the $m$-th column vector of $V_W$. Eq. (6) combines the objectives of network-based classification and of target network alignment. Therefore, we can enforce the composite network to be coherently optimal with respect to both objectives.

The above objective function is convex with respect to $F$ and to $\alpha$ separately. Here, we introduce an EM-style [38] algorithm that iteratively optimizes $F$ (or $\alpha$) while keeping $\alpha$ (or $F$) fixed. For a fixed $\alpha$, $F$ is obtained directly from the closed-form solution in Eq. (3). For a fixed $F$, using $\operatorname{tr}(\alpha_m W) = \alpha_m \operatorname{tr}(W)$ and $\operatorname{tr}(K - W) = \operatorname{tr}(K) - \operatorname{tr}(W)$, Eq. (6) can be rewritten as:

(7) $\alpha = \arg\min_{\alpha} -2\alpha^T V_W^T V_K + \alpha^T V_W^T V_W \alpha + \lambda\big(-2\alpha^T \mu + \alpha^T \Theta \alpha\big)$, s.t. $\alpha_m \ge 0$, $1 \le m \le M$

where $\Theta$ is an $M \times M$ matrix with $\Theta(m_1, m_2) = \operatorname{tr}\big(F^T (W^{m_1})^T W^{m_2} F\big)$, and $\mu$ is an $M \times 1$ vector with $\mu_m = \operatorname{tr}(F^T W^m F)$. The other terms of Eq. (6) are constant for a fixed $F$ and irrelevant to $\alpha$, so they are not included in Eq. (7). Setting the partial derivative with respect to $\alpha$ to zero, we obtain the following solution:

(8) $\alpha = (V_W^T V_W + \lambda \Theta)^{-1} (V_W^T V_K + \lambda \mu)$

It is easy to see that, if $\lambda = 0$, only the kernel target alignment criterion is used to optimize $\alpha$, and MNet is similar to SW. The MNet algorithm is presented in Algorithm 1. $F^t$ and $\alpha^t$ are the values of $F$ and $\alpha$ computed at the $t$-th iteration; maxIter and $\theta$ are the maximum number of iterations and the convergence threshold, respectively.

Algorithm 1 MNet: Integrating Multiple Networks for Protein Function Prediction
Input: $\{W^m\}_{m=1}^M$ functional association networks, $\tilde{Y}$, $\lambda$, maxIter, $\theta$
Output: predicted likelihood score vectors $\{f_j\}_{j=l+1}^n$
1: Initialize $\alpha_m^1 = 1$ ($1 \le m \le M$), $\delta = \infty$, and $t = 1$
2: while $t \le$ maxIter and $\delta > \theta$ do
3:   $t = t + 1$
4:   Compute $F^t$ using Eq. (3)
5:   Compute $\alpha^t$ using Eq. (8)
6:   $\delta = \|\alpha^t - \alpha^{t-1}\|_1$
7: end while

Complexity analysis

Several components of Eq. (8) (i.e., $V_W^T V_W \in \mathbb{R}^{M \times M}$, $V_W^T V_K \in \mathbb{R}^{M \times 1}$, and $(W^{m_1})^T W^{m_2} \in \mathbb{R}^{n \times n}$) can be computed before the iterative process. The time complexity of $\operatorname{tr}(F^T W F)$ is $O(Cn^2)$. $\Theta$ is an $M \times M$ symmetric matrix, so $M(M+1)/2$ of its elements must be computed in each iteration, and the time complexity of computing $\Theta$ is $O(M(M+1) \times Cn^2)$. The complexity of the matrix inverse in Eq. (3) is $O(n^3)$; to avoid inverting a large matrix, iterative approximation algorithms (e.g., conjugate gradient) can be applied. Since the matrix inverted in Eq. (8) is only $M \times M$, and the cost of computing $\mu$ is smaller than that of $\Theta$, the overall time complexity of MNet is $\max\{O(M^2 TCn^2), O(Tn^3)\}$, where $T$ is the number of iterations needed for convergence. In practice, $T$ is no more than 10. In our study, the association matrices of the individual networks and of the composite network are all sparse, with $O(n)$ non-zero elements; for this reason, the time complexity of the above operations can be greatly reduced. The main difference between MNet and ProMK is that ProMK uses only $\mu$, so its time complexity is $\max\{O(MTCn^2), O(Tn^3)\}$. Given that the number of individual networks is much smaller than the number of proteins and the number of functional labels, MNet has a complexity similar to that of ProMK.
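The sketch below mirrors the alternating updates of Algorithm 1 under simplifying assumptions: dense matrices, $H$ passed as a 0/1 diagonal vector, and the unconstrained update of Eq. (8) followed by clipping at zero to keep $\alpha$ non-negative (the paper states Eq. (8) directly; the clipping is our simplification of the constrained problem in Eq. (7)). All identifiers are illustrative, not from the authors' code.

```python
import numpy as np

def mnet(networks, K, Y_tilde, H_diag, lam, max_iter=10, theta=1e-4):
    """Alternating optimization of Eq. (6) (Algorithm 1, illustrative).

    networks: list of M dense (n x n) matrices W^m;  K: target network;
    Y_tilde: (n x C) label matrix;  H_diag: (n,) diagonal of H;
    lam: the lambda trade-off parameter.
    Returns F (n x C prediction scores) and alpha (M network weights).
    """
    n, M = K.shape[0], len(networks)
    I = np.eye(n)
    V_W = np.column_stack([W.ravel(order="F") for W in networks])
    V_K = K.ravel(order="F")
    G, g = V_W.T @ V_W, V_W.T @ V_K        # fixed across iterations
    alpha = np.ones(M)                     # step 1: alpha_m^1 = 1
    for _ in range(max_iter):
        A = I - sum(a * W for a, W in zip(alpha, networks))
        # Eq. (3): F = ((I - W)^T (I - W) + H)^{-1} H Y~
        F = np.linalg.solve(A.T @ A + np.diag(H_diag),
                            H_diag[:, None] * Y_tilde)
        # Eq. (7): Theta(p, q) = tr(F^T (W^p)^T W^q F), mu_m = tr(F^T W^m F)
        WF = [W @ F for W in networks]
        Theta = np.array([[np.sum(WF[p] * WF[q]) for q in range(M)]
                          for p in range(M)])
        mu = np.array([np.sum(F * WFm) for WFm in WF])
        # Eq. (8), then clip at zero (our simplification for alpha >= 0)
        new_alpha = np.maximum(np.linalg.solve(G + lam * Theta,
                                               g + lam * mu), 0)
        delta, alpha = np.abs(new_alpha - alpha).sum(), new_alpha
        if delta < theta:                  # convergence test (step 6)
            break
    return F, alpha
```

The rows $f_{l+1}, \ldots, f_n$ of the returned $F$ are the predicted likelihood score vectors for the unannotated proteins, i.e., the output of Algorithm 1.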