PMC:4331678 / 12576-16726
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"25707434-10802651-14839071","span":{"begin":787,"end":789},"obj":"10802651"},{"id":"25707434-11101803-14839072","span":{"begin":1439,"end":1441},"obj":"11101803"},{"id":"25707434-19574621-14839073","span":{"begin":1629,"end":1631},"obj":"19574621"},{"id":"25707434-11101803-14839074","span":{"begin":2530,"end":2532},"obj":"11101803"},{"id":"25707434-14990442-14839075","span":{"begin":2607,"end":2609},"obj":"14990442"},{"id":"25707434-10802651-14839076","span":{"begin":3501,"end":3503},"obj":"10802651"},{"id":"25707434-20479498-14839077","span":{"begin":3504,"end":3506},"obj":"20479498"}],"text":"Network-based prediction algorithm\nLet Wm∈ℝn×n(m∈{1,2,…,M}) be a weight matrix corresponding to the m-th individual functional association network. Each node of a network corresponds to one of the n proteins, and the entry Wi,jm≥0 is the association (similarity, or reliability of interaction) between proteins i and j in the m-th data source. Among the n proteins, the first l proteins have confirmed annotation, and the functional annotation of the remaining u = n - l proteins needs to be predicted. These annotated proteins have C distinct functions, and each annotated protein has a subset of the C functions. Each of these C functions corresponds to a Gene Ontology (GO) term in one of the three sub-branches (Biological Process, Molecular Function, Cellular Component) of the GO [30]. The functions of the i-th protein is represented as a label vector yi ∈ {0|1}C, where yic = 1 if the i-th protein is confirmed to have the c-th function, otherwise, yic = 0. For an unlabeled protein j, yjc = 0 (l \u003cj ≤ n, 1 ≤ c ≤ C). Here, we denote the predicted likelihood vector of the i-th protein as fi ∈ RC , and fic represents the likelihood that the i-th protein has the c-th function.\nLet W=∑m=1MαmWm be the association matrix of the composite network, where αm ≥ 0 is the weight assigned to the m-th network, representing its relevance towards the prediction task. Many network-based algorithms [13] extend the guilt-by-association rule [31], or exploit the cluster structure of protein complex [4] to predict protein functions using the PPI networks. Inspired by these work, we use a linear neighborhood propagation algorithm [32] on the composite network W to predict protein functions:\n(1) f = arg min f ∑ i = 1 l | | f i - y i | | 2 2 + ∑ i = 1 n | | f i - ∑ j ∈ N ( i ) W i j f j | | 2 2 s . t . ∑ j = 1 n W i j = 1\nwhere N(i) is the set of proteins that have connections with the i-th protein, 0 ≤ Wij ≤ 1 is the (i, j)-th entry of W , and I is an n × n identity matrix. The first term in Eq. (1) enforces the prediction to be close to the initial annotation of the l proteins, and it is often viewed as the empirical loss term. The minimization of the second term enforces that the functions assigned to an unlabeled protein j are determined by the functions of its connected proteins in W; as such the second term acts as a smoothness loss term [33]. Eq. (1) is motivated by the observation that interacting proteins are more likely to share similar functions [31], and two proteins with similar amino acids often have similar functions [17]. The above equation can be rewritten in matrix notation as:\n(2) F = arg min F t r ( ( F - Y ) T H ( F - Y ) ) + t r ( F T L F )\nHere, Y = [y1, y2,⋯,yn], F = [f1, f2,⋯,fn] ∈ Rn × C, H is an n × n diagonal matrix with Hii = 1 if i ≤ l, and Hii = 0 otherwise, L = (I - W )T (I - W) and I is an n × n identity matrix, tr(·) and T are the matrix trace and transpose operators, respectively. By taking the differentiation of Eq. (2) to F and setting the differentiation to zero, F can be computed as:\n(3) F = ( H + L ) - 1 H Y\nThe functional labels are organized in a hierarchy (a directed acyclic graph for GO terms, and a tree for MIPS FunCat). The more specific the functional label is in the hierarchy, the fewer member proteins this label has. If a protein has a confirmed functional label, this protein is also annotated with its ancestor labels in the hierarchy [3,30,34]. Therefore, protein function prediction is an unbalanced classification problem and to achieve a good prediction it's important to take into account this issue [3,11]. Eq. (1) ignores the unbalanced problem and considers all functional labels as equal. To address this limitation, we modify yic into y˜ic=yiclogNnc+ (N=∑c=1Cnc+,nc+ proteins annotated with the c-th function). The added factor has the effect of putting more emphasis on functional labels that are more specific. This forces the optimizer to focus on the more specific functions, versus the general ones which have more member proteins. We set Y˜=[y˜1,⋯,y˜n], and F=(H+L)-1HY˜."}