testing  

@ewha-bio:55 JSONTXT

Interpretation of Association Networks among Protein Sequence Motifs Every protein can be characterized by either a distinct motif or a combination of motifs. Nevertheless, little is known about the relationships among (more than two) the motifs. Some of the proteins in the world are share motifs for evolutional or other biological benefits - they can save energy, time and resource for controlling and managing a variety of proteins. In some cases of motifs, the tendency is quite common and they can act the ‘hub’ motif of a network of the motif associations. The hubs are structurally and functionally important in themselves and also important in disease-related mutations. They will be highly resistant mutation to conserve their functions. But, in case of the a rare mutation, mutations on the position of hub can more easily cause fatal diseases. To understand the biological activities of a cell, it is essential to characterize the functions of the proteins in that specific cell. From a particular function of a protein, there can be complex biological processes. Among many approaches for protein characterization, there is a method using the proteins’ basic building units - amino acids. As the size of protein sequence database grows, it becomes more powerful to predict proteins’ functions by sequence comparison and there have been those kinds of works. In this paper, we used the protein sequence motifs that are representative features of a protein and their functions rather than comparing whole protein sequences (we are going to use the term ‘motif as a concept that includes ‘functional domain’). Every protein can be characterized by either a distinct motif or a combination of motifs. And, to understand motif association is crucial for understanding the nature and the extent of biochemical functions (Deng etal., 2002). Moreover, motif combinations are interesting because combining with other motifs, a motif can show more complicated and specified function to a protein. Nevertheless, little is known about the relationships among (more than two) the motifs. There are several motif finding algorithms and protein motif databases such as PROSITE (Flaquet et al., 2002), PRINTS (Attwood et al., 2003), PFAM (Bateman etal., 2002) and SMART (Letunic etal., 2002) using those algorithms. In this paper, we use d the InterPro database (Mulder et al., 2003) that provides integrated and non-redundant protein-to-motif data and found association rules on the motifs using a data mining algorithm. Some of the protein in the world share motifs for evolutional or other biological benefits - they can save energy, time and resource for controlling and managing a variety of proteins. In some of the motifs, the tendency of sharing is quite common and they can act as the ‘hub’ motif for a network of motif associations. The hubs are structurally and functionally important in themselves and are also important in disease-related mutations. They are highly resistant to conserve their functions. But, in the case of a rare mutation on the position of hub can more easily cause fatal diseases. In this paper, we obtained association rules among the protein sequence motifs, and the found hub motifs from the networks that were construct from the rules. After that, we examined structures of the motif-containing proteins and searched the disease-related databases. Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. A typical example of association rule mining is market basket analysis. Let J - {ii,i2,---i m } be a set of items (Han and Kamber, 2001). Let D, the task-relevant data, be a set of database transaction where each transaction T is a set of items such that TQJ. Each transaction is associated with an identifier, called TID. Let A be a set of items. A transaction T is said to contain A if and only if AQT. An association rule is an implication of the form A=^B, where AcJ, BcJ, and AC\B= 0. The rule A=$B holds in the transaction set D with support 5, where s is the percentage of transactions in D that contain AUB (i.e., both A and B). This is taken to be the probability, P(A \JB). The rule A—>B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability, P(B\A). That is, Association rule mined using a support-confidence framework are useful for many applications. However, the support-confidence framework can be misleading in that it may identify a rule A^B as interesting when, in fact, the occurrence of A does not imply the occurrence of B. The occurrence of itemset A is independent of the occurrence of itemset B if P(ACB)= P(A)P(B); otherwise itemset A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The correlation (‘lift’ in this paper) between the occurrence of A and B can be measured by computing, Lift = Corr A,B = P(A \JB)/(P(A)*P(B)) interesting if they satisfy both a minimum support and a minimum confidence threshold. And more over, the values make it possible to obtain the non-deterministic ‘tendency’ of rules that can be opportunities for finding new information. Mining association is also different from the correlation study. Association rules are built based on pre-measured correlation and given statistical meaning with those values like support and confidence. Association rules can include all of the information in correlations with the ‘lift’ value and can reproduce the correlations of themselves. And, the association rules can provide new information about the relationships of motifs. We used the Swiss-Prot protein sequence database (v41.0) and the InterPro database (v6.0). The InterPro database has the integrated and non-redundant protein-to-motif data (Table 1). For extracting the associations among the protein motifs, we used IBM Intelligent Miner for Data. The protein-to-motif data were built from the InterPro database with IDs to DB2. And, we set the minimum threshold values both a minimum support (0.06%) and a minimum confidence (50%). Here is an association rule simplified form of the real association rules of the motif data; mk+i. Recently, there have been some approaches in biological fields with association rule mining. The subjects are protein-protein interaction, micro-array data analysis, literature-based gene identification and intra-protein motifs like ours (Horng et al., 2002). The intra-protein motif study was actually a correlation study. The association rules did not shown in the paper and there wasn’t interesting analysis about the causal correlations, neither. Knowledge discovery through the association rule mining method is quite different from the classification method. Though the rules that are generated from a decision tree can be look like the general association rules, there are more than two statistical values only for association rules. As we mentioned before, association rules are considered Symbol mi means a protein motif, and the arrow shows a subordinate relationships in math. Therefore, the symbolized association rule, -► /m +/ , has meaning; “if one or more motif(s) mi, m 2 , •••, m k are in a protein, then the motif mwwould exist in protein.” There were 141 networks that satisfied both a minimum support (0.06%) and a minimum confidence (50%) threshold. The size of network varied from group to group and the largest one had 38 motifs, the smallest one had only 2 motifs. From them we selected the largest one (with 38 motif set) as example of these networks. More than 2,400 rules were consisted that network and we picked a sub-network by choosing a herb motif randomly out for more sophisticate study. The sub-network was composed of 15 motifs and the hub of network was [IPR003593]: AAA ATPase. The number of underlining association rules in the network were 21 (Fig. 1,2). InterPro is an integrated documentation resource for protein families, domains and sites. InterPro combines number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on the well-characterized proteins to derive the protein signatures. By uniting the member databases, InterPro capitalizes on their individual strengths, producing a powerful integrated diagnostic tool (Table 2). [IPR001270], [IPR003960], [IPR001984], [IPR003439], [IPR002197], [IPR001140] with [IPR00353439], [IPR002197], [IPR001140] with [IPR003593] In InterPro motif database, there were relations among sereral motifs: PARENT/CHI LD and CONTAINS/FOUND IN. The ‘parent/child’ relationship is used to indicate true protein family/subfamily relationships. In all cases a protein sequence match to the child entry implies a match to the parent and signatures for the parent and child entries must overlap. And, the contains/found in’ relationship is used to indicate domain composition. Some domains can be found in more than one type of protein or family of proteins, but is not a subtype in the family sense. We identified four overlapped motifs: [IPR001270], [IPR003960], [IPR0019 84], [IPR003439] with [IPR0035 93]. And they represented similar function, ‘ATP-binding’ which is a kind of nucleotide. Moreover two more functional links were identified: [IPR002197], [IPR001140] with [IPR003593]. [IPR002197] stands for helix-turn-helix, which binds to DNA. And [IPR001140] is ABC transporter trans­ membrane region motif. [IPR003593] is an integrated motif from the SMART database to InterPro. It indicates [SM00382] in the SMART motif database. From the structural information in SMART, we choose a structure (PDB id: 1BMF) that is obtained from an ATPase of bovine. We mapped [IPR003593] ([SM000382]) into that structure using NCBI Cn3D tool and got the site of that hub motif (Fig.3). SMART database also provides OMIM curated human diseases associated with missense mutations within the AAA domain (IPR003593). There were 12 human proteins and 28 OMIM entries that are related to this hub motif (Table 3). As we mentioned, the proteins in the world share motifs for evolutional or other biological benefits - they can save energy, time and resource for controlling and managing a variety of proteins. The hubs are structurally and functionally important in themselves and are also important in disease-related mutations. They are highly resistant to mutation to conserve their functions. But in the case of a rare mutation, on the position of hub can more easily cause fatal diseases. From an example of network, we knew that there could be several ‘hub’ motifs. The hub motifs act ‘hubs’ in the computer network system and are important in the aspect of structure and function of the proteins. As we have shown, the mutations (ex. missense mutation in OMIM) in these hub motifs can cause serious biological malfunction or disease to whole organism. The major problem of this kind of network building is the redundancy control. The traditional association rule mining framework produces many redundant rules. So, we are trying to solve this problem and are planning to build non- redundant networks among motifs using that solution. One of the other problems, there are reasonable threshold values. We used a minimum support 0.06% and a minimum confidence 50% threshold. However, various threshold values can be set according to the experimental purpose and many other forms of networks with different size and components. This work was supported by Ministry of Science and Technology R&D Program. We would like to thank IBM Shard University Research (SUR) program, and CHUNG MoonSoul Center for Bio-Information and Bio-Electronics for providing computing facility used in this work.

Annnotations TAB TSV DIC JSON TextAE Lectin_function IAV-Glycan

  • Denotations: 0
  • Blocks: 0
  • Relations: 0