Association Rule Based Classification The problem of predicting PPI types for a given complex of binary proteins is transformed into the task of assigning a pre-determined target class (i.e., homo/hetero obligate and non-obligate) using properties of interaction sites. We applied an efficient association rules based classification method (ARBC) to perform classification based on rules generated by Association Rule Mining (ARM). Previous studies [21,22] have proposed that ARBC consistently outperforms other rule-based classifiers such as decision trees. ARBC comprises three main steps: association rule generation, pruning association rules and classification based on association rules. Association rule generation In our approach we employed Association Rule Mining to discover a set of frequent patterns expressed as association rules describing the relationship between properties of PPI interaction sites and PPI types. Association rules have the form R: X → Y [c, s], where X and Y are the body and the head of the rule respectively. X and Y are disjoint predicates (X ∩ Y = ϕ). Each X and Y consists of a conjunction of distinct predicates which describe properties related to interaction sites. Note that we can consider a conjunction as a set for our purposes. In our approach, the heads of all rules Y are restricted to be one of the PPI types considered which are the target classes defined in this task. The strength of the association rules can be measured in terms of their support (s) and confidence (c). The support of a rule (X → Y) is the probability that the cases in a database contain both X and Y. The confidence of the rule is the probability that a case contains Y given that it contains X. The generation of association rules was carried out employing the APRIORI algorithm [29]. We used the 10 g Oracle Data Miner (ODM) software which implements the APRIORI algorithm to compute the type of association rules required for our ARBC approach. We set a minimum support and confidence of 3% and 25% respectively to reduce the number of association rules generated. Association mining is not directly applicable to real valued continuous data such as some of the dom-face properties we generated. Hence we used discretisation to manipulate continuous attributes before the ARM process was executed. In this process adjacent values of continuous data were binned into a finite number of intervals. Pruning association rules The number of rules generated by ARM can be very large. It is necessary to prune the set of association rules by removing redundant information in order to make the classification more efficient. Given two rules R1: X1 → Y1 and R2: X2 → Y2, we define: Definition 1. The significance of a rule: R1 is more significant than R2 if and only if either (1) conf (R1) > conf (R2) or (2) conf (R1) = conf (R2) but sup(R1) > sup(R2) or (3) R1 has fewer attributes in its left hand side than R2 ◇ Definition 2. General rule: Given two rules R1: X1 → Y1 and R2: X2 → Y2, R1 is a general rule if and only if X1 ⊆ X2 ◇ Definition 3. Overlapping rule: Given two rules R1: X1 → Y1 and R2: X2 → Y2, then R3: X1 ∨ X2 → Y1(conf (R1), sup(R1)) ∨ Y2(conf (R2), sup(R2)) is an overlapping rule if and only if X1 = X2 and Y1 ≠ Y2 ◇ If the body of a rule R1 is identical to the body of a rule R2 and the head of rule R1 is inconsistent with that of rule R2, then an overlapping rule R3 between two different PPI types can be identified. Overlapping rules can be considered as common rules between two or more PPI types. On the other hand unique rules are distinctive patterns which can be used to classify interaction sites into different PPI types. We then evaluated the following condition in order to prune the set of association rules previously generated. Given two rules R1 and R2, where R1 is a general rule w.r.t. R2, ARBC eliminate R2 if R1 has more significance than R2. Sets of unique and overlapping rules were generated with the pruning procedure used in the classification. Classification In the classification step we employed the pruned set of unique and overlapping rules to generate a rule profile consisting of an m × n matrix, where m is the number of examples (i.e. dom-faces) and n is the number of different association rules obtained after the pruning step. Each row of this matrix represents one of the dom-faces considered in our research and is associated with one of the PPI types we wish to classify. The rule profile matrix takes values of 1 or 0 depending whether the different rules are contingent or not on the respective dom-face example. A similar approach was previously employed in [30] for protein structure comparison. The rule profile matrix was generated following Algorithm 1 and then used as input to the ARBC process. Algorithm 1 Generation of a rule profile Input:   A set of rules (R1, ⋯, Rn) and       A set of training data comprising m objects (O1,⋯, Om) Output:   An m × n matrix, RProfile(i, j)(1 ≤ i ≤ m and 1 ≤ j ≤ n) Method: 1.            Sort rules in the descending order of confidence and support 2.            for each rule Rj in the descending order of the rules      for each data object Oi in the training data           find match between Oi and rule Rj                if match(Oi, Rj)                     set RProfile(i, j) = 1                else                     set RProfile(i, j) = 0           end-for      end-for We evaluated several classification techniques for this task including Decision Trees (DT), Random Forest (RF), K Nearest Neighbor (KNN), Support Vector Machines (SVM), and Naive Bayes (NB). The WEKA machine learning library [31] was used to perform these experiments. We also performed conventional classification based only on the physicochemical properties of the different dom-faces examples, without generating a set of association rules (CWAR). This was done in order to evaluate if the employment of the ARBC approach could be associated with a loss of information of some interacting complexes due, for example, to the pruning step or the discretisation of continuous value feature information. In all cases a 10 fold cross validation procedure was performed. Because the task of classification of different PPI types involves imbalanced classes (see Table 1) we utilized an over-sampling strategy, incrementing the number of instances associated with those PPI types with few examples.