Classification
In the classification step we used the pruned set of unique and overlapping rules to generate a rule profile: an m × n matrix, where m is the number of examples (i.e., dom-faces) and n is the number of distinct association rules remaining after the pruning step. Each row of this matrix represents one of the dom-faces considered in our research and is associated with one of the PPI types we wish to classify. Each entry of the rule profile matrix is 1 if the corresponding rule matches the respective dom-face example and 0 otherwise. A similar approach was previously employed in [30] for protein structure comparison. The rule profile matrix was generated following Algorithm 1 and then used as input to the ARBC process.
Algorithm 1 Generation of a rule profile
Input:   A set of rules (R1, ⋯, Rn) and
         A set of training data comprising m objects (O1, ⋯, Om)
Output:  An m × n matrix, RProfile(i, j) (1 ≤ i ≤ m and 1 ≤ j ≤ n)
Method:
1.  Sort rules in descending order of confidence and support
2.  for each rule Rj in the descending order of the rules
        for each data object Oi in the training data
            find match between Oi and rule Rj
            if match(Oi, Rj)
                set RProfile(i, j) = 1
            else
                set RProfile(i, j) = 0
        end-for
    end-for
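Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the `Rule` class, the encoding of each dom-face as a set of "feature=value" items, and the subset test used in `matches` are all assumptions, since the text does not specify how a match between an object and a rule is computed. The feature names in the toy data are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule representation: antecedent items plus quality scores."""
    antecedent: FrozenSet[str]
    confidence: float
    support: float


def matches(obj: FrozenSet[str], rule: Rule) -> bool:
    # Assumption: a rule matches when every antecedent item occurs in the object.
    return rule.antecedent <= obj


def rule_profile(
    rules: Sequence[Rule], objects: Sequence[FrozenSet[str]]
) -> Tuple[List[Rule], List[List[int]]]:
    # Step 1: sort rules by confidence, then support, both descending.
    ordered = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    # Step 2: fill the m x n binary matrix, one column per sorted rule.
    profile = [[0] * len(ordered) for _ in objects]
    for j, rule in enumerate(ordered):
        for i, obj in enumerate(objects):
            profile[i][j] = 1 if matches(obj, rule) else 0
    return ordered, profile


# Toy data (hypothetical discretised features).
rules = [
    Rule(frozenset({"asa=high"}), confidence=0.8, support=0.3),
    Rule(frozenset({"asa=high", "hydro=low"}), confidence=0.9, support=0.2),
]
objects = [
    frozenset({"asa=high", "hydro=low"}),
    frozenset({"asa=high", "hydro=high"}),
    frozenset({"asa=low"}),
]
ordered_rules, profile = rule_profile(rules, objects)
```

Each row of `profile` corresponds to one object and each column to one rule in sorted order, so the matrix can be passed directly to a classifier together with the PPI-type label of each row.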
We evaluated several classification techniques for this task, including Decision Trees (DT), Random Forest (RF), k-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Naive Bayes (NB). The WEKA machine learning library [31] was used to perform these experiments. We also performed conventional classification without association rules (CWAR), based only on the physicochemical properties of the dom-face examples. This was done to evaluate whether the ARBC approach loses information about some interacting complexes due, for example, to the pruning step or the discretisation of continuous-valued features. In all cases a 10-fold cross-validation procedure was performed. Because the classification of PPI types involves imbalanced classes (see Table 1), we used an over-sampling strategy, increasing the number of instances of the PPI types with few examples.
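The over-sampling and fold-splitting steps can be illustrated with the short sketch below. It rests on several assumptions: the text does not state which over-sampling method was used, so random duplication of minority-class instances is shown; the helper names `oversample` and `kfold_indices` are hypothetical; and the fold split here is a plain shuffled split rather than WEKA's own procedure. The text also does not say whether over-sampling was applied before or after the folds were formed.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def oversample(
    examples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List, List]:
    """Randomly duplicate minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def kfold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


# Toy imbalanced data set: 16 examples of type "A", 4 of type "B".
rows = list(range(20))
labels = ["A"] * 16 + ["B"] * 4

balanced_rows, balanced_labels = oversample(rows, labels)
class_counts = Counter(balanced_labels)
folds = kfold_indices(len(rows), k=10)
```

In a full experiment, each fold in turn would serve as the test set while a classifier is trained on the remaining nine. Applying `oversample` to the training portion only keeps duplicated instances out of the test fold.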