Classification of PPI types

Using ARM, we discovered a total of 1,168 association rules. After the pruning stage, 157 association rules [see Additional file 3] were selected for the classification process. The numbers of rules associated with the types ENZ, nonENZ, HET and HOM are 65, 49, 19 and 24, respectively (Table 3). A total of 58 of these rules are unique, i.e. associated exclusively with a single PPI type. The remaining 99 rules are overlapping (non-unique) rules related to two or more PPI types. This distinction is of interest because unique rules appear to reflect characteristics specific to individual PPI types, whereas overlapping rules can reflect attributes shared by different interaction types or, for instance, properties that distinguish obligate from non-obligate interactions.

Table 3 The number of association rules discovered for each PPI type

Type      # of Domains(a)   # of Rules(b)   Unique Rules(c)   Overlapping Rules(d)
ENZ       49                65              34 (52.31%)       31 (47.69%)
nonENZ    47                49              16 (32.65%)       33 (67.35%)
HET       33                19              7 (36.84%)        12 (63.16%)
HOM       225               24              1 (4.17%)         23 (95.83%)
Total     354               157             58 (36.94%)       99 (63.06%)

(a) # of Domains: the number of domains in each PPI type; (b) # of Rules: the number of association rules discovered for each PPI type; (c) Unique Rules: the number of association rules associated with just one PPI type; (d) Overlapping Rules: the number of rules whose bodies are identical to those of rules in other types.

The performance of different classification methods, measured as total accuracy over 10-fold cross-validation for ARBC, is shown in Table 4. In addition, we performed classification based on the physicochemical properties of the different dom-faces (CWAR), as well as ARBC classification based on a rule profile generated using only the set of 58 unique rules (UR). Performance results for these approaches are also given in Table 4. In all these cases SVM exhibited the best performance among the classifiers studied, reaching over 99% accuracy in some cases; however, such high accuracy suggests that overfitting may be associated with the use of SVM. The other classification approaches evaluated still exhibit high accuracy, with the exception of NB. The performance they reach is comparable to that previously reported in [12], although not exactly the same instances and features were employed. We also observed no appreciable difference between the performance of ARBC and CWAR in most situations, although CWAR appears to perform slightly better than ARBC.

Table 4 Accuracy of different classification methods

Method(a)               DT      RF      KNN     SVM     NB
All data(1):
  ARBC(b)               0.924   0.968   0.943   0.999   0.476
  CWAR(c)               0.926   0.971   0.978   0.999   0.531
  UR(d)                 0.873   0.933   0.893   0.970   0.519
No SSE data(2):
  ARBC_WO_SSE(e)        0.917   0.951   0.936   0.992   0.451
  CWAR_WO_SSE(f)        0.927   0.970   0.979   0.988   0.492
  UR_WO_SSE(g)          0.800   0.850   0.800   0.890   0.483

(a) Method: classification methods Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Naive Bayes (NB); (b) ARBC: association rule based classification; (c) CWAR: classification based on physicochemical properties; (d) UR: ARBC classification using the 58 unique association rules; (e, f, g): data sets excluding SSE content from the All data(1) sets; (1) All data: data sets including SSE content; (2) No SSE data: data sets without SSE content.
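To illustrate how a rule profile of this kind can drive the classifier comparison summarized above, the sketch below builds a binary rule-by-instance matrix and evaluates the same five classifier families with cross-validation. It is a minimal illustration in Python with scikit-learn: the item names, the toy dom-face instances, the rule bodies and the rule_profile helper are assumptions introduced for the example, not the actual data or pipeline used in this study.

# Sketch: build a binary rule profile and compare classifiers with cross-validation.
# All item names, rules and data below are illustrative placeholders only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def rule_profile(instances, rules):
    # Entry (i, j) is 1 if the body of rule j is contained in instance i, else 0.
    return np.array([[int(body <= inst) for body in rules] for inst in instances])

# Toy dom-face instances: each is a set of discretized attribute items.
item_pool = ["helix_high", "sheet_high", "coil_high", "polar", "hydrophobic", "charged"]
rng = np.random.default_rng(0)
instances = [frozenset(rng.choice(item_pool, size=3, replace=False)) for _ in range(40)]
labels = np.repeat(["ENZ", "nonENZ", "HET", "HOM"], 10)  # 10 toy examples per PPI type

# Toy rule bodies standing in for mined association rules.
rules = [frozenset({"helix_high"}),
         frozenset({"polar", "hydrophobic"}),
         frozenset({"sheet_high", "charged"})]

X = rule_profile(instances, rules)

classifiers = {"DT": DecisionTreeClassifier(random_state=0),
               "RF": RandomForestClassifier(random_state=0),
               "KNN": KNeighborsClassifier(),
               "SVM": SVC(),
               "NB": GaussianNB()}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, labels, cv=10)  # 10-fold CV, as in Table 4
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

In this sketch a rule contributes a 1 to the profile whenever its body is a subset of an instance's attribute items; any discretization or matching criteria applied to the real dom-face features would take the place of this simple subset test.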
The results in Table 4 strongly suggest that ARBC performs competitively with conventional classification approaches for this task, and consequently that the use of ARBC does not entail an important loss of the information derived from ARM. The performance of ARBC using only unique rules clearly decreased for all classification methods evaluated, although accuracy remained acceptable, near or above 90% in most cases. This suggests that unique rules are influential in classifying most of the PPI types considered in our study and that overlapping rules are important for improving the accuracy of the classification task. It is important to emphasize that the aim of our research is the interpretability of the discovered rules rather than optimization of the classification task.

We further investigated the influence of SSE information on the classification of PPI types. We evaluated three data sets that exclude the secondary structure elements of proteins: ARBC_WO_SSE, CWAR_WO_SSE and UR_WO_SSE. In this case the two rule profiles contain only 135 association rules and 43 unique rules, respectively. Results for these evaluations are also shown in Table 4. In all cases the performance of the classifiers tended to decrease when SSE data was omitted, although only a slight reduction is observed for most of the classifiers evaluated. Interestingly, the largest decrease in performance occurred with UR_WO_SSE, with accuracies below 90% for all classifiers, including SVM. These results strongly suggest that SSE content in interaction sites could play an important role in discriminating between PPI types for both ARBC and CWAR. This also implies that the average confidence of rules that include SSE content information may be higher than that of rules without it. Overall, 14.01% of the rules (22 out of 157) included SSE content information, and their average confidence was 0.533 (Table 5). When we considered the 31 rules with the highest confidence (the top 20% of all rules), their average confidence was 0.642; among them, 42% (13 out of 31) contained SSE information, with an average confidence of 0.661. Thus SSE content rules were enriched among the rules exhibiting higher confidence. The same trend was seen for unique rules: while the average confidence of the 58 unique rules was 0.536, that of the 16 unique SSE rules was 0.622. We therefore infer that SSE content in interaction sites is a significant feature that permits reliable classification of interaction types.

Table 5 Analysis of SSE content rules over different subsets

Subset      # of rules   Fraction (%)   conf1(a)   # of SSE rules   conf2(b)
SSE(c)      22           14.01          0.533      -                -
TOPK(d)     31           19.75          0.642      13               0.661
Unique(e)   58           36.94          0.536      16               0.622

(a) conf1: average confidence of a rule subset; (b) conf2: average confidence of SSE content rules within a rule subset; (c) SSE: association rules encoding SSE content; (d) TOPK: rules in the top 20% by confidence; (e) Unique: unique rules.
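For completeness, subset statistics of the kind reported in Table 5 (conf1, conf2 and the fraction of SSE rules) can be computed with a short script of the form below. The rule records, confidence values and the set of SSE item labels are hypothetical placeholders chosen only to make the sketch self-contained; they do not reproduce the actual rule set.

# Sketch: Table 5-style confidence statistics over rule subsets.
# Rules, confidences and SSE item labels are hypothetical placeholders.
import numpy as np

rules = [
    {"body": {"helix_high", "polar"},   "confidence": 0.71, "unique": True},
    {"body": {"hydrophobic"},           "confidence": 0.48, "unique": False},
    {"body": {"sheet_high"},            "confidence": 0.66, "unique": True},
    {"body": {"charged", "polar"},      "confidence": 0.52, "unique": False},
    {"body": {"coil_high", "charged"},  "confidence": 0.61, "unique": False},
]
SSE_ITEMS = {"helix_high", "sheet_high", "coil_high"}  # assumed SSE content items

def is_sse_rule(rule):
    # A rule "encodes SSE content" if its body mentions any SSE item.
    return bool(rule["body"] & SSE_ITEMS)

def mean_conf(subset):
    return float(np.mean([r["confidence"] for r in subset])) if subset else float("nan")

sse_rules = [r for r in rules if is_sse_rule(r)]
top_k = sorted(rules, key=lambda r: r["confidence"], reverse=True)[:max(1, len(rules) // 5)]
unique_rules = [r for r in rules if r["unique"]]

for name, subset in [("SSE", sse_rules), ("TOPK", top_k), ("Unique", unique_rules)]:
    sse_in_subset = [r for r in subset if is_sse_rule(r)]
    print(f"{name}: n={len(subset)}, fraction={len(subset) / len(rules):.2%}, "
          f"conf1={mean_conf(subset):.3f}, "
          f"SSE rules={len(sse_in_subset)}, conf2={mean_conf(sse_in_subset):.3f}")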