Results

Investigation on the interaction data
We first examined the collected data set. We measured the pairwise similarity of the selected targets with the Smith-Waterman score (Figure 3A) and found that the vast majority of similarities are low (<0.2), indicating that homology among the selected targets is weak. X-ray and other biological studies suggest that a number of proteins contain more than one ligand-binding site. For example, some enzymes possess two or more binding sites, one for the substrate and another for an activator or inhibitor. We therefore constructed a site-ligand interaction network as a bipartite graph and examined the degree distributions of both binding sites and ligands (Figure 3B and 3C). Figure 3B shows that most binding sites bind only one ligand, which is consistent with the specificity of target-ligand binding. Figure 3C shows that more than 95% of ligands interact with only one site. In summary, the targets in the data set have low homology, the site-ligand bipartite graph is sparsely connected, and the average degree of binding sites is larger than that of ligands.
Figure 3  Investigation of the data set. A) The distribution of target sequence similarities. B) The degree distribution of the binding sites. C) The degree distribution of the ligands.
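The degree distributions described above can be sketched as follows. This is a minimal illustration, assuming the interaction data are available as a list of (site, ligand) pairs; the function names and the toy edge list are hypothetical and are not taken from the data set.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (site_hist, ligand_hist) for a bipartite edge list.

    edges: iterable of (site_id, ligand_id) pairs; duplicate pairs count once.
    Each histogram maps a degree to the number of nodes with that degree.
    """
    unique_edges = set(edges)
    site_degree = Counter(site for site, _ in unique_edges)
    ligand_degree = Counter(ligand for _, ligand in unique_edges)
    return Counter(site_degree.values()), Counter(ligand_degree.values())


def mean_degree(hist):
    """Average node degree computed from a degree histogram."""
    n_nodes = sum(hist.values())
    return sum(degree * count for degree, count in hist.items()) / n_nodes


# Hypothetical toy edges (assumption for illustration only).
toy_edges = [
    ("site_A", "lig_1"),
    ("site_A", "lig_2"),
    ("site_B", "lig_3"),
    ("site_C", "lig_4"),
    ("site_C", "lig_4"),  # duplicate record, counted once
]
site_hist, ligand_hist = degree_distribution(toy_edges)
print("site degree histogram:", dict(site_hist))
print("ligand degree histogram:", dict(ligand_hist))
print("mean site degree:", mean_degree(site_hist))
print("mean ligand degree:", mean_degree(ligand_hist))
```

In this toy graph every ligand has degree 1 while one site has degree 2, so the mean site degree exceeds the mean ligand degree, which mirrors the qualitative pattern reported for Figure 3B and 3C.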

Comparison results
Since the original representative methods were evaluated on different data sets, it would be unfair to compare their published results directly with those of our method. Therefore, we implemented these algorithms on our data set and evaluated all methods with multiple criteria: accuracy (ACC, the percentage of correct predictions), precision (the percentage of true positives among all predicted positives), recall (the percentage of actual positives that are correctly predicted) and area under the receiver operating characteristic curve (AUC, a comprehensive measure of classifier performance that usually lies between 0.5 and 1; the larger the better). The results are shown in Table 2.
Table 2  Comparison result of the prediction performances
Method  ACC  Precision  Recall  AUC
FIM  0.835  0.848  0.821  0.916
CS-PD  0.565  0.552  0.562  0.799
BLM-NII  0.727  0.712  0.812  0.858
RF  0.743  0.756  0.719  0.851
Table 2 shows that the ACC and AUC scores of CS-PD are 56.5% and 79.9% respectively, which means that its correct prediction rate is only slightly higher than random guessing (the expected correct rate of random guessing is 50%) and that its overall performance is poor. We suspect that CS-PD performs poorly because it lacks a powerful classifier and serves only as a feature extraction approach. BLM-NII performs well on our data set, but not as well as on its original data set (Yamanishi's "Gold Standard"). The AUC score of BLM-NII is 85.8% on our data set, whereas it exceeds 98% in all four categories (enzyme, ion channel, GPCR, nuclear receptor) of its original data set. The difference between the data sets could be the main cause of this gap. Unfortunately, crystal structures have not been determined for all targets in Yamanishi's data set, so we could not apply our approach to the "Gold Standard". The ACC and AUC scores of RF are 74.3% and 85.1% respectively, which are similar to those of BLM-NII. The bagging ensemble procedure might improve the predictive ability of the RF model. The ACC and AUC of FIM are 83.5% and 91.6% respectively, which are much higher than those of CS-PD, BLM-NII and RF. Compared with the state-of-the-art method (BLM-NII), ACC and AUC improve by more than 10 and 5 percentage points respectively. In short, FIM shows remarkable predictive ability and outperforms the other three approaches on our data set.
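For reference, the four evaluation criteria can be computed as in the sketch below. It assumes binary labels (1 for interacting, 0 for non-interacting) and a real-valued prediction score per pair; the decision threshold of 0.5 and the toy labels and scores are assumptions for illustration, not values from the experiments. AUC is computed here through its rank interpretation, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores (assumption for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("ACC:", accuracy(tp, tn, fp, fn))
print("Precision:", precision(tp, fp))
print("Recall:", recall(tp, fn))
print("AUC:", auc(y_true, scores))
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based rank formulation would be preferable for large data sets.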

The role of global information in the binding
Because the intensity of intermolecular interactions decreases rapidly with increasing distance, we can infer that local information may dominate the binding process. However, the local binding sites are inevitably influenced by the rest of the target. We adopt the similarity score of the target sequences, obtained from the KEGG database [24], as global information. The sequence similarities are measured as normalized Smith-Waterman scores [25]:
$$K_{glo}(t, t') = \frac{SW(t, t')}{\sqrt{SW(t, t)\, SW(t', t')}}$$
where t and t′ are protein sequences, SW(·, ·) is the original Smith-Waterman score, and K_glo is the target global similarity matrix (global information). Finally, the global and local information are integrated by the kernel trick, as follows:
$$K_{tar}(s_1^*, s_2^*) = \lambda K_{glo}(t_1^*, t_2^*) + (1 - \lambda) K_{loc}(s_1^*, s_2^*) \tag{11}$$
where $s_1^*$ and $s_2^*$ are binding sites, $t_1^*$ and $t_2^*$ are the target sequences corresponding to $s_1^*$ and $s_2^*$, and λ is the weight of the global information. After the introduction of global information, the kernels are no longer linear. We estimated the role of global information in binding by increasing λ. As λ increases, the AUC score first rises, reaches its maximum at λ = 0.3, and then decreases (Table 3). Although the AUC score varies with λ, it does so only within a narrow range (from 0.916 to 0.922), which implies that the global information has only a limited influence on prediction accuracy.
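The normalization and the weighted combination in Equation (11) can be sketched as follows. This is an illustration under stated assumptions: the Smith-Waterman routine uses a simplified scoring scheme (fixed match and mismatch scores with a linear gap penalty) rather than a substitution matrix, the local kernel value is passed in as a plain number because its computation is not part of this section, and all function names and example sequences are hypothetical.

```python
import math


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of two non-empty sequences.

    Simplified scoring (assumption): constant match/mismatch scores and a
    linear gap penalty instead of a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def k_glo(t1, t2):
    """Normalized Smith-Waterman score: SW(t1,t2) / sqrt(SW(t1,t1) * SW(t2,t2))."""
    return smith_waterman(t1, t2) / math.sqrt(
        smith_waterman(t1, t1) * smith_waterman(t2, t2)
    )


def k_tar(k_glo_value, k_loc_value, lam):
    """Weighted combination of global and local kernel values (Equation 11)."""
    return lam * k_glo_value + (1.0 - lam) * k_loc_value


# Hypothetical short sequences and a hypothetical local kernel value.
seq_1, seq_2 = "ACDE", "WCDY"
global_value = k_glo(seq_1, seq_2)
combined = k_tar(global_value, k_loc_value=0.8, lam=0.3)
print("K_glo:", global_value)
print("K_tar at lambda = 0.3:", combined)
```

The normalization gives every sequence a self-similarity of 1, so the global kernel values are comparable across targets of different lengths.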
Table 3  Local-global trade-off
λ  0.0  0.1  0.2  0.3  0.4  0.5
AUC  0.916  0.919  0.920  0.922  0.919  0.918
Another way to analyze the importance of global information is to compare the target kernel matrix (which includes both global and local information, λ = 0.3) with the local kernel matrix (Figure 4). The left, middle and right panels of Figure 4 are the global, global-local (λ = 0.3) and local kernel matrices respectively. Figure 4 shows that a large area of the global kernel matrix is blue, which means that most values in the global kernel matrix are small. Compared with the global kernel matrix, the values in the local kernel matrix are much larger, so the local kernel matrix determines the global-local kernel matrix. This would explain why the AUC score varies by less than 1% (Table 3) even when the weight of the global information is as high as 30%.
Figure 4  Kernel matrix of global and local information. The left panel is the global kernel matrix; the middle panel is the local kernel matrix; the right panel is the global-local kernel matrix (λ = 0.3). The global-local kernel matrix and the local kernel matrix are similar, because the norm of the global kernel matrix is small relative to that of the local kernel matrix.
Based on the above observations, it is reasonable to infer that the fluctuation caused by global information is limited and that local information dominates the predictive accuracy, which supports our assumption that target-ligand binding is a local event.
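One way to put a number on the visual comparison of kernel matrices is sketched below. It is not the analysis performed in this work: it uses the Frobenius norm together with a cosine-style alignment between matrices as a stand-in measure, and the two 3 × 3 matrices are hypothetical values chosen only so that the global entries are small relative to the local ones.

```python
import math


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def frobenius_norm(a):
    return math.sqrt(frobenius_inner(a, a))


def alignment(a, b):
    """Cosine similarity of two matrices viewed as flat vectors (scale invariant)."""
    return frobenius_inner(a, b) / (frobenius_norm(a) * frobenius_norm(b))


def combine(global_kernel, local_kernel, lam):
    """Element-wise lam * global + (1 - lam) * local."""
    return [
        [lam * g + (1.0 - lam) * l for g, l in zip(row_g, row_l)]
        for row_g, row_l in zip(global_kernel, local_kernel)
    ]


# Hypothetical matrices (assumption): small global values, larger local values.
global_kernel = [[1.0, 0.05, 0.02], [0.05, 1.0, 0.1], [0.02, 0.1, 1.0]]
local_kernel = [[5.0, 3.0, 1.0], [3.0, 6.0, 2.0], [1.0, 2.0, 4.0]]
target_kernel = combine(global_kernel, local_kernel, lam=0.3)

print("norm of global kernel:", frobenius_norm(global_kernel))
print("norm of local kernel:", frobenius_norm(local_kernel))
print("alignment(target, local):", alignment(target_kernel, local_kernel))
print("alignment(target, global):", alignment(target_kernel, global_kernel))
```

With these toy values the combined matrix stays closely aligned with the local matrix even at λ = 0.3, which illustrates how a small-norm global kernel can carry a sizeable weight and still change the combined kernel very little.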

Fragment interaction network analysis
In this section, we first give a brief overview of the fragment interaction matrix. Then we investigate the underlying chemical mechanisms of the fragment interactions.
An obvious feature of the fragment interaction matrix (Figure 5A) is that its values can be positive or negative, which means that some fragment interactions favor binding and others do not. Another obvious feature is that most of the values are close to zero, which means that the connections between site and ligand fragments are sparse. This sparsity implies that a site fragment can recognize only a small number of ligand fragments, which may reflect the specificity of target-ligand binding. Although there are 148653 (199 × 747) elements in the matrix, only those whose value is larger than 0.1 are viewed as significant (the average standard error is 0.1). As a result, there are 9243 significant interactions in the network. Among the significant interactions, those with values larger than 0.25 (the top 20%) are regarded as important. Figure 5B shows the important fragment interactions.
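The two-level filtering described above can be sketched as follows. The thresholds 0.1 and 0.25 come from the text; the comparison is applied to the raw value, as the text states, and filtering on the absolute value instead would be a different choice. The function names and the small matrix are hypothetical.

```python
def select_interactions(matrix, significant=0.1, important=0.25):
    """Return (significant, important) lists of (row, column, value) entries.

    An entry is significant if its value exceeds `significant`, and important
    if its value also exceeds `important`. Comparison uses the raw value,
    following the wording of the text (assumption: not the absolute value).
    """
    sig = [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value > significant
    ]
    imp = [entry for entry in sig if entry[2] > important]
    return sig, imp


def top_k(entries, k):
    """The k entries with the largest interaction values, in descending order."""
    return sorted(entries, key=lambda entry: entry[2], reverse=True)[:k]


# Hypothetical 2 x 3 interaction matrix (assumption for illustration only).
toy_matrix = [
    [0.02, 0.30, -0.40],
    [0.15, 0.00, 0.26],
]
sig_entries, imp_entries = select_interactions(toy_matrix)
print("significant:", sig_entries)
print("important:", imp_entries)
print("top 1:", top_k(sig_entries, 1))
```

The same `top_k` helper, applied with k = 20 to the full set of significant interactions, would give a ranking of the kind shown in Figure 5C.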
Figure 5  Interaction network analysis. A) An overview of the feature interaction network. The horizontal and vertical axes are ligand features and target features respectively. B) The important interaction network (a subnetwork of the fragment interaction network). C) The top twenty interactions. The interactions can reflect the chemical interaction.
According to our hypothesis, the feature interactions reflect chemical interactions, so it is necessary to investigate whether the feature interactions support this hypothesis. Since the number of interactions is large, we analyze only the top twenty interactions (Figure 5C); the others could be analyzed similarly. In Figure 5C, the first letter of a site fragment is the central amino acid of the trimer cluster, and the letters in parentheses represent the subordinate amino acids. The SMARTS strings (a kind of molecular pattern) represent ligand fragments. Figure 5C suggests that the feature interactions reflect the chemical interactions well, which is consistent with the hypothesis. For example, the major amino acid of site fragment 147 (TF147) is aspartic acid (D), which could interact with ligand fragment 92 (LF92, containing a keto group) through a hydrogen bond if the distance and orientation are appropriate. In some situations, the major amino acid of a target feature cannot form a significant interaction with the ligand feature, but a subsidiary amino acid can. For example, the major amino acid of site fragment 57 (TF57) is isoleucine (I), a hydrophobic amino acid. Isoleucine could not interact with ligand fragment 44 (LF44), which contains an amino group. However, the subsidiary amino acids of site fragment 57, such as threonine (T) and arginine (R), can form hydrogen bonds with ligand fragment 44 if the distance and orientation are appropriate.