3 Can other information help?

The result of the multiple regression suggests a type of information that may help capture the hidden information in the motif's binding sites: the conservation of the binding sites' positions in the promoter sequences. This has been discussed in previous work (see [12]), but it has never been integrated into the objective functions of the commonly used tools.

As discussed above, the log likelihood ratio alone is unlikely to distinguish the true binding sites from the background noise. Figure 6(a) shows a different view of Figure 1. The (inaccurate) predictions from MEME are not only false positives relative to the planted motifs, but are perhaps also the hardest ones to separate from the true binding sites. A simple horizontal-line classifier clearly cannot separate the true binding sites from the predictions.

In Figure 6(b), we introduce a second number for each dataset: we performed a Kolmogorov-Smirnov test on the positions of the binding sites and calculated its p-value, assuming a uniform distribution as the background model. On the resulting 2D plane, the axes correspond to the motifs' conservation in sequence and in position. It is easy to see that even a straight-line classifier $y - ax - b = 0$ separates the two sets reasonably well. Let $Pr_{llr}$ be the y value, the negative log p-value of the log likelihood ratio, and let $Pr_{pos}$ be the x value, the negative log p-value of the Kolmogorov-Smirnov test described above. Most true binding sites satisfy $a\,Pr_{pos} - Pr_{llr} + b > 0$, and most false predictions of MEME satisfy $a\,Pr_{pos} - Pr_{llr} + b < 0$. The straight line in Figure 6(b) has parameters a = 13.5, b = 21.

Figure 6 Position conservation information helps classification. In both panels, the y-axis is the negative log p-value of the log likelihood ratio of a motif in a dataset. The x-axis in (a) is the dataset index; in (b) it is the p-value of the Kolmogorov-Smirnov test on the positions of a motif's binding sites, assuming a uniform distribution. Each point represents one dataset.
For the same reason as in Figure 1, no "Real" type datasets are included. In (b) the straight green line separates the two sets of points reasonably well.

This interesting result suggests a new form of objective function, $a\,Pr_{pos} - Pr_{llr}$. We evaluated it on the planted motifs against MEME's predictions, using the value of a calculated from Figure 6(b). Figure 7 displays a very promising result: for all but one of the datasets, the planted motif has a higher score than MEME's prediction. Of course, this comparison is somewhat unfair to MEME, as it was not trying to optimize this function. But we cannot help asking: if we optimize this form of objective function, will we be able to improve on the predictive accuracy of MEME and other tools? The idea is very tempting, at least. Of course, the objective function eventually pursued may be some other function of these two types of conservation information: it is not necessarily linear, and even if it is linear, the coefficients a and b may vary from dataset to dataset.

Figure 7 A new objective function using position information. The figure shows the same test as in Figure 1 on a new objective function. The x-axis gives the indices of the datasets; the y-axis gives the value of the objective function for the motif, either planted (red points) or predicted by MEME (blue points). Only datasets of "Generic" and "Markov" types are tested. For all but one of the datasets, the planted motif has a higher score than MEME's prediction.
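
The two quantities combined in Figures 6(b) and 7 can be illustrated with a minimal Python sketch. It is not the implementation used for the figures. The function names, the promoter length of 1000, and the example positions are hypothetical. The p-value uses a standard asymptotic approximation of the Kolmogorov distribution, and the natural logarithm is assumed for the negative log p-value; the text does not specify either choice. Only a = 13.5 and b = 21 are taken from the text.

```python
import math


def ks_uniform_pvalue(positions, promoter_length):
    """One-sample Kolmogorov-Smirnov p-value against Uniform(0, promoter_length).

    Assumption: positions are offsets from the same end of each promoter.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("at least one binding-site position is required")
    # Rescale to [0, 1], where the uniform CDF is simply F(u) = u.
    u = sorted(p / promoter_length for p in positions)
    # D is the largest gap between the empirical CDF and the uniform CDF.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))
    # Asymptotic Kolmogorov series with a small-sample correction (approximate).
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the p-value is indistinguishable from 1 in this range
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def neg_log_p(p, floor=1e-300):
    """Negative natural log of a p-value; the floor avoids log(0)."""
    return -math.log(max(p, floor))


def position_aware_score(pr_pos, pr_llr, a=13.5):
    """Candidate objective a*Pr_pos - Pr_llr (a read off Figure 6(b))."""
    return a * pr_pos - pr_llr


def on_true_site_side(pr_pos, pr_llr, a=13.5, b=21.0):
    """True if the point satisfies a*Pr_pos - Pr_llr + b > 0."""
    return a * pr_pos - pr_llr + b > 0


# Hypothetical site positions in promoters of length 1000.
clustered = [480, 485, 490, 495, 500, 505, 510, 515]  # position-conserved
spread = [62, 187, 312, 437, 562, 687, 812, 937]      # roughly uniform

pr_pos_clustered = neg_log_p(ks_uniform_pvalue(clustered, 1000))
pr_pos_spread = neg_log_p(ks_uniform_pvalue(spread, 1000))
```

In this sketch the clustered positions yield a larger Pr_pos than the evenly spread ones, so for the same Pr_llr they receive the higher value of the objective. Whether a single pair (a, b) transfers across datasets remains the open question raised above.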