1 Are the objective functions informative? The first step to design a new algorithm for the motif discovery problem is to choose a proper objective function. This is critical because the objective function implements the designer's understanding of the protein-DNA interaction model. Searching for candidates that optimize the objective function is a major step to pull out the candidate binding sites from the background sequences. An ideal objective function should be able to assign the optimal score to the true motif binding sites and nowhere else. Although there are numerous tools available, surprisingly the types of objective functions are not as many. Here we examined three popular objective functions. Theoretically, for each objective function we would test whether the score of the planted binding sites is superior to the scores of all other sets of words in the background sequences which are false positive predictions. This, of course, is impractical. In practice, we chose one tool that applies this objective function and compared the tool's prediction, which unfortunately is often a false positive, with the planted motif. If the planted motif has a better score, then the gap between the two scores shows the least extent to which the tool misses the global optimum of the objective function. On the other hand, if the prediction scores higher, it would suggest that the objective function is not accurate enough to model the true binding sites. Log likelihood ratio This ratio and its associated forms are used by most alignment-driven algorithms to assess the significance of motif candidates. When the candidates are of different lengths, the p-value of the ratio is used. A method to compute the p-value is described in [3]. The log likelihood ratio of the predicted motif m is where X is the set of sequences in the dataset, Pr(X|φ, Z) is the likelihood of the sequences X given the motif model φ and its binding sites Z, and Pr(X|p0) gives the likelihood of the sequences assuming the background model p0. MEME [4] carries out an EM-based procedure to search for a model that maximizes the likelihood ratio. The local optimum can sometimes be avoided by rerunning the program with different initializations. Figure 1 depicts, for each dataset from [1], the scores (the p-values of the log likelihood ratio in the negative logarithm scale) of MEME's predictions and the planted binding sites. For most datasets, the predictions of MEME have higher scores than the planted motifs. We conclude that even an algorithm guaranteeing the global optimal solution for the log likelihood ratio function will miss the true binding sites in these datasets, because this objective function does not accurately capture the nature of the binding sites. Figure 1 Objective function: p-Value of log likelihood ratio in negative logarithm scale. The figure exhibits the comparison of the p-value of the log likelihood ratio between the planted motifs ("TFBS" in the legend) and that of MEME's predictions for selected datasets from [1]: we use only "Generic" and "Markov" among the three types of datasets (see [1]), because in "Real" type datasets the predictions are possibly genuine binding sites of some unannotated transcription factor other than the ones planted. The datasets are sorted in ascending order of TFBS scores for clarity. For each dataset, there are two scores: the score of TFBS and the score of MEME's prediction. Points on the x-axis correspond to the datasets for which MEME didn't make any prediction. Now, consider one dataset in detail. The dataset is an example for which the planted motif has a higher log likelihood ratio score than MEME's prediction, yet we argue that log likelihood ratio still doesn't work well as an objective function in this case. In a way, the motif-searching problem is a classification problem: all the words of a certain length appearing in the sequences should be partitioned into two classes: the binding sites, and all the others. Training the optimal classifier equates to searching for the optimal candidate motif model. When the log likelihood ratio is applied as the objective function, the ultimate classifier would be a threshold of the log likelihood ratio score so that all the binding sites are above the threshold, and all the others are below it. A classifier corresponding to a good prediction can achieve a decent balance between the false positives and false negatives of the classification. Vice versa, if no threshold is satisfactory enough to classify the words, no good prediction can be found under this motif model. To test the classifiability of this dataset, we calculated the log likelihood ratio scores of all the words in it, including the true binding sites, and tried out various threshold values to classify the words. Among those having scores above the threshold, the numbers of words are counted which belong to binding sites and which belong to the background sequences. Figure 2 indicates that no matter what threshold we choose to identify the binding sites of the motif, we won't be able to find a value to achieve an acceptable balance between the sensitivity and the specificity of the classification. For example, to correctly classify all 11 true binding sites, the threshold must be chosen so low that 130 false positives are classified as binding sites of the motif. Figure 2 Classifiability using log likelihood ratios as thresholds. Each bar stands for a value of the cut-off threshold to distinguish the binding sites of the motif from background. The pair of numbers on the top of each bar indicate the number of false positives(FP) and the number of false negatives(FN) resulting from the classification. It is therefore fair to say that log likelihood ratio alone will not be able to separate the true motif from the background noise. We will return to it later. Z-score The Z-score measures the significance of predictions based on their over-representation. YMF [5] searches a restricted space defined by a consensus motif model and finds the candidates with the top Z-scores. The form of the Z-score is as follows: where obs(m) is the actual number of occurrences of the motif m, E(m) is the expected number of its occurrences in the background model, and σ(m) is the standard deviation. Consensus-based algorithms such as YMF are sometimes criticized for not being able to incorporate the true binding sites into the motif model. To focus on the objective function and spare the limitation induced by the consensus motif model, we fantasize a motif model for each dataset that comprises the planted binding sites completely and exclusively. We calculate the Z-scores of the predictions and the planted motifs for selected datasets, as shown in Figure 3. Note that the competition is actually not fair: with an expanded motif search space, the new optimum should be at least as high as the current prediction. Nevertheless, we consider the Z-score of the prediction as a touchstone: any score lower than it will not be competitive in the new search space. From Figure 3, we see that is exactly what happens in nearly all of the tested datasets. Note the similarity to results as shown in Figure 1 in the sense of our test: statistical over-representation as measured by Z-score does not necessarily mean binding preference either. Figure 3 Objective function: Z-score. The figure shows the comparison between the Z-scores of the planted motifs (TFBS in the legend) and the predictions of YMF for some datasets. For the sake of comparison simplicity, we only used datasets ("Generic" and "Markov" types only, for the same reason as in Figure 1) when predictions of YMF and the planted motifs have the same length. Sequence specificity Another type of objective function emphasizes the likelihood that most, if not all, sequences are potentially bound by the transcription factor. That means a prediction having multiple binding sites in one sequence and none in the others is much less significant than a prediction having a balanced number of binding sites in each sequence. This idea is designed into ANN-Spec [6] and Weeder [7]. The objective function, named sequence specificity, is defined in [7] as follows. where Ei(m|p0) is the expected number of motif m's occurrences in sequence i assuming the background model p0, and L is the total number of sequences in the dataset. We calculated the scores of the predictions of Weeder and ANN-Spec and the planted motifs. The planted motif has a higher score than the predictions of the tools for most datasets, as illustrated in Figure 4. The obvious gap between the scores of planted binding sites and the predictions reflects a lack of optimum of the search strategies adopted by these tools. Recall that ANN-Spec is a generalized version of SEM (Stochastic EM), and Weeder uses a greedy and heuristic search method. Figure 4 Objective function: sequence specificity score. The figure shows the comparison between the sequence specificity scores of the planted motifs (named TFBS in the legend) and the predictions of Weeder and ANN-Spec. For the same reason as in Figure 1, only datasets of "Generic" and "Markov" types are tested. The x-axis tells the indices of the datasets. The datasets are sorted in ascending order of TFBS scores for clarity. For each dataset, there are three scores: the score of TFBS motif, Weeder's prediction, and ANN-Spec's prediction, colored in red, blue and green respectively. Points on the x-axis corresponds to the datasets for which the tool didn't make any prediction. Comparing Figure 4 with the other objective functions (Figure 1, 3), this result shows certain promise that using the sequence specificity score may often lead to the true binding sites. From objective function point of view solely, sequence specificity seems to have the edge for our datasets. An assumption of this objective function is that most sequences in the datasets should have binding sites of the motif. Although our data shows that tools such as Weeder and ANN-Spec are not too sensitive to the slight departure from this assumption, we have not tested them on datasets with more deviation. The Z-score function is based on the statistical over-representation solely without any reference to biological theories. The log likelihood ratio relies on high-quality non-gapped alignments, but it's not clear that non-gapped alignments are powerful enough to model the true binding sites. No objective function meets our standard that all planted motifs should have scores at least as high as those of the predictions. We need to understand better the conservation information hidden among those binding sites.