Motif finding in genomic data We compare the performance of our method to three others, MEME [18], Gibbs Motif Sampler [39], and Projection [22]. We choose these particular methods as they are widely-used, readily accessibly for download via the internet, and can handle the lengths of motifs (11–48 bps) in our dataset. While a recent performance evaluation of motif finders [1] indicates that combinatorial methods such as Weeder [24] have somewhat better performance than other methods, most enumerative combinatorial methods, Weeder included, are not able to handle the lengths of the motifs in our bacterial dataset. Two of the methods we compare against, MEME and Gibbs Motif Sampler, are stochastic-search based algorithms. We run them requiring one motif instance per sequence and limiting the search to the primary sequence strand only, while leaving other parameters at their defaults. Gibbs Motif Sampler is run with 100 random restarts to allow for sufficient sampling of the search space, and MEME is allowed to execute its own algorithm to search the dataset for good starting points for EM. Note that Gibbs Motif Sampler failed to execute on three largest datasets, crp, ihf and purr when run on our local linux machines; these datasets were submitted through the web server. Projection method has a combinatorial component, combining the idea of locality-sensitive hashing with postprocessing by MEME, and unlike most enumerative combinatorial methods, Projection is not limited by motif length. Since many of the motifs in our dataset are not well conserved, we set the d parameter, assessing the average number of mismatches per motif instance, to it maximum suggested value of 0.25 of the motif length. Length parameters of the known motif are used for each dataset. We detail the performance of our algorithm on the full set of binding sites for 36 E. coli transcription factors in Table 2. Considering a motif to have been correctly identified if at least half of its sites were found with at least 25% overlap with the known site (as in [1], and essentially corresponding to datasets with sSn of at least 0.5), our method accurately discovers motifs for 22 transcription factors. Setting an e-value threshold at 1.0 (a lower threshold causes other methods to identify too few datasets as having significant motifs), we find statistically significant motifs for 33 datasets. Of the three transcription factor datasets with no significant solutions, one, hns, is a short motif, and the other two, oxyr and ihf, are very poorly conserved motifs with low information content (IC) [44], such that the average per-column IC for the known oxyr and ihf motifs are 0.89 and 0.36, respectively, whereas these values are 0.95 and 0.39 for their discovered motifs. In general, as compared to the motifs corresponding to the actual known transcription factor binding sites, the motifs found by our method exhibit equal or higher IC, measuring motif conservation in isolation from background sequence, as well as higher relative entropy, measuring the difference with the background distribution (Table 2). MEME reports motifs for all 36 transcription factor datasets, with e-values less than 1.0 for 20 of them. Gibbs Motif Sampler discovers motifs for 34 of the transcription factors, with 15 of them considered significant via their positive logMAP scores (no motifs are found for araC and flhCD). Projection reports motifs with no significance assessment for all 36 transcription factor datasets. The LP/DEE approach described in this paper has the best overall performance. Taking significance assessment into account, and considering all datasets with no significant motifs to have zero sSn and nPC values, our method produces 0.554 averaged sSn and 0.411 nPC values. Indeed, only two datasets, oxyr and ihf, have motifs that are deemed insignificant using our scheme yet have non-zero overlap with the actual motifs. Performance statistics for MEME and Gibbs Motif Sampler are considerably lower with the averaged sSn of 0.338 and nPC of 0.257 for Gibbs, and corresponding sSn and nPC values of 0.382 and 0.285 for MEME. Since Projection does not report significance values, we also note averages of raw coefficients for overlap with the known motifs while ignoring significance assessments. Our method still outperforms the others, though not as significantly, with sSn and nPC values of 0.565 and 0.414 for LP/DEE; 0.550 and 0.402 for Gibbs; 0.501 and 0.358 for MEME; 0.560 and 0.407 for Projection. We also show sSn and nPC values while ignoring significance for each of the three other methods compared to LP/DEE in Figure 1, only displaying transcription factor datasets for which a difference in performance is observed. Each bar in the chart measures the difference in sSn (Figures 1(a)–1(c)) or nPC (Figures 1(d)–1(f)) between our method and one of the other methods. Using both the sSn and the nPC statistics, LP/DEE performs better than any of the three other approaches in identifying known binding sites. For example, for LP/DEE versus MEME, very large differences are observed for three transcription factors, with our method identifying narL, glpR, and ntrC motifs almost completely, and MEME entirely misidentifying them. Moreover, the LP/DEE method exhibits better performance on more transcription factor datasets than the other methods. For example, considering nPC, LP/DEE performs better than MEME on eleven datasets, and worse than it on six datasets (Figure 1(e)). Differences in performance with Gibbs Motif Sampler and Projection are smaller; for instance, the LP/DEE method exhibits better performance than Projection using the sSn statistic on six datasets versus worse than it on two datasets (Figure 1(c)). We note that if significance assessments are included and motifs with e-value greater than 1.0 are discarded (see Additional Figures 1(a) and 1(b)), then LP/DEE has better nPC than MEME on 16 datasets, and worse nPC on three datasets, suggesting that MEME's significance computation is unnecessarily conservative for our dataset; the same applies to Gibbs Motif Sampler as well. Figure 1 Performance comparison for the LP/DEE method with Gibbs Motif Sampler, MEME and Projection. Performance comparison of the LP/DEE method with Gibbs Motif Sampler, MEME, and Projection when identifying E. coli regulatory sites. Performance is given in terms of the site level sensitivity 1(a)-1(c) and nucleotide performance coefficient 1(d)-1(f). Significance assessment is disregarded. For every transcription factor dataset, the height of the bar indicates the difference in the metric, with bars above zero specifying better performance for LP/DEE and bars below zero otherwise. Plotted are only those datasets for which there is a difference in performance between the pair of methods being compared. Our approach finds provably optimal solutions for 32 of the 36 datasets. Notwithstanding, our method also exhibits excellent runtimes for most problems. Of the 36 transcription factors we considered, 25 were solved in seconds with the application of clique-bounds DEE, some using tighter bounds constrained by three-way alignments. Seven required the application of graph decomposition with tighter clique-bounds DEE, and took a few minutes to three hours to solve. For the remaining four datasets, we used more computationally intensive speculative pruning. Interestingly, we found highly significant solutions for three of these datasets, albeit without the guarantee of optimality, and no significant solution for one. For each of them we provide a bound on the significance value of a potential optimal solution according to the method detailed in the section above. The e-values of the obtained motif and the lower bound in parentheses are: crp, 3.08 × 10-9 (6.04 × 10-33); fis, 1.37 × 10-6 (2.29 × 10-7); soxS, 1.26 × 10-9 (5.25 × 10-14); and ihf, 2.26 × 10+8 (3.98 × 10-31). Finally, in the entire data collection, all but two of the problems resulted in integral solutions to their LPs. The remaining instances with fractional solutions were easily solved by the ILP solver. Simulated data We also evaluate the effectiveness of our scoring scheme in finding binding sites for five regulatory proteins when they are embedded in simulated data. Our goals are twofold. First, since our underlying scoring measure is based on counting matches between nucleotides, it is important to see how well it performs in compositionally biased backgrounds. In the E. coli dataset, even a simple scoring scheme that assigns a score of 1 to matches and 0 to mismatches performs competitively (data not shown). However, since other genomes can have considerably more biased nucleotide compositions, our scoring scheme rewards matches between more rare nucleotides, and we test here how it performs in different scenarios. Second, while it is essential to test the performance of motif finding algorithms on genomic data (as above), it is possible that there are other conserved motifs in the data, besides those with which we are evaluating performance, and these conserved motifs lead to lower nPC and sSn measurements. Simulated data is not expected to have other conserved motifs, and thus provides a cleaner, though perhaps optimistic, means for testing motif finding approaches. In our testing on simulated data, we use a selection of five transcription factor datasets with motifs of varying levels of conservation, as measured by their IC (Table 3). We generate background sequences with uniform nucleotide distributions, as well as those with increasingly biased probability distributions. A background sequence for a particular binding site is generated of length equal to that of its upstream region (up to 600 bps). In particular, for each position, a base is selected at random according to a probability distribution in which base G is chosen with some probability pr(G) and the other bases with probability (1 - pr(G))/3 each. Table 3 Scoring method evaluation in terms of performance coefficient in biased-composition simulated data. Performance of LP/DEE in biased-composition simulated data. The first column identifies the TF dataset. The second column measures the degree of conservation of the known motif, as measured by average per-column information content [44]. The rest of the columns list the nucleotide performance coefficient of the LP/DEE method with the probability of base G indicated in the column heading and the frequencies of all other bases split equally. TF IC Bias 0.25 0.5 0.75 0.9 araC 1.00041 0.8113 0.9592 0.9592 0.9592 cpxR 1.17034 1.0000 0.8261 0.9811 0.9811 dnaA 1.45351 1.0000 0.7647 0.7647 1.0000 galR 1.34756 0.8824 0.8824 1.0000 1.0000 narP 1.40273 1.0000 1.0000 1.0000 1.0000 Our nPC performance is summarized in Table 3 for various background distributions. We find motifs of very high nPC values in varying biased nucleotide composition, attesting to the fact that our scoring scheme is successfully able to correct for bias in sequence composition. Moreover, as expected, performance on simulated data is better than that for actual genomic sequence. In the narP dataset, for example, the motif is found perfectly in simulated data and not at all in real genomic data. Additionally, an alternate highly conserved site is found by all four methods in genomic data (Table 2), suggesting that while the narP site is well-conserved, the corresponding genomic sequences contain another shared motif of higher conservation.