Simulated data

We also evaluate the effectiveness of our scoring scheme in finding binding sites for five regulatory proteins when they are embedded in simulated data. Our goals are twofold. First, since our underlying scoring measure is based on counting matches between nucleotides, it is important to see how well it performs in compositionally biased backgrounds. In the E. coli dataset, even a simple scoring scheme that assigns a score of 1 to matches and 0 to mismatches performs competitively (data not shown). However, since other genomes can have considerably more biased nucleotide compositions, our scoring scheme rewards matches between rarer nucleotides, and we test here how it performs in different scenarios. Second, while it is essential to test the performance of motif finding algorithms on genomic data (as above), the data may contain other conserved motifs besides those against which we evaluate performance, and such motifs would lower the nPC and sSn measurements. Simulated data is not expected to contain other conserved motifs, and thus provides a cleaner, though perhaps optimistic, means of testing motif finding approaches.

In our testing on simulated data, we use a selection of five transcription factor datasets with motifs of varying levels of conservation, as measured by their IC (Table 3). We generate background sequences with uniform nucleotide distributions, as well as those with increasingly biased probability distributions. The background sequence for a particular binding site is generated with length equal to that of its upstream region (up to 600 bp). In particular, for each position, a base is selected at random according to a probability distribution in which base G is chosen with some probability pr(G) and each of the other bases with probability (1 - pr(G))/3.

Table 3 Scoring method evaluation in terms of performance coefficient in biased-composition simulated data.
Performance of LP/DEE in biased-composition simulated data. The first column identifies the TF dataset. The second column gives the degree of conservation of the known motif, as measured by average per-column information content [44]. The remaining columns list the nucleotide performance coefficient (nPC) of the LP/DEE method, with the probability of base G indicated in the column heading and the remaining probability split equally among the other bases.

                   Bias, pr(G)
TF     IC          0.25     0.5      0.75     0.9
araC   1.00041     0.8113   0.9592   0.9592   0.9592
cpxR   1.17034     1.0000   0.8261   0.9811   0.9811
dnaA   1.45351     1.0000   0.7647   0.7647   1.0000
galR   1.34756     0.8824   0.8824   1.0000   1.0000
narP   1.40273     1.0000   1.0000   1.0000   1.0000

Our nPC performance for the various background distributions is summarized in Table 3. We find motifs with very high nPC values across backgrounds of varying nucleotide bias, attesting that our scoring scheme successfully corrects for bias in sequence composition. Moreover, as expected, performance on simulated data is better than on actual genomic sequence. In the narP dataset, for example, the motif is found perfectly in simulated data and not at all in real genomic data. Additionally, an alternate highly conserved site is found by all four methods in genomic data (Table 2), suggesting that while the narP site is well conserved, the corresponding genomic sequences contain another shared motif of higher conservation.
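The background model used for the simulated data (base G drawn with probability pr(G), each other base with probability (1 - pr(G))/3) can be illustrated with a minimal sketch. This is not the code used in the study: the function names `base_probabilities` and `generate_background`, the use of Python's standard `random` module, and the seeding are assumptions made here for illustration, and how the known binding sites are embedded into the background is not modeled.

```python
import random

BASES = "ACGT"


def base_probabilities(pr_g):
    """Return {base: probability}: G gets pr_g, the rest share 1 - pr_g equally."""
    if not 0.0 <= pr_g <= 1.0:
        raise ValueError("pr_g must lie in [0, 1]")
    other = (1.0 - pr_g) / 3.0
    return {base: (pr_g if base == "G" else other) for base in BASES}


def generate_background(length, pr_g, rng=None):
    """Draw `length` bases independently from the biased distribution.

    `rng` is an optional random.Random instance (hypothetical parameter,
    added here so that runs can be made reproducible).
    """
    rng = rng or random.Random()
    probs = base_probabilities(pr_g)
    weights = [probs[base] for base in BASES]
    # random.choices samples with replacement according to the given weights.
    return "".join(rng.choices(BASES, weights=weights, k=length))
```

For example, `generate_background(600, 0.9, random.Random(0))` returns a 600-base sequence in which roughly 90% of positions are G, while `pr_g = 0.25` gives the uniform background.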