Detecting bacterial regulatory elements

We apply our method to identify the binding sites of 36 E. coli regulatory proteins. We construct our dataset from that of [6,28], as described in [32]. We locate each binding site within the genome and extract up to 600 bp of DNA sequence upstream of the gene it regulates. We remove binding sites for sigma factors, binding sites for transcription factors with fewer than three known sites, and binding sites that could not be unambiguously located in the genome. Motif length parameters are set as reported by [28], except for crp, where a length of 18 instead of 22 is used. Background nucleotide frequencies are computed from the upstream regions of each dataset individually. The final dataset consists of 36 transcription factors, each regulating between 3 and 33 genes, with binding site length ranging between 11 and 48 (see Table 2).

Table 2 Listing of the transcription factors' datasets (columns 1, 2, and 3) and the results of motif finding by LP/DEE. TF is the transcription factor dataset. Seq is the number of input sequences. Len is the length of the motif searched for. The remaining measures refer to the motifs discovered by the LP/DEE algorithm: IC is the average per-column information content [44]; RE is the average per-column relative entropy; E-value is the e-value, computed according to our statistical significance assessment; nPC is the nucleotide level performance coefficient; and sSn is the site level sensitivity. The four starred entries indicate potentially non-optimal solutions; entries marked with † indicate use of the ILP solver.
TF       Seq  Len  IC      RE      E-value          nPC     sSn
ada      3    31   1.3000  1.0846  9.16 × 10^-1     0.1341  0.33
araC     4    48   1.1437  0.9940  1.15 × 10^-3     0.3474  0.50
arcA     11   15   1.2505  1.1992  4.31 × 10^-6     0.4224  0.73
argR     8    18   1.2990  1.2149  1.30 × 10^-7     0.2857  0.50
cpxR     7    15   1.3290  1.2337  1.09 × 10^-5     0.5556  0.71
crp*†    33   18   0.7196  0.7045  3.08 × 10^-9     0.5570  0.76
cytR     5    18   1.2317  1.1069  2.48 × 10^-1     0.0588  0.20
dnaA     6    15   1.4535  1.3300  6.12 × 10^-6     1.0000  1.00
fadR     5    17   1.3466  1.2074  1.33 × 10^-2     0.5455  0.80
fis*     8    35   0.8927  0.8376  1.37 × 10^-6     0.1966  0.38
flhCD    3    31   1.3942  1.1656  4.79 × 10^-3     0.0000  0.00
fnr      10   22   1.1025  1.0476  1.85 × 10^-9     0.6176  0.80
fruR     10   16   1.2094  1.1491  5.52 × 10^-8     0.8182  0.90
fur      7    18   1.3285  1.2332  1.28 × 10^-8     0.4237  0.71
galR     7    16   1.5445  1.4347  1.52 × 10^-16    0.5034  0.71
glpR     4    20   1.4227  1.2441  2.63 × 10^-2     0.5534  0.75
hns      5    11   1.5175  1.3660  2.25             0.0000  0.00
ihf*     19   48   0.3932  0.3859  2.26 × 10^8      0.0381  0.16
lexA     17   20   1.1481  1.1192  1.01 × 10^-40    0.7215  0.88
lrp      4    25   1.2879  1.1237  6.44 × 10^-2     0.0989  0.25
malT     6    10   1.5071  1.3815  1.73 × 10^-1     0.0000  0.00
metJ     5    16   1.6842  1.5195  3.37 × 10^-12    0.6495  1.00
metR     6    15   1.3097  1.1970  6.57 × 10^-2     0.0000  0.00
modE     3    24   1.5618  1.3145  3.95 × 10^-4     1.0000  1.00
nagC     5    23   1.2795  1.1462  1.03 × 10^-3     0.0360  0.20
narL     10   16   1.1391  1.0828  8.06 × 10^-4     0.8182  0.90
narP     4    16   1.4534  1.2737  7.48 × 10^-4     0.0000  0.00
ntrC     4    17   1.6621  1.4605  1.28 × 10^-8     0.6386  1.00
ompR     4    20   1.3566  1.1860  4.27 × 10^-6     0.0000  0.00
oxyR     4    39   1.0965  0.9521  2.64             0.0796  0.25
phoB     8    22   1.1567  1.0835  4.14 × 10^-9     0.8051  1.00
purR     20   26   0.8305  0.8147  1.53 × 10^-37    0.7247  0.95
soxS*†   11   35   0.7771  0.7453  1.26 × 10^-9     0.0815  0.27
trpR     4    24   1.4069  1.2291  3.74 × 10^-6     0.8462  1.00
tus      5    23   1.5839  1.4276  1.05 × 10^-17    0.8400  1.00
tyrR     10   22   1.0693  1.0159  3.63 × 10^-9     0.5017  0.70

We evaluate the overlap between motif predictions made by our approach and the known motifs using the nucleotide level performance coefficient (nPC) [1,17].
Let nTP, nFP, nTN, and nFN refer to nucleotide level true positives, false positives, true negatives, and false negatives, respectively. For example, nTP is the number of nucleotides in common between the known and predicted motifs. The nPC is defined as nTP/(nTP + nFN + nFP); it is a stringent statistic, penalizing a method both for failing to identify any nucleotide belonging to the motif and for falsely predicting any nucleotide not belonging to the motif. Though nPC takes both false positives and false negatives at the nucleotide level into account, we also find it useful to consider site level statistics. Following [1], we consider two sites to be overlapping if they overlap by at least one-quarter of the length of the site. Defining site level statistics similarly to the nucleotide level statistics above (e.g., site level true positives, sTP, is the number of known sites overlapped by predicted sites), site level sensitivity sSn is sTP/(sTP + sFN).
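The two statistics defined above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the evaluation code used for Table 2: the (sequence id, start, end) representation of a site, the function names npc and ssn, and the choice to measure the one-quarter overlap against the length of the known site are all assumptions made here for concreteness.

```python
from typing import Iterable, Set, Tuple

# Assumed site representation: (sequence id, start, end), 0-based, end exclusive.
Site = Tuple[str, int, int]


def positions(sites: Iterable[Site]) -> Set[Tuple[str, int]]:
    """Expand sites into the set of (sequence id, nucleotide position) pairs."""
    return {(seq, p) for seq, start, end in sites for p in range(start, end)}


def npc(known: Iterable[Site], predicted: Iterable[Site]) -> float:
    """Nucleotide level performance coefficient: nTP / (nTP + nFN + nFP)."""
    k, p = positions(known), positions(predicted)
    ntp = len(k & p)   # nucleotides in both known and predicted sites
    nfn = len(k - p)   # known nucleotides that were missed
    nfp = len(p - k)   # predicted nucleotides outside any known site
    denom = ntp + nfn + nfp
    return ntp / denom if denom else 0.0


def overlap(a: Site, b: Site) -> int:
    """Number of shared nucleotides between two sites (0 if different sequences)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def ssn(known: Iterable[Site], predicted: Iterable[Site],
        min_frac: float = 0.25) -> float:
    """Site level sensitivity: sTP / (sTP + sFN).

    Assumption: a known site counts as found if some predicted site covers
    at least min_frac of the known site's length.
    """
    known, predicted = list(known), list(predicted)
    stp = sum(
        1 for k in known
        if any(overlap(k, p) >= min_frac * (k[2] - k[1]) for p in predicted)
    )
    sfn = len(known) - stp
    return stp / (stp + sfn) if known else 0.0


# Hypothetical toy data: one half-overlapping prediction, one complete miss.
known_sites = [("s1", 10, 20), ("s2", 0, 10)]
predicted_sites = [("s1", 15, 25), ("s2", 50, 60)]
print(npc(known_sites, predicted_sites))  # 5 / (5 + 15 + 15)
print(ssn(known_sites, predicted_sites))  # 1 / (1 + 1)
```

In this toy case the prediction in s1 shares 5 of 10 nucleotides with the known site, so it counts as a site level hit, while the nucleotide level score is pulled down by the 15 missed and 15 spurious positions. This shows how sSn can be considerably more lenient than nPC for the same prediction.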