Real sequence data
To evaluate the statistical power provided by different facets of the PhyloScan approach in real sequence data, we measured the increase in sensitivity originating from three sources: a reduction in database size, the use of aligned sequence data only, and the use of non-alignable ortholog data.
As a stripped-down baseline, we applied PhyloScan in a scan of the full E. coli sequence database, ignoring all other sequence data; this baseline is equivalent to the original Staden method, and thus has the same statistical power.
We compared the baseline to the results achievable from a reduced database. When orthologous sequences are aligned between closely related species, gaps may be introduced, and there are often portions of the sequence that do not align; thus, the overall feasible search space for transcription factor binding sites is reduced. A search of such a reduced database can, in and of itself, detect more statistically significant transcription factor binding sites than a search of the full set of intergenic regions from a single species. Therefore, the scanning results from a database reduced in size, yet containing data from only one species, provide a measure of the increase in sensitivity, relative to the baseline scan, that is due simply to a reduction in search space.
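The effect of search-space size on significance can be illustrated with a short calculation. This is only a sketch under a simplifying assumption that is not taken from the paper: every candidate position is treated as an independent trial, so the chance that at least one of N positions scores as well as a site with per-position p-value p is 1 - (1 - p)^N. The function name and all numbers below are hypothetical.

```python
import math

def p_at_least_one(p_site: float, n_positions: int) -> float:
    """Probability that at least one of n independent positions
    scores as well as a site with per-position p-value p_site."""
    # 1 - (1 - p)^n, evaluated with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_positions * math.log1p(-p_site))

p_site = 1e-7          # hypothetical per-position p-value of a candidate site
full_space = 500_000   # hypothetical number of positions in the full database
reduced_space = 300_000  # hypothetical number of positions after reduction

p_full = p_at_least_one(p_site, full_space)
p_reduced = p_at_least_one(p_site, reduced_space)

print(f"full database:    {p_full:.4f}")
print(f"reduced database: {p_reduced:.4f}")
```

Under this assumption the same site looks more significant in the smaller database, but only roughly in proportion to the reduction in size, so a modest reduction yields a modest gain.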
We compared the baseline and reduced-database results to those obtained by scanning a database of aligned E. coli-S. typhi sequences, in order to measure the increase in sensitivity provided by the use of this aligned sequence data.
To test these sources of statistical power, we generated databases of promoter-containing E. coli intergenic regions, aligned E. coli-S. typhi intergenic regions, and motif models based on known Crp and PurR sites (see Methods). Specifically, the three databases contained: (1) the set of all E. coli intergenic regions, (2) the E. coli sequences extracted from the alignments of E. coli-S. typhi orthologous intergenic regions, and (3) the E. coli-S. typhi aligned intergenic regions data. Relative to the original method of Staden, our results show a large improvement in the number of predicted transcription factor binding sites due to the alignment of two fairly closely related species (Table 1 and Figures 4 and 5). Specifically, with a q-value cutoff of 0.001 (see Methods), scanning the set of all E. coli intergenic sequences yields only one Crp-significant intergenic region (with two predicted Crp sites) and one PurR-significant intergenic region (with one PurR site). No improvement was obtained with the reduced database of E. coli intergenic sequences. However, when the set of E. coli-S. typhi aligned sequences was scanned, 10 Crp-significant intergenic regions (with 13 Crp sites in total) and 12 PurR-significant intergenic regions (with 13 PurR sites in total) were predicted.
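The q-value actually used is defined in the Methods. As a generic illustration of how a q-value cutoff selects significant intergenic regions, the sketch below assumes the common step-up definition in which the proportion of true nulls is taken to be 1 (equivalent to Benjamini-Hochberg adjusted p-values). The function name, the region labels, and the p-values are hypothetical and are not results from this study.

```python
def step_up_q_values(p_values):
    """Step-up q-values (Benjamini-Hochberg adjusted p-values).

    Assumption for illustration only: the proportion of true null
    hypotheses is taken to be 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical per-region p-values
regions = ["region_A", "region_B", "region_C", "region_D"]
p_vals = [1e-7, 4e-4, 0.03, 0.6]

q_vals = step_up_q_values(p_vals)
significant = [r for r, q in zip(regions, q_vals) if q <= 0.001]
print(significant)
```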
Table 1  Summary of PhyloScan Predictions
Column  C1  C2  C3  C4  C5  C6
E. coli Sequence Data  Full (a)  Full (a)  Reduced (b)  Reduced (b)  Reduced & Aligned (c)  Reduced & Aligned (c)
Indep. Species  No  Yes  No  Yes  No  Yes
Crp Known (d)  1(2)  7(10)  1(2)  8(12)  4(6)  11(16)
Crp Novel (d)  0(0)  16(20)  0(0)  16(18)  6(7)  18(21)
PurR Known (d)  1(1)  9(9)  1(1)  11(11)  9(9)  12(12)
PurR Novel (d)  0(0)  4(5)  0(0)  4(5)  3(4)  6(7)
This table shows the number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, with the total number of sites predicted in parentheses. Column C1 is for a scan of the full set of E. coli intergenic sequence data (excluding the S. typhi sequence data and the sequence data from the other, independent clades). Column C3 is for a scan of only that E. coli sequence that is alignable with S. typhi; the S. typhi sequence data continue to be excluded. Column C5 is for a scan of the aligned E. coli-S. typhi sequence data. Columns C2, C4, and C6 are like Columns C1, C3, and C5, respectively, but the sequence data from the independent clades are also incorporated. Observing the lack of improvement of Column C3 over Column C1 (or the meager improvement of C4 over C2), we conclude that there is minimal gain in sensitivity from considering only E. coli sequence that is alignable with S. typhi, when not actually using the aligned S. typhi sequence data. Observing the modest improvement of C5 over C3 (or C6 over C4), we conclude that incorporating the aligned S. typhi sequence gives a moderate gain in sensitivity. Observing the large improvement of C2 over C1 (or C4 over C3, or C6 over C5), we conclude that incorporating the data from species that are not alignable with E. coli gives a significant gain in sensitivity.
Notes:
(a) Database of 2379 intergenic sequences from E. coli [see Additional file 2].
(b) Database of E. coli sequences (reduced search space) extracted from the E. coli-S. typhi database (see Real Sequence Data in Results).
(c) Database of E. coli-S. typhi aligned intergenic sequences (see Real Sequence Data in Results).
(d) The number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, with the total number of binding sites detected in parentheses. "Known" rows count sites that correspond to known, experimentally verified transcription factor binding sites; "Novel" rows count sites that are not yet verified.
Figure 4  Crp-Significant Intergenic Regions Found. When counting Crp-significant intergenic regions, comparison of the bars labeled "+" (with the unalignable sequences) relative to those labeled "-" (without the unalignable sequences) indicates that the largest gain in sensitivity comes from the use of unalignable, evolutionarily distant sequences. The left part of this figure shows the sensitivity for the scan of E. coli data only. The center part of this figure shows the sensitivity from the scan of only those E. coli sequence data that are alignable with S. typhi. The right part of this figure shows the sensitivity from the scan of E. coli-S. typhi aligned sequence data.
Figure 5  PurR-Significant Intergenic Regions Found. The results for PurR are similar to those for Crp. See the caption of Figure 4.
Furthermore, in each of the tests described above (using the baseline, the reduced-database, or the aligned sequence data) we can incorporate non-alignable orthologous sequence data to measure the impact of these additional data on sensitivity. Thus, to determine the extent to which additional, more distantly related, species could provide evidence to support a particular candidate transcription factor binding site upstream of a particular gene in the target species, we used PhyloScan to scan the orthologous intergenic regions for that candidate gene from the additional species (clades), assuming phylogenetic independence between clades. The p-value representing the combined evidence supporting a transcription factor binding site prediction was then calculated using the method of Bailey and Gribskov [32], as described in the Methods.
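The exact procedure is given in the Methods. As a minimal sketch, assuming the combination is the product-of-p-values statistic associated with Bailey and Gribskov, the combined p-value for n independent clades is the probability that the product of n independent uniform(0,1) variables is at most the observed product P, namely P multiplied by the sum of (-ln P)^k / k! for k from 0 to n-1. The function name and the per-clade p-values below are hypothetical.

```python
import math

def combine_p_values(p_values):
    """Combined p-value for independent tests via the product statistic.

    Returns P(U_1 * ... * U_n <= product of p_values) for independent
    Uniform(0,1) variables U_i. Every p-value must be in (0, 1].
    """
    n = len(p_values)
    # x = -ln(product); summing logs avoids underflow of the raw product
    x = -sum(math.log(p) for p in p_values)
    # Accumulate sum_{k=0}^{n-1} x^k / k! term by term
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return math.exp(-x) * total

# Hypothetical p-values for one candidate intergenic region,
# one value per phylogenetically independent clade
per_clade = [0.01, 0.04, 0.2]
print(combine_p_values(per_clade))
```

A single p-value is returned unchanged, and several individually modest p-values can combine into stronger evidence, which is the behavior the independent-clade scan relies on.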
To demonstrate this approach with the E. coli Crp and PurR examples, we employed orthologous data from the five additional gamma-proteobacterial species listed above. We used PhyloScan to identify potential Crp and PurR transcription factor binding sites in the E. coli-only and E. coli-S. typhi aligned data sets, using a P_intergenic ≤ 0.05 cutoff to select candidate intergenic regions for examination in the other five species. As summarized in Table 1, depicted in Figures 4 and 5, and described below, we observed a considerable increase in the number of predicted transcription factor binding sites at the q-value ≤ 0.001 level, when the evidence from the five additional gamma-proteobacterial species was included by combining p-values.
For example, PhyloScan identified a total of 10 Crp-significant intergenic regions in the E. coli-S. typhi aligned data, but after combination of the evidence from the remaining five species, a total of 29 Crp-significant intergenic regions were predicted, a near tripling. Compared to a simple search of the raw E. coli intergenic sequences (one Crp-significant intergenic region), this represents a tremendous increase in sensitivity. The results with the PurR model were also dramatic: the use of data from S. typhi, Y. pestis, H. influenzae, and V. cholerae provided a 50% increase in the number of PurR-significant intergenic regions (to 18 from 12), compared to the scanning of E. coli-S. typhi aligned intergenic sequences only. In the E. coli sequence alone there was only a single PurR-significant intergenic region. Tables listing the located sites for Crp [see Additional file 3] and PurR [see Additional file 4], together with captions for these tables [see Additional file 1], are provided in the Supplementary Materials.
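The totals quoted above follow directly from Table 1 by adding the "Known" and "Novel" rows in each column. The short check below transcribes the (regions, sites) pairs from the table; the variable and function names are illustrative only.

```python
# (regions, sites) for columns C1..C6, transcribed from Table 1
table1 = {
    "Crp": {
        "known": [(1, 2), (7, 10), (1, 2), (8, 12), (4, 6), (11, 16)],
        "novel": [(0, 0), (16, 20), (0, 0), (16, 18), (6, 7), (18, 21)],
    },
    "PurR": {
        "known": [(1, 1), (9, 9), (1, 1), (11, 11), (9, 9), (12, 12)],
        "novel": [(0, 0), (4, 5), (0, 0), (4, 5), (3, 4), (6, 7)],
    },
}

def column_totals(factor):
    """Known + Novel (regions, sites) for each of the columns C1..C6."""
    rows = table1[factor]
    return [(k[0] + n[0], k[1] + n[1])
            for k, n in zip(rows["known"], rows["novel"])]

crp_totals = column_totals("Crp")
purr_totals = column_totals("PurR")

for name, totals in (("Crp", crp_totals), ("PurR", purr_totals)):
    c5_regions, c6_regions = totals[4][0], totals[5][0]
    print(f"{name}: C5 = {c5_regions} regions, C6 = {c6_regions} regions, "
          f"ratio = {c6_regions / c5_regions:.2f}")
```

For Crp the ratio of C6 to C5 is 29/10, the near tripling noted in the text; for PurR it is 18/12, the 50% increase.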
We also examined the best 20 reported intergenic regions for each of the six approaches shown in Table 1. We see several differences, not only in the reported q-values, but also in the order and appearance of predicted binding sites in intergenic regions; see the caption of Table 2 for more details.
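The comparison of each approach against Column C6 amounts to a set difference over gene names. The sketch below transcribes the six top-20 gene lists from Table 2 and counts, for each column, the genes that do not appear in the C6 list; the variable names are illustrative only.

```python
# Top-20 genes per column, in rank order, transcribed from Table 2
top20 = {
    "C1": "yibI yqcE b1904 fucA deaD yjiY cdd yeaA yhcR ycdZ "
          "b2736 uxaC ysgA glpT mglB pckA serA aer adhE mlc",
    "C2": "cdd glpT mglB yibI yjiY hemC deaD ysgA yhcR yqcE "
          "adhE ycdZ yeaA mlc b1904 fucA b2736 pckA aer yjeG",
    "C3": "mtlA ygcW yjcB yjiY b2146 fucA deaD cdd gapA qseA "
          "ycdZ mglB udp uxaC glpA pckA malE aer serA adhE",
    "C4": "mtlA cdd glpA mglB gapA udp yjiY cyaA deaD malE "
          "ygcW adhE ycdZ mlc fucA yjcB pckA aer qseA uxaC",
    "C5": "mtlA yjcB gcd b2146 fucA ygcW flhD gapA ycdZ udp "
          "b2248 glpA mglB qseA pckA adhE aer cdd deaD uxaC",
    "C6": "mtlA glpA cdd mglB udp gapA yjcB cyaA malE ycdZ "
          "adhE b2146 fucA pckA aer ygcW gcd deaD serA mlc",
}

gene_sets = {col: set(genes.split()) for col, genes in top20.items()}

# Number of genes in each column's top 20 that are absent from the C6 top 20
diffs_from_c6 = {col: len(genes - gene_sets["C6"])
                 for col, genes in gene_sets.items()}
print(diffs_from_c6)
```

The counts reproduce the bottom row of Table 2 (10, 11, 3, 3, 4, 0). They capture only membership in the top 20, not the changes in rank order or q-value that the caption also discusses.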
Table 2  Top 20 Predictions by PhyloScan
Column  C1  C2  C3  C4  C5  C6
E. coli Sequence  Full (a)  Full (a)  Reduced (b)  Reduced (b)  Reduced & Aligned (c)  Reduced & Aligned (c)
Indep. Species  No  Yes  No  Yes  No  Yes
Rank  C1 Gene  C1 log(q)  C2 Gene  C2 log(q)  C3 Gene  C3 log(q)  C4 Gene  C4 log(q)  C5 Gene  C5 log(q)  C6 Gene  C6 log(q)
1  yibI  -4.65  cdd  -9.28  mtlA  -5.14  mtlA  -9.76  mtlA  -7.66  mtlA  -12.15
2  yqcE  -2.86  glpT  -7.21  ygcW  -2.89  cdd  -9.60  yjcB  -4.55  glpA  -9.19
3  b1904  -2.61  mglB  -6.01  yjcB  -2.62  glpA  -8.31  gcd  -3.99  cdd  -9.16
4  fucA  -2.51  yibI  -5.26  yjiY  -2.60  mglB  -6.53  b2146  -3.97  mglB  -7.60
5  deaD  -2.51  yjiY  -4.57  b2146  -2.53  gapA  -5.21  fucA  -3.93  udp  -6.26
6  yjiY  -2.42  hemC  -4.38  fucA  -2.51  udp  -5.17  ygcW  -3.42  gapA  -6.02
7  cdd  -2.29  deaD  -4.35  deaD  -2.47  yjiY  -4.79  flhD  -3.03  yjcB  -5.09
8  yeaA  -2.22  ysgA  -4.33  cdd  -2.31  cyaA  -4.70  gapA  -3.03  cyaA  -5.04
9  yhcR  -2.06  yhcR  -3.99  gapA  -2.22  deaD  -4.37  ycdZ  -3.01  malE  -4.83
10  ycdZ  -1.96  yqcE  -3.56  qseA  -2.03  malE  -4.29  udp  -2.78  ycdZ  -4.69
11  b2736  -1.87  adhE  -3.47  ycdZ  -1.98  ygcW  -3.63  b2248  -2.76  adhE  -4.56
12  uxaC  -1.81  ycdZ  -3.45  mglB  -1.90  adhE  -3.58  glpA  -2.76  b2146  -4.53
13  ysgA  -1.77  yeaA  -3.44  udp  -1.86  ycdZ  -3.52  mglB  -2.73  fucA  -4.46
14  glpT  -1.75  mlc  -3.37  uxaC  -1.85  mlc  -3.48  qseA  -2.68  pckA  -4.09
15  mglB  -1.63  b1904  -3.31  glpA  -1.84  fucA  -3.32  pckA  -2.36  aer  -3.97
16  pckA  -1.39  fucA  -3.23  pckA  -1.45  yjcB  -3.32  adhE  -2.14  ygcW  -3.78
17  serA  -1.23  b2736  -3.18  malE  -1.36  pckA  -3.23  aer  -2.13  gcd  -3.67
18  aer  -1.23  pckA  -3.17  aer  -1.32  aer  -3.17  cdd  -2.10  deaD  -3.65
19  adhE  -1.22  aer  -3.08  serA  -1.32  qseA  -3.07  deaD  -2.04  serA  -3.62
20  mlc  -1.01  yjeG  -3.05  adhE  -1.28  uxaC  -3.07  uxaC  -2.02  mlc  -3.62
# Diffs from C6  10  11  3  3  4  0
Because it is sometimes instructive to examine a fixed number of top hits regardless of the reported q-values, in this table we compare the six approaches' best 20 intergenic regions for Crp. By comparing each column to Column C6, which is the best approach we employed, we see that the C1-C5 approaches give significantly different q-values for, and orderings of, the predicted regulated genes. As indicated in the bottom row, the C1-C5 approaches miss several of the top-20 genes reported in C6, replacing them with genes that did not make the C6 top-20 list. In particular, although it uses all of the sequence data except S. typhi, C2 is significantly different from C6. Furthermore, although C3 has few differences from C6 in the set of genes indicated, the q-values of C3 are considerably worse and the gene order is substantially rearranged. These data suggest that the ability to simultaneously handle both aligned and unaligned data is important in obtaining accurate predictions.
Notes: (a), (b), (c) See the caption notes for Table 1. Also see the Table 1 caption for descriptions of Columns C1-C6.
It is worth noting here that the non-alignable species were selected for combination of p-values based upon the presence or absence of the transcription factor under study. All gamma-proteobacteria used in this study encode orthologs to Crp; hence, data for all species were included when p-values were combined from scans with the Crp motif. In contrast, because S. oneidensis and P. aeruginosa do not encode PurR orthologs, these species were not considered when we scanned for PurR binding sites.