Discussion

Key features of PhyloScan
We are able to increase the flexibility and sensitivity of scanning, without increasing the false positive rate, by incorporating the following three key features into PhyloScan:
1. We allow a mixture of alignable and unalignable sequence data. Specifically, sequences that can be reliably multiply aligned should be grouped and aligned. These clades of multiply-aligned sequences, including each "degenerate clade" of one sequence that cannot be reliably aligned with any other sequence, are used by PhyloScan. A phylogenetic tree relating the sequences within a clade, a user-specified nucleotide substitution model, and an extension to Staden's precise p-value calculation that is phylogenetically aware are all employed by PhyloScan to increase the statistical power of Staden's original method. (See Methods.)
2. We combine evidence from multiple sites within an intergenic region to produce better sensitivity than could be achieved by simply examining the strongest site within an intergenic region. Specifically, a group of weak sites, none of which is statistically significant in isolation, is detected by the fact that for some rank i, the ith strongest of the sites is surprisingly strong given that it is only the ith strongest. (See Methods.)
3. We report our findings in terms of q-values [16] instead of p-values. For each intergenic region we report the probability that a region of its significance or better will be a false prediction, instead of reporting the probability that a negative control will appear at this significance or better.

Applicability of PhyloScan
The test cases described here reflect our past and present research interests in proteobacterial gene regulation, while simultaneously emphasizing PhyloScan's ability to handle multiple weak binding sites as well as mixed aligned and unaligned sequence data. However, the features of our data set are not unique; there are many examples where multiple binding sites are common (e.g., flies [33] and humans [34]) or where transcription factors and their cognate binding sites are conserved across diverse species for which multiple sequence alignments are not feasible (e.g., between eubacteria and archaea [13-15]). PhyloScan will have clear advantages in such contexts. It is important to note, though, that in situations where orthologous regions are usually alignable and the multiple-weak-sites scenario is unlikely, PhyloScan will not perform better than existing approaches such as MONKEY. Conversely, in cases where sequences cannot be aligned, PhyloScan will not perform better than existing approaches that handle "independent species."
Here we have demonstrated significant improvement of scan results through the use of sequences from evolutionarily distant species that have orthologous transcription factors. This is not unexpected, given results of a more theoretical nature that quantify the extent of such improvement [35].

PhyloScan evaluates significance at the level of the intergenic region
A key focus of this work has been to combine evidence across transcription factor binding sites within an intergenic region and across orthologous regions in order to correctly identify intergenic regions that are likely to contain transcription factor binding sites, even when each of the identified transcription factor binding sites, considered in isolation, may not be sufficiently strong to be statistically significant. Accordingly, the individual sites included in our predictions are not necessarily statistically significant and individual site predictions may be false positives even within true-positive intergenic sequences.
For instance, in the collection of 10,000 synthetic data sets in which we planted two full-strength Crp transcription factor binding sites per intergenic region, we have 9,985 true positive intergenic regions at the 99.9% specificity level (see Figure 3). Of these true positives, in 6,287 of the E. coli intergenic regions two sites were predicted and the sites exactly coincided with the two planted sites. In 24 E. coli intergenic regions two sites were predicted and one of the two sites exactly coincided with a planted site. In 3,672 of these regions one site was predicted and it exactly coincided with one of the two planted sites, and in 2 of the E. coli intergenic regions, one site was predicted that did not exactly coincide with a planted site.
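The breakdown above can be checked with a short tally. This is only an illustrative sketch: the counts are copied from the paragraph, while the dictionary keys and variable names are our own labels and are not part of PhyloScan.

```python
# Site-level outcomes among the true-positive E. coli intergenic regions,
# with counts transcribed from the paragraph above. Keys are illustrative labels.
outcomes = {
    "two sites predicted, both coincide with planted sites": 6287,
    "two sites predicted, one coincides with a planted site": 24,
    "one site predicted, coincides with a planted site": 3672,
    "one site predicted, does not coincide with a planted site": 2,
}

# The four categories should account for every true-positive region.
total_true_positive_regions = sum(outcomes.values())

# Regions in which at least one predicted site exactly matches a planted site.
regions_with_exact_match = total_true_positive_regions - outcomes[
    "one site predicted, does not coincide with a planted site"
]

for label, count in outcomes.items():
    share = 100.0 * count / total_true_positive_regions
    print(f"{label}: {count} ({share:.2f}%)")
print(f"total: {total_true_positive_regions}")
```

The four categories sum to the 9,985 true-positive regions, and all but 2 of those regions contain at least one predicted site that exactly coincides with a planted site.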

Key user-selectable parameters in PhyloScan

Focus on a target species or clade
In running PhyloScan, the user must specify two cutoff values, and can optionally specify additional parameters describing the expected multiplicity of binding sites upstream of a regulated gene. The first cutoff is a p-value cutoff, calculated on a per intergenic-sequence basis for the clade that includes the species of primary interest. We chose a default value of 0.05, so that weak intergenic regions in the target species' clade will not be considered, even when strong intergenic regions are located in orthologous regions in more-distantly related species. The choice of a larger value would reduce the focus on the target species, allowing strong sites in other species to rescue weak sites in the target species. The choice of a smaller value would increase the focus on the target species; the choice of a very small value would effectively cancel out the information available from the related species, since any intergenic region that looks extremely promising in the target species will almost surely continue to look promising when additional data are included.

Quality of reported sites
The second cutoff that our approach requires is the q-value cutoff that specifies which sites will be reported. We chose a default value of 0.001, meaning that according to our model, at most 0.1% of the intergenic sequences that we report as binding the transcription factor are chance false positives. While we have incorporated a fairly accurate phylogenetic model, we have not incorporated into this model such effects as the non-independence of the positions in a site (e.g., the effect of di- or tri-nucleotide energy terms, also known as stacking energies), nor effects from the cooperative binding of multiple transcription factors on the ability of a factor to bind to a DNA site. Because our model does not capture these and other features, the actual rate of false positives is likely to be higher than 0.1%.
On the other hand, in calculating the q-value, we have assumed that the vast majority of intergenic sequences in a genome will likely not contain a transcription factor binding site for the particular transcription factor under study, i.e., we are looking for rare events. Under this assumption, the proportion of all intergenic sequences that are truly null will approach 1.0 in Storey and Tibshirani's q-value calculation (the $\hat{\pi}_0$ term of [16]), and so does not appear in our q-value equation (see Methods). In a case where this assumption does not hold, the q-values provided by our approach will be overly conservative.
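To make the role of the null proportion concrete, the following minimal sketch computes q-values from a list of per-region p-values using the standard form of the calculation in [16], with the null proportion exposed as a parameter that defaults to 1.0. It is an illustration under that assumption and is not the equation given in Methods; the function name q_values and the parameter name pi0 are ours.

```python
from typing import List


def q_values(p_values: List[float], pi0: float = 1.0) -> List[float]:
    """Storey-Tibshirani style q-values with the null proportion fixed at pi0.

    pi0 = 1.0 mirrors the rare-event assumption in the text (hypothetical
    helper for illustration; not PhyloScan's own implementation).
    """
    m = len(p_values)
    # Indices of the p-values, from most to least significant.
    order = sorted(range(m), key=lambda k: p_values[k])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each q-value is the smallest
    # estimated false discovery rate over all thresholds at least as lenient.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        estimated_fdr = pi0 * p_values[idx] * m / rank
        running_min = min(running_min, estimated_fdr)
        q[idx] = running_min
    return q


example_p = [0.0001, 0.5, 0.02, 0.9]
q_rare = q_values(example_p)              # null proportion taken as 1.0
q_half = q_values(example_p, pi0=0.5)     # half of the regions truly null
```

Because the estimate scales with pi0, fixing it at 1.0 can only raise the q-values relative to a smaller true null proportion, which is the sense in which the reported q-values become conservative when the rare-event assumption fails.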
Note that the scan technology, first described by Staden [10] and employed here, is a frequentist hypothesis testing approach. A Bayesian approach presents an alternative through the use of Bayesian posterior probabilities for each site. Such an approach would require the specification of a model from which alternative sequences are drawn, as well as a model for null sequences. When a large number of observations are available, the approach of Efron et al. [36] provides a compromise that yields local false discovery rates through the use of empirical Bayesian methods.

The number of sites per intergenic region
The number of potential sites to consider in each intergenic region, and their respective weights, are additional parameters that can be set by the user to best capture the underlying biology in the system under study. Generally speaking, for i ≥ 1, the algorithm detects that an intergenic region with sites is significant when its ith best site is surprisingly strong given its rank as the ith best site. The weight w_i should be chosen in proportion to the number of such intergenic regions that are expected to have i as the first (lowest) rank that appears strong by this test. We have set the default weights to (w_1, w_2) = (0.9, 0.1) under the assumption that approximately 90% of intergenic regions with sites will have a strong site; among the remaining intergenic regions with sites, nearly all will have a site that is surprisingly strong given its rank as second strongest. (See Methods.)
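One way to picture the rank-based test is the sketch below. It rests on assumptions that are ours and not taken from Methods: candidate positions within a region are treated as independent under the null, the significance of the ith best site is taken as the binomial probability that at least i of the n positions score that well, and the per-rank results are combined by a weighted Bonferroni minimum. The names rank_p_value and region_p_value are hypothetical.

```python
from math import comb
from typing import Sequence


def rank_p_value(p: float, i: int, n: int) -> float:
    """P(at least i of n independent null positions have a p-value <= p).

    Computed as one minus the lower binomial tail, which has only i terms.
    (Loses relative precision when the result is extremely small.)
    """
    lower_tail = sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(i))
    return max(0.0, 1.0 - lower_tail)


def region_p_value(
    site_p: Sequence[float],
    n: int,
    weights: Sequence[float] = (0.9, 0.1),
) -> float:
    """Weighted-Bonferroni combination over ranks (assumed rule, for illustration).

    site_p  : per-site p-values of the candidate sites found in the region
    n       : number of candidate positions scanned in the region
    weights : w_1, w_2, ... as described in the text; should sum to at most 1
    """
    best_first = sorted(site_p)
    candidates = []
    for i, w in enumerate(weights, start=1):
        if i <= len(best_first) and w > 0:
            # Is the ith best site surprisingly strong for an ith best site?
            candidates.append(rank_p_value(best_first[i - 1], i, n) / w)
    return min(1.0, min(candidates)) if candidates else 1.0


# Two moderately strong sites among 200 scanned positions (made-up numbers).
example = region_p_value([2e-4, 3e-4], n=200)
```

Dividing each rank's p-value by its weight keeps the overall false positive rate controlled as long as the weights sum to at most 1, while letting the user spend most of the error budget on the rank expected to be informative most often.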

Divergently transcribed genes
The presence of divergently transcribed genes, that is, the circumstance in which an intergenic region is upstream of, and contains the promoters for, both of a given pair of neighboring genes, is quite common in prokaryotes, and also occurs in eukaryotes, albeit much less frequently. Divergently transcribed genes occur frequently in the E. coli genome (644 pairs of divergently transcribed genes), and their presence has raised the question of which orthologous data should be used when we combine p-values. In the present implementation of PhyloScan, the choice was made randomly. Thus, in such cases, we were as likely to make a "correct" choice as to make an "incorrect" choice, if only one of the E. coli genes flanking an intergenic region containing candidate transcription factor binding sites is regulated by the transcription factor of interest. However, in cases where gene synteny is conserved across several species, this choice becomes irrelevant. That is, when synteny is conserved, the same intergenic regions from each species will be examined regardless of the gene chosen; inspection of the output and, ultimately, experimental validation become necessary in order to evaluate whether a predicted site is associated with the chosen gene, with the divergently transcribed gene, or with both. Implementation of a systematic or informed choice in these situations will be a topic for the future development of PhyloScan.