Background

Alteration of the frequency of transcription from DNA to messenger RNA is the primary means by which an organism controls gene expression. Transcription initiation is controlled primarily through the binding of transcription factors (proteins) to cognate sites on a chromosome (transcription factor binding sites). For a given transcription factor and an experimentally identified set of transcription factor binding sites, or a set of co-regulated promoters, computational methods can be applied to identify the DNA sequence pattern that is recognized by the transcription factor. Such a sequence pattern is commonly referred to as a motif. A motif is a conceptual extension of a single sequence in which each position is characterized not by a single nucleotide, but rather by a column vector representing the probability with which each of the four nucleotides contributes to the pattern at that position.

The prediction of additional transcription factor binding sites by comparison of a motif to the promoter regions of an entire genome is a vexing problem, due to the large database size (approximately half a million intergenic base pairs for a typical prokaryote, and several hundred million base pairs for a mammal) and the relatively small width of a typical transcription factor binding site (6–30 bp). In such a large search space, chance alone results in the identification of many sites that match the motif. The problem is further compounded by variability among the binding sites recognized by a transcription factor; such variability permits differences in the level of regulation, due to altered intrinsic affinities for the transcription factor [1]. Programs that use a motif to search (i.e., scan) a sequence database for matches (i.e., predicted transcription factor binding sites) fall into two general categories.
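The motif representation described above can be made concrete with a minimal sketch. The binding sites and pseudocount value below are hypothetical, chosen only to illustrate how a set of aligned sites yields one probability vector per position:

```python
# Toy, hypothetical set of aligned binding sites (illustrative only).
sites = ["TGACTC", "TGACTA", "TGAGTC", "TTACTC"]

def build_motif(sites, pseudocount=0.5):
    """Build a probabilistic motif: one probability vector per position.

    A small pseudocount avoids assigning zero probability to nucleotides
    that happen not to appear in the (finite) training set.
    """
    width = len(sites[0])
    motif = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        motif.append({base: counts[base] / total for base in "ACGT"})
    return motif

motif = build_motif(sites)
# Each position is now a column vector of nucleotide probabilities,
# e.g. position 0 is dominated by T (4 of 4 sites, plus pseudocounts).
for col in motif:
    assert abs(sum(col.values()) - 1.0) < 1e-9
```

In practice such a motif is typically converted to log-odds scores against a background nucleotide composition before scanning.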
One approach is to employ a training set of transcription factor binding sites and a scoring scheme to evaluate predictions [2-8]. The scoring scheme is often based on information theory [9], and the training set is used to determine empirically a score threshold for reporting the predicted transcription factor binding sites. The other approach relies on a rigorous statistical analysis of the predictions, based upon modeled assumptions. Briefly, the statistical significance of a sequence match to a motif can be assessed through the determination of the type I error (p-value): the probability of observing a match with a score as good or better in a randomly generated search space of identical size and nucleotide composition. The smaller the p-value, the less likely that the match is due to chance alone. Staden [10] presented an efficient method that exactly calculates this probability, and Neuwald et al. [11] described an implementation of it.

When either type of method is used to scan an entire genome, or the promoter regions of a genome, there is a difficult trade-off between sensitivity and specificity. If the threshold for a prediction (sites above a chosen information-measure cutoff, or below a chosen p-value level) is chosen so as to reflect a reasonably low false positive rate (i.e., high specificity), it is frequently difficult to recover many of the known transcription factor binding sites that were used in the construction of the motif. Conversely, the choice of a threshold that finds many of the known transcription factor binding sites (i.e., high sensitivity) invariably leads to an overwhelming number of additional predicted sites, most of which are likely false positives. (Generally, we do not know where a transcription factor might bind in a way that does not affect transcription; thus, in this latter case, the functional interpretation of these "false positives" is somewhat subtle.)
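The exact p-value calculation of the kind Staden described can be sketched as follows. This is not the published implementation; it is a minimal illustration assuming integer-discretized position scores and an i.i.d. background, in which the null distribution of total match scores is the convolution of the per-position score distributions:

```python
from collections import defaultdict

# Assumed background nucleotide composition (illustrative).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def score_distribution(position_scores, background):
    """Exact null distribution of total motif scores.

    position_scores: one dict per motif position mapping base -> integer score.
    Convolves the per-position distributions one column at a time.
    """
    dist = {0: 1.0}  # probability mass over total scores so far
    for col in position_scores:
        new = defaultdict(float)
        for total, p in dist.items():
            for base, s in col.items():
                new[total + s] += p * background[base]
        dist = dict(new)
    return dist

def p_value(dist, threshold):
    # P(score >= threshold) under the null model: a match this good or
    # better arising by chance at a single random position.
    return sum(p for s, p in dist.items() if s >= threshold)

# Toy two-position motif with integer log-odds-like scores (hypothetical).
cols = [{"A": 2, "C": -1, "G": -1, "T": 0},
        {"A": -1, "C": 2, "G": 0, "T": -1}]
dist = score_distribution(cols, background)
# The maximal score (4) requires "AC", so its p-value is 0.25 * 0.25 = 0.0625.
```

Because the convolution is over discretized scores, the distribution stays compact even for wide motifs, which is what makes the exact calculation efficient.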
The goal of the present study has been to increase the statistical power of scanning a genome sequence database with a regulatory motif, by taking advantage of additional sequence data from related species and from multiple sites within an intergenic region. We have extended Staden's method [10] to allow scanning of orthologous sequence data that are multiply aligned, unaligned, or a combination of the two. Our new algorithm, PhyloScan, statistically accounts for the phylogenetic dependence of the species contributing data to the alignment and calculates a p-value for the sequence match in the aligned data set.

This approach is similar to the MONKEY method [12]; however, there are several key differences between the two. MONKEY requires that all sequences be multiply aligned, a requirement that is too restrictive for many transcription factors of interest that are conserved across a broad phylogenetic range. That is, there are many cases in which distantly related species contain orthologous transcription factors and binding sites, even though general sequence alignments are not feasible (e.g., between eubacteria and archaea [13-15]). Thus, we have developed a scanning approach that will find sites in mixed data that can include one or more clades of sequences (each of which can be aligned reliably) as well as sequences that cannot be aligned reliably to any other sequence. Furthermore, regulatory modules often include multiple sites, none of which alone would be statistically significant in a genome-scale scan; our procedure addresses this important case. In addition, our procedure permits use of a wide range of nucleotide substitution models, and it reports q-values [16], the fraction of intergenic regions of a given strength or better that are expected to be false, whereas MONKEY reports p-values, the fraction of false sites expected to show a given strength or better.
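The distinction between p-values and q-values can be illustrated with a sketch. This is not PhyloScan's procedure (which uses Storey's q-values [16]); it is a simpler Benjamini-Hochberg-style calculation on a hypothetical list of p-values, showing how each test's q-value estimates the false discovery rate among all results at least that strong:

```python
# Minimal Benjamini-Hochberg-style q-value sketch (illustrative only;
# Storey's method, used by PhyloScan, additionally estimates the
# proportion of true nulls and is less conservative).
def q_values(p_values):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the least to the most significant p-value, enforcing
    # monotonicity: a stronger result never gets a larger q-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q

# Hypothetical per-region p-values from a scan of five intergenic regions.
ps = [0.001, 0.009, 0.04, 0.3, 0.8]
qs = q_values(ps)
# qs[2] ~ 0.067: among the three regions scoring at least this well,
# roughly 6.7% are expected to be false discoveries.
```

The practical payoff is that a q-value threshold controls the expected fraction of false calls among the reported regions, which is usually the quantity of interest in a genome-scale scan.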