Synthetic sequence data
While of interest for comparison with previous studies, this set of species is not representative of the problem of incorporating phylogeny into scanning methods. Furthermore, evaluation of scanning algorithms using real sequence data is difficult, because of the presence of transcription factor binding sites that are likely real, but unreported. That is, because they have not yet been experimentally verified, some predicted sites reported as false positives may, in fact, be true positives. Thus, we generated synthetic data in which we controlled the binding site content. Specifically, as a typical example, we generated four sets of sequence data modeled on the phylogenetic relationship of fourteen prokaryotic species: seven Enterobacteriales (E. coli, S. typhi, Klebsiella pneumoniae, Salmonella bongori, Citrobacter rodentium, Shigella flexneri, & Proteus mirabilis), four Vibrionales (Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus, & Vibrio fischeri), and three Pasteurellales (Haemophilus influenzae, Haemophilus somnus, & Haemophilus ducreyi).
The first synthetic data set consists of 140,000 simulated intergenic regions representing the orthologous promoter regions of 10,000 genes from the fourteen species, where each sequence is of length 500 bp, with two planted Crp sites, generated from the Crp motif model (Figure 1A). The second data set is the same but with "1/2-strength Crp" sites, where the average number of bits of information across the positions of a Crp motif is cut in half. The third data set contains "1/3-strength Crp" sites. The fourth data set is a negative control and contains no planted transcription factor binding sites. See the Methods and Figure 1 for more information.
Figure 1  Crp Binding Site Motif and Generation of Weaker Versions. The logo in panel A indicates the Crp motif used to scan for Crp binding sites. It is also used to generate a pair of full-strength Crp sites in the synthetic sequence data. The binding site equilibria were calculated from sequence data aligned by the Gibbs Recursive Sampler [49], and were plotted using publicly available software [27]. The logo in panel B indicates the motif used to generate 1/2-strength Crp sites. It was generated by raising each probability of a nucleotide to its 0.637th power, with subsequent scaling so that the probabilities of the four nucleotides for any motif column sum to 1.0. The exponent was chosen so that the average information content (i.e., "bits") would be half that value for the full-strength sites. The logo in panel C is the 1/3-strength Crp motif, generated with an exponent of 0.507 so that average information content would be one-third of the full-strength value. With each simulated gene, the sequences were generated respecting the phylogenetic tree shown in Figure 2, using the nucleotide evolution model of Halpern & Bruno (1998) [28] for transcription factor binding sites and the model of Kimura (1980) [29] (with a transition to transversion ratio of 3.0) for background positions, and without the introduction of sequence gaps. The phylogenetic tree was generated from aligned (using MUSCLE [30]) 16S rRNA gene data via PHYLIP [31] and tree branch lengths were scaled up by a factor of 13.5 so that the tree would represent evolution at neutral sequence positions rather than at the somewhat conserved 16S rRNA gene sequence positions. Although the factor of 13.5 reflects our previous experience (unpublished), it is not rigorously chosen; for this and other reasons, although this tree is realistic, it should not be considered definitive.
Figure 2  Phylogenetic Tree of Fourteen Prokaryotes. This tree of fourteen prokaryotes specifies the phylogenetic relationship of the species in our simulated sequence data. The tree is realistic, but approximate. The branch lengths represent the number of substitutions (including subsequent substitutions at a given sequence position) expected for each 10,000 nucleotides not subject to selection pressures. Based upon the distances in the phylogenetic tree we partitioned the fourteen species into four clades, the Vibrionales clade, the Pasteurellales clade, P. mirabilis (by itself), and the remaining Enterobacteriales (henceforth, the Enterobacteriales clade). To evaluate the trade-off between sensitivity and specificity, we ran PhyloScan using the full-strength Crp motif; we scanned the full-strength-Crp-sites sequence data (positive data) and the no-sites sequence data (negative data). Likewise, we ran PhyloScan using the 1/2-strength Crp motif, scanning the 1/2-strength sequence data (positive data) and the no-sites sequence data (negative data); we also ran PhyloScan using the 1/3-strength Crp motif, scanning the 1/3-strength sequence data (positive data) and the no-sites sequence data (negative data).
Additionally, we ran PhyloScan with some of its features disabled. In three pairs of runs, one for each motif strength, as above, we ran PhyloScan on the four clades of sequence data, but by disabling its Neuwald-Green calculation (see Methods) we did not permit PhyloScan to statistically incorporate any sites other than the best found binding site in each intergenic region. In another three pairs of runs we ran PhyloScan, permitting it to consider multiple sites within an intergenic region, but by disabling its Bailey-Gribskov calculation (see Methods) PhyloScan could not consider more than one clade, and we gave it only the sequence data from the Enterobacteriales clade. Finally, we ran MONKEY (which incorporates neither the Neuwald-Green nor the Bailey-Gribskov calculation) on the Enterobacteriales clade sequence data, in a final three pairs of runs.
Each of these twelve pairs of runs – four algorithms times three motif strengths – produced p-values for each of 10,000 synthetic orthologous intergenic regions with sites and for each of 10,000 synthetic orthologous intergenic regions without sites. When any of the algorithms is used, it is desirable to set a p-value cutoff so that, in the positive data, the number of intergenic regions that have values below this cutoff is large and, in the negative data, the number of the intergenic regions that have values below the cutoff is small. Because the relative importances of the former (sensitivity) and the latter (type I error) depend upon the particular experiment and the parameters of that experiment, it is common to plot a Receiver Operating Characteristic (ROC) curve of sensitivity vs. type I error, to show what is achievable from differing cutoff levels.
Figure 3 shows the ROC curves for nine of the twelve cases; for our synthetic sequence data, the disabling of the Neuwald-Green calculation had negligible effect, and these three ROC curves are omitted. In all cases the disabling of both the Neuwald-Green and Bailey-Gribskov calculations significantly affected performance. (See Figure 3 and its legend for more information.)
Figure 3  ROC Curves for PhyloScan and MONKEY. Shown are Receiver Operating Characteristic (ROC) curves for algorithms applied to intergenic regions containing a pair of full-strength Crp sites, a pair of 1/2-strength sites, and a pair of 1/3-strength sites. The simulated sequence data is for fourteen prokaryotic species organized into four clades; the orthologous intergenic sequences are 500 bp and are multiply-aligned within each clade but not between clades. ROC curves are shown for fully enabled PhyloScan and MONKEY. Additionally, ROC curves for PhyloScan applied to only the Enterobacteriales clade are shown. The ROC curves for PhyloScan with its multiple-clades capability enabled but its multiple-sites capability disabled are not shown because they are nearly indistinguishable from the fully enabled PhyloScan. A comparison of the "PhyloScan (1 clade)" curves to the "MONKEY (1 clade)" curves shows that there is value in combining evidence from multiple sites within an intergenic region using the Neuwald-Green calculation. A comparison of the "PhyloScan (4 clades)" curves to the "PhyloScan (1 clade)" curves indicates that there is additional value in considering data from multiple clades. For instance, if p-value cutoffs are chosen so that type I error is 0.1% (i.e., the specificity is 99.9%) then PhyloScan correctly classifies 99.85% of the full-strength-Crp intergenic regions, 72.68% of the 1/2-strength regions, and 32.64% of the 1/3-strength regions. The corresponding numbers for "PhyloScan (1 clade)" are 96.98%, 33.01%, and 10.11%. The corresponding numbers for MONKEY are 79.02%, 21.66%, and 6.33%. It is possible that sensitivities for the four-clades curves would have been even stronger if we had not prohibited the non-Enterobacteriales clades from rescuing intergenic regions in the Enterobacteriales clade that had failed to pass our 0.05 p-value cutoff.