PMC:1794230 / 7905-30448
Annotations
2_test
{"project":"2_test","denotations":[{"id":"17244358-10390542-1689308","span":{"begin":385,"end":387},"obj":"10390542"},{"id":"17244358-11812853-1689308","span":{"begin":385,"end":387},"obj":"11812853"},{"id":"17244358-11545272-1689308","span":{"begin":385,"end":387},"obj":"11545272"},{"id":"17244358-11115104-1689308","span":{"begin":385,"end":387},"obj":"11115104"},{"id":"17244358-11321589-1689308","span":{"begin":385,"end":387},"obj":"11321589"},{"id":"17244358-11750821-1689308","span":{"begin":385,"end":387},"obj":"11750821"},{"id":"17244358-11827949-1689308","span":{"begin":385,"end":387},"obj":"11827949"},{"id":"17244358-11282972-1689308","span":{"begin":385,"end":387},"obj":"11282972"},{"id":"17244358-11160901-1689308","span":{"begin":385,"end":387},"obj":"11160901"},{"id":"17244358-12368244-1689308","span":{"begin":385,"end":387},"obj":"12368244"},{"id":"17244358-12368244-1689309","span":{"begin":678,"end":680},"obj":"12368244"},{"id":"17244358-12368244-1689310","span":{"begin":1140,"end":1142},"obj":"12368244"},{"id":"17244358-9656490-1689311","span":{"begin":4054,"end":4056},"obj":"9656490"},{"id":"17244358-7463489-1689312","span":{"begin":4129,"end":4131},"obj":"7463489"},{"id":"17244358-15034147-1689313","span":{"begin":4319,"end":4321},"obj":"15034147"},{"id":"17244358-9672829-1689314","span":{"begin":16918,"end":16920},"obj":"9672829"}],"text":"Results\nWe evaluated PhyloScan on both real and synthetic data. For the real data, we chose the Escherichia coli Crp and PurR motifs, and we gathered genome sequence data for several gamma-proteobacteria. We and others have previously demonstrated that a comparative genomic approach is effective in the prediction of transcription factor binding sites within this phylogenetic group [17-26]. Among the species chosen for this study (E. coli, Salmonella enterica serovar Typhi (S. typhi), Yersinia pestis, Haemophilus influenzae, Vibrio cholerae, Shewanella oneidensis, and Pseudomonas aeruginosa), only E. coli and S. 
typhi exhibit sufficient homology in the promoter regions [26]. Thus, we aligned orthologous intergenic regions for these two species, and we combined the statistical evidence from the scanning of the aligned E. coli and S. typhi data with the statistical evidence from the scanning of unaligned orthologous intergenic regions from the remaining five, more distantly related, species. (Approaches in which the S. typhi sequence data is considered independent of the E. coli sequence data were considered in earlier work [26].)\n\nSynthetic sequence data\nWhile of interest for comparison with previous studies, this set of species is not representative of the problem of incorporating phylogeny into scanning methods. Furthermore, evaluation of scanning algorithms using real sequence data is difficult, because of the presence of transcription factor binding sites that are likely real, but unreported. That is, because they have not yet been experimentally verified, some predicted sites reported as false positives may, in fact, be true positives. Thus, we generated synthetic data in which we controlled the binding site content. Specifically, as a typical example, we generated four sets of sequence data modeled on the phylogenetic relationship of fourteen prokaryotic species: seven Enterobacteriales (E. coli, S. typhi, Klebsiella pneumoniae, Salmonella bongori, Citrobacter rodentium, Shigella flexneri, \u0026 Proteus mirabilis), four Vibrionales (Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus, \u0026 Vibrio fischeri), and three Pasteurellales (Haemophilus influenzae, Haemophilus somnus, \u0026 Haemophilus ducreyi).\nThe first synthetic data set consists of 140,000 simulated intergenic regions representing the orthologous promoter regions of 10,000 genes from the fourteen species, where each sequence is of length 500 bp, with two planted Crp sites, generated from the Crp motif model (Figure 1A). 
The second data set is the same but with \"1/2-strength Crp\" sites, where the average number of bits of information across the positions of a Crp motif is cut in half. The third data set contains \"1/3-strength Crp\" sites. The fourth data set is a negative control and contains no planted transcription factor binding sites. See the Methods and Figure 1 for more information.\nFigure 1 Crp Binding Site Motif and Generation of Weaker Versions. The logo in panel A indicates the Crp motif used to scan for Crp binding sites. It is also used to generate a pair of full-strength Crp sites in the synthetic sequence data. The binding site equilibria were calculated from sequence data aligned by the Gibbs Recursive Sampler [49], and were plotted using publicly available software [27]. The logo in panel B indicates the motif used to generate 1/2-strength Crp sites. It was generated by raising each probability of a nucleotide to its 0.637th power, with subsequent scaling so that the probabilities of the four nucleotides for any motif column sum to 1.0. The exponent was chosen so that the average information content (i.e., \"bits\") would be half that value for the full-strength sites. The logo in panel C is the 1/3-strength Crp motif, generated with an exponent of 0.507 so that average information content would be one-third of the full-strength value. With each simulated gene, the sequences were generated respecting the phylogenetic tree shown in Figure 2, using the nucleotide evolution model of Halpern \u0026 Bruno (1998) [28] for transcription factor binding sites and the model of Kimura (1980) [29] (with a transition to transversion ratio of 3.0) for background positions, and without the introduction of sequence gaps. 
The phylogenetic tree was generated from aligned (using MUSCLE [30]) 16S rRNA gene data via PHYLIP [31] and tree branch lengths were scaled up by a factor of 13.5 so that the tree would represent evolution at neutral sequence positions rather than at the somewhat conserved 16S rRNA gene sequence positions. Although the factor of 13.5 reflects our previous experience (unpublished), it is not rigorously chosen; for this and other reasons, although this tree is realistic, it should not be considered definitive.\nFigure 2 Phylogenetic Tree of Fourteen Prokaryotes. This tree of fourteen prokaryotes specifies the phylogenetic relationship of the species in our simulated sequence data. The tree is realistic, but approximate. The branch lengths represent the number of substitutions (including subsequent substitutions at a given sequence position) expected for each 10,000 nucleotides not subject to selection pressures. Based upon the distances in the phylogenetic tree we partitioned the fourteen species into four clades, the Vibrionales clade, the Pasteurellales clade, P. mirabilis (by itself), and the remaining Enterobacteriales (henceforth, the Enterobacteriales clade). To evaluate the trade-off between sensitivity and specificity, we ran PhyloScan using the full-strength Crp motif; we scanned the full-strength-Crp-sites sequence data (positive data) and the no-sites sequence data (negative data). Likewise, we ran PhyloScan using the 1/2-strength Crp motif, scanning the 1/2-strength sequence data (positive data) and the no-sites sequence data (negative data); we also ran PhyloScan using the 1/3-strength Crp motif, scanning the 1/3-strength sequence data (positive data) and the no-sites sequence data (negative data).\nAdditionally, we ran PhyloScan with some of its features disabled. 
In three pairs of runs, one for each motif strength, as above, we ran PhyloScan on the four clades of sequence data, but by disabling its Neuwald-Green calculation (see Methods) we did not permit PhyloScan to statistically incorporate any sites other than the best found binding site in each intergenic region. In another three pairs of runs we ran PhyloScan, permitting it to consider multiple sites within an intergenic region, but by disabling its Bailey-Gribskov calculation (see Methods) PhyloScan could not consider more than one clade, and we gave it only the sequence data from the Enterobacteriales clade. Finally, we ran MONKEY (which incorporates neither the Neuwald-Green nor the Bailey-Gribskov calculation) on the Enterobacteriales clade sequence data, in a final three pairs of runs.\nEach of these twelve pairs of runs – four algorithms times three motif strengths – produced p-values for each of 10,000 synthetic orthologous intergenic regions with sites and for each of 10,000 synthetic orthologous intergenic regions without sites. When any of the algorithms is used, it is desirable to set a p-value cutoff so that, in the positive data, the number of intergenic regions that have values below this cutoff is large and, in the negative data, the number of the intergenic regions that have values below the cutoff is small. Because the relative importances of the former (sensitivity) and the latter (type I error) depend upon the particular experiment and the parameters of that experiment, it is common to plot a Receiver Operating Characteristic (ROC) curve of sensitivity vs. type I error, to show what is achievable from differing cutoff levels.\nFigure 3 shows the ROC curves for nine of the twelve cases; for our synthetic sequence data, the disabling of the Neuwald-Green calculation had negligible effect, and these three ROC curves are omitted. In all cases the disabling of both the Neuwald-Green and Bailey-Gribskov calculations significantly affected performance. 
(See Figure 3 and its legend for more information.)\nFigure 3 ROC Curves for PhyloScan and MONKEY. Shown are Receiver Operating Characteristic (ROC) curves for algorithms applied to intergenic regions containing a pair of full-strength Crp sites, a pair of 1/2-strength sites, and a pair of 1/3-strength sites. The simulated sequence data is for fourteen prokaryotic species organized into four clades; the orthologous intergenic sequences are 500 bp and are multiply-aligned within each clade but not between clades. ROC curves are shown for fully enabled PhyloScan and MONKEY. Additionally, ROC curves for PhyloScan applied to only the Enterobacteriales clade are shown. The ROC curves for PhyloScan with its multiple-clades capability enabled but its multiple-sites capability disabled are not shown because they are nearly indistinguishable from the fully enabled PhyloScan. A comparison of the \"PhyloScan (1 clade)\" curves to the \"MONKEY (1 clade)\" curves shows that there is value in combining evidence from multiple sites within an intergenic region using the Neuwald-Green calculation. A comparison of the \"PhyloScan (4 clades)\" curves to the \"PhyloScan (1 clade)\" curves indicates that there is additional value in considering data from multiple clades. For instance, if p-value cutoffs are chosen so that type I error is 0.1% (i.e., the specificity is 99.9%) then PhyloScan correctly classifies 99.85% of the full-strength-Crp intergenic regions, 72.68% of the 1/2-strength regions, and 32.64% of the 1/3-strength regions. The corresponding numbers for \"PhyloScan (1 clade)\" are 96.98%, 33.01%, and 10.11%. The corresponding numbers for MONKEY are 79.02%, 21.66%, and 6.33%. 
It is possible that sensitivities for the four-clades curves would have been even stronger if we had not prohibited the non-Enterobacteriales clades from rescuing intergenic regions in the Enterobacteriales clade that had failed to pass our 0.05 p-value cutoff.\n\nReal sequence data\nTo evaluate the statistical power provided by different facets of the PhyloScan approach in real sequence data, we measured the increase in sensitivity originating from three sources: a reduction in database size, the use of aligned sequence data only, and the use of non-alignable ortholog data.\nAs a stripped-down baseline, we applied PhyloScan in a scan of the full E. coli sequence database, ignoring all other sequence data; this baseline is equivalent to the original Staden method, and thus has the same statistical power.\nWe compared the baseline to the results achievable from a reduced database. When orthologous sequences are aligned between closely related species, gaps may be introduced, and there are often portions of the sequence that do not align; thus, the overall feasible search space for transcription factor binding sites is reduced. A search of such a reduced database in and of itself will allow the detection of more statistically significant transcription factor binding sites than will a search of a full set of intergenic regions from a single species. Therefore, the scanning results from a database reduced in size, yet containing data from only one species, will provide a measure of the increase in sensitivity to the baseline scan that is due simply to a reduction in search space.\nWe compared the baseline and reduced-database results to those obtained by scanning a database of aligned E. coli-S. typhi sequences, in order to measure the increase in sensitivity provided by the use of this aligned sequence data.\nTo test these sources of statistical power, we generated databases of promoter-containing E. coli intergenic regions, aligned E. coli-S. 
typhi intergenic regions, and motif models based on known Crp and PurR sites (see Methods). Specifically, the three databases contained: (1) the set of all E. coli intergenic regions, (2) the E. coli sequences extracted from the alignments of E. coli-S. typhi orthologous intergenic regions, and (3) the E. coli-S. typhi aligned intergenic regions data. Relative to the original method of Staden, our results show large improvement in the number of predicted transcription factor binding sites due to the alignment of two somewhat closely related species (Table 1 and Figures 4 and 5). Specifically, with a q-value cutoff of 0.001 (see Methods) the scanning of the set of all E. coli intergenic sequences results in only one Crp-significant intergenic region (with two predicted Crp sites), and one PurR-significant intergenic region (with one PurR site). No improvement was obtained in the reduced database of E. coli intergenic sequences. However, when the set of E. coli-S. typhi aligned sequences was scanned, 10 Crp-significant intergenic regions (with 13 Crp sites total), and 12 PurR-significant intergenic regions (with 13 PurR sites total) were predicted.\nTable 1 Summary of PhyloScan Predictions\nC1 C2 C3 C4 C5 C6\nE. coli Sequence Data Fulla Fulla Red.b Red.b Red. \u0026 Alignedc Red. \u0026 Alignedc\nIndep. Species No Yes No Yes No Yes\nCrp Knownd 1(2) 7(10) 1(2) 8(12) 4(6) 11(16)\nCrp Noveld 0(0) 16(20) 0(0) 16(18) 6(7) 18(21)\nPurR Knownd 1(1) 9(9) 1(1) 11(11) 9(9) 12(12)\nPurR Noveld 0(0) 4(5) 0(0) 4(5) 3(4) 6(7)\nThis table shows the number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, with the total number of sites predicted within parentheses. Column C1 is for a scan of the full set of E. coli intergenic sequence data (excluding the S. typhi sequence data and the sequence data from the other, independent clades). Column C3 is for a scan of only that E. coli sequence that is alignable with S. typhi; the S. 
typhi sequence data continue to be excluded. Column C5 is for a scan of the aligned E. coli-S. typhi sequence data. Columns C2, C4, and C6, are like Columns C1, C3, and C5, respectively, but the sequence data from the independent clades are also incorporated. Observing the lack of improvement of Column C3 over Column C1 (or the meager improvement of C4 over C2), we conclude that there is minimal gain in sensitivity from considering only E. coli sequence that is alignable with S. typhi, when not actually using the aligned S. typhi sequence data. Observing the modest improvement of C5 over C3 (or C6 over C4), we conclude that incorporating the aligned S. typhi sequence gives a moderate gain in sensitivity. Observing the large improvement of C2 over C1 (or C4 over C3, or C6 over C5), we conclude that incorporating the data from species that are not alignable with E. coli gives a significant gain in sensitivity. Notes: aDatabase of 2379 intergenic sequences from E. coli [see Additional file 2]. bDatabase of E. coli sequences (reduced search space) extracted from the E. coli-S. typhi database (see Real Sequence Data in Results). cDatabase of E. coli-S. typhi aligned intergenic sequences (see Real Sequence Data in Results). dThe number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, where the total number of binding sites detected is in parentheses and those sites that correspond to known, experimentally verified transcription factor binding sites and those sites that are novel (not yet verified) are indicated.\nFigure 4 Crp-Significant Intergenic Regions Found. When counting Crp-significant intergenic regions, comparison of the bars labeled \"+\" (with the unalignable sequences) relative to those labeled \"-\" (without the unalignable sequences) indicates that the largest gain in sensitivity comes from the use of unalignable, evolutionarily distant sequences. The left part of this figure shows the sensitivity for the scan of E. 
coli data only. The center part of this figure shows the sensitivity from the scan of only those E. coli sequence data that are alignable with S. typhi. The right part of this figure shows the sensitivity from the scan of E. coli-S. typhi aligned sequence data.\nFigure 5 PurR-Significant Intergenic Regions Found. The results for PurR are similar to those for Crp. See the caption of Figure 4. Furthermore, in each of the tests described above (using the baseline, the reduced-database, or the aligned sequence data) we can incorporate non-alignable orthologous sequence data to measure the impact of these additional data on sensitivity. Thus, to determine the extent to which additional, more distantly related, species could provide evidence to support a particular candidate transcription factor binding site upstream of a particular gene in the target species, we used PhyloScan to scan the orthologous intergenic regions for that candidate gene from the additional species (clades), assuming phylogenetic independence between clades. The p-value representing the combined evidence supporting a transcription factor binding site prediction was then calculated using the method of Bailey and Gribskov [32], as described in the Methods.\nTo demonstrate this approach with the E. coli Crp and PurR examples, we employed orthologous data from the five additional gamma-proteobacterial species listed above. We used PhyloScan to identify potential Crp and PurR transcription factor binding sites in the E. coli-only and E. coli-S. typhi aligned data sets, using a Pintergenic ≤ 0.05 cutoff to select candidate intergenic regions for examination in the other five species. 
As summarized in Table 1, depicted in Figures 4 and 5, and described below, we observed a considerable increase in the number of predicted transcription factor binding sites at the q-value ≤ 0.001 level, when the evidence from the five additional gamma-proteobacterial species was included by combining p-values.\nFor example, PhyloScan identified a total of 10 Crp-significant intergenic regions in the E. coli-S. typhi aligned data, but after combination of the evidence from the remaining five species, a total of 29 Crp-significant intergenic regions were predicted, a near tripling. Compared to a simple search of the raw E. coli intergenic sequences (one Crp-significant intergenic region), this represents a tremendous increase in sensitivity. The results with the PurR model were also dramatic: the use of data from S. typhi, Y. pestis, H. influenzae, and V. cholerae provided a 50% increase in the number of PurR-significant intergenic regions (to 18 from 12), compared to the scanning of E. coli-S. typhi aligned intergenic sequences only. In the E. coli sequence alone there was only a single PurR-significant intergenic region. In the Supplementary Materials are tables listing the located sites for Crp [see Additional file 3] and PurR [see Additional file 4], as well as captions for these tables [see Additional file 1].\nWe also examined the best 20 reported intergenic regions for each of the six approaches shown in Table 1. We see several differences, not only in the reported q-values, but also in the order and appearance of predicted binding sites in intergenic regions; see the caption of Table 2 for more details.\nTable 2 Top 20 Predictions by PhyloScan\nC1 C2 C3 C4 C5 C6\nE. coli Sequence Fulla Fulla Reducedb Reducedb Reduced \u0026 Alignedc Reduced \u0026 Alignedc\nIndep. 
Species No Yes No Yes No Yes\nRank Gene log(q) Gene log(q) Gene log(q) Gene log(q) Gene log(q) Gene log(q)\n1 yibI -4.65 cdd -9.28 mtlA -5.14 mtlA -9.76 mtlA -7.66 mtlA -12.15\n2 yqcE -2.86 glpT -7.21 ygcW -2.89 cdd -9.60 yjcB -4.55 glpA -9.19\n3 b1904 -2.61 mglB -6.01 yjcB -2.62 glpA -8.31 gcd -3.99 cdd -9.16\n4 fucA -2.51 yibI -5.26 yjiY -2.60 mglB -6.53 b2146 -3.97 mglB -7.60\n5 deaD -2.51 yjiY -4.57 b2146 -2.53 gapA -5.21 fucA -3.93 udp -6.26\n6 yjiY -2.42 hemC -4.38 fucA -2.51 udp -5.17 ygcW -3.42 gapA -6.02\n7 cdd -2.29 deaD -4.35 deaD -2.47 yjiY -4.79 flhD -3.03 yjcB -5.09\n8 yeaA -2.22 ysgA -4.33 cdd -2.31 cyaA -4.70 gapA -3.03 cyaA -5.04\n9 yhcR -2.06 yhcR -3.99 gapA -2.22 deaD -4.37 ycdZ -3.01 malE -4.83\n10 ycdZ -1.96 yqcE -3.56 qseA -2.03 malE -4.29 udp -2.78 ycdZ -4.69\n11 b2736 -1.87 adhE -3.47 ycdZ -1.98 ygcW -3.63 b2248 -2.76 adhE -4.56\n12 uxaC -1.81 ycdZ -3.45 mglB -1.90 adhE -3.58 glpA -2.76 b2146 -4.53\n13 ysgA -1.77 yeaA -3.44 udp -1.86 ycdZ -3.52 mglB -2.73 fucA -4.46\n14 glpT -1.75 mlc -3.37 uxaC -1.85 mlc -3.48 qseA -2.68 pckA -4.09\n15 mglB -1.63 b1904 -3.31 glpA -1.84 fucA -3.32 pckA -2.36 aer -3.97\n16 pckA -1.39 fucA -3.23 pckA -1.45 yjcB -3.32 adhE -2.14 ygcW -3.78\n17 serA -1.23 b2736 -3.18 malE -1.36 pckA -3.23 aer -2.13 gcd -3.67\n18 aer -1.23 pckA -3.17 aer -1.32 aer -3.17 cdd -2.10 deaD -3.65\n19 adhE -1.22 aer -3.08 serA -1.32 qseA -3.07 deaD -2.04 serA -3.62\n20 mlc -1.01 yjeG -3.05 adhE -1.28 uxaC -3.07 uxaC -2.02 mlc -3.62\n# Diffs from C6 10 11 3 3 4 0\nBecause it is sometimes instructive to examine a fixed number of top hits regardless of the reported q-values, in this table we compare the six approaches' best 20 intergenic regions for Crp. By comparing each column to Column C6, which is the best approach we employed, we see that the C1-C5 approaches give significantly different q-values for, and orderings of, the predicted regulated genes. 
As indicated in the bottom row, the C1-C5 approaches miss several of the top-20 genes reported in C6, replacing them with genes that did not make the C6 top-20 list. In particular, although it uses all of the sequence data except S. typhi, C2 is significantly different from C6. Furthermore, although C3 has few differences from C6 in the set of genes indicated, the q-values of C3 are considerably worse and the gene order is substantially rearranged. These data suggest that the ability to simultaneously handle both aligned and unaligned data is important in obtaining accurate predictions. Notes: abcSee the caption notes for Table 1. Also see the Table 1 caption for descriptions of Columns C1-C6. It is worth noting here that the non-alignable species were selected for combination of p-values based upon the presence or absence of the transcription factor under study. All gamma-proteobacteria used in this study encode orthologs to Crp; hence, data for all species were included when p-values were combined from scans with the Crp motif. In contrast, because S. oneidensis and P. aeruginosa do not encode PurR orthologs, these species were not considered when we scanned for PurR binding sites."}
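
The text above describes combining p-values from phylogenetically independent clades using the method of Bailey and Gribskov. The Methods section is not part of this excerpt, so the following is only a minimal sketch under an assumption: that the combination is the standard closed form for the p-value of a product of n independent, uniformly distributed p-values, P = p * sum_{i=0}^{n-1} (-ln p)^i / i!, where p is the product. The function name `combined_p_value` is hypothetical and is not taken from PhyloScan.

```python
import math


def combined_p_value(p_values):
    """Return the p-value of the product of independent p-values.

    Assumption: each input p-value is uniform on [0, 1] under the null
    hypothesis and the inputs are mutually independent (for example, one
    p-value per clade). This is an illustrative sketch, not PhyloScan code.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
    if any(p == 0.0 for p in p_values):
        return 0.0

    # Work in log space so that products of many small values stay stable.
    log_product = sum(math.log(p) for p in p_values)
    x = -log_product  # non-negative, because every p is at most 1

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= x / i
        total += term

    return min(1.0, math.exp(log_product) * total)


if __name__ == "__main__":
    # Hypothetical per-clade p-values for a single candidate intergenic region.
    print(combined_p_value([0.05, 0.05]))
    print(combined_p_value([0.05, 0.9]))
```

Under this assumed formula, a clade with weak evidence (a p-value near 1) raises the combined p-value rather than leaving it unchanged, which is consistent with the text's choice to include only species that encode an ortholog of the transcription factor under study.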