PMC:1794230 / 34896-39325 JSONTXT

    2_test

    {"project":"2_test","denotations":[{"id":"17244358-12883005-1689320","span":{"begin":2447,"end":2449},"obj":"12883005"},{"id":"17244358-2720468-1689321","span":{"begin":2687,"end":2689},"obj":"2720468"}],"text":"Key user-selectable parameters in PhyloScan\n\nFocus on a target species or clade\nIn running PhyloScan, the user must specify two cutoff values, and can optionally specify additional parameters describing the expected multiplicity of binding sites upstream of a regulated gene. The first cutoff is a p-value cutoff, calculated on a per intergenic-sequence basis for the clade that includes the species of primary interest. We chose a default value of 0.05, so that weak intergenic regions in the target species' clade will not be considered, even when strong intergenic regions are located in orthologous regions in more-distantly related species. The choice of a larger value would reduce the focus on the target species, allowing strong sites in other species to rescue weak sites in the target species. The choice of a smaller value would increase the focus on the target species; the choice of a very small value would effectively cancel out the information available from the related species, since any intergenic region that looks extremely promising in the target species will almost surely continue to look promising when additional data are included.\n\nQuality of reported sites\nThe second cutoff that our approach requires is the q-value cutoff that specifies which sites will be reported. We chose a default value of 0.001, meaning that according to our model, at most 0.1% of the intergenic sequences that we report as binding the transcription factor are chance false positives. 
While we have incorporated a fairly accurate phylogenetic model, we have not incorporated into this model such effects as the non-independence of the positions in a site (e.g., the effect of di- or tri-nucleotide energy terms, also known as stacking energies), nor effects from the cooperative binding of multiple transcription factors on the ability of a factor to bind to a DNA site. Because our model does not capture these and other features, the actual rate of false positives is likely to be higher than 0.1%.\nOn the other hand, in calculating the q-value, we have assumed that the vast majority of intergenic sequences in a genome will likely not contain a transcription factor binding site for the particular transcription factor under study, i.e., we are looking for rare events. Under this assumption, the proportion of all intergenic sequences that are truly null will approach 1.0 in Storey and Tibshirani's q-value calculation (the π̂₀ term of [16]), and so does not appear in our q-value equation (see Methods). In a case where this assumption does not hold, the q-values provided by our approach will be overly conservative.\nNote that the scan technology, first described by Staden [10] and employed here, is a frequentist hypothesis testing approach. A Bayesian approach presents an alternative through the use of Bayesian posterior probabilities for each site. Such an approach would require the specification of a model from which alternative sequences are drawn as well as null sequences. When a large number of observations are available the approach of Efron et al. 
[36] provides a compromise that yields local false discovery rates through the use of empirical Bayesian methods.\n\nThe number of sites per intergenic region\nThe number of potential sites to consider in each intergenic region, and their respective weights, are additional parameters that can be set by the user to best capture the underlying biology in the system under study. Generally speaking, for i ≥ 1, the algorithm detects that an intergenic region with sites is significant when its ith best site is surprisingly strong given its rank as the ith best site. The weight wi should be chosen in proportion to the number of such intergenic regions that are expected to have i as the first/lowest rank that appears strong by this test. We have set the default to have weights (w1, w2) = (0.9, 0.1) under the assumption that approximately 90% of intergenic regions with sites will have a strong site; among the remaining intergenic regions with sites, nearly all will have a site that is surprisingly strong given its rank as second strongest. (See the Methods.)"}
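
The q-value cutoff described under "Quality of reported sites" can be illustrated with a minimal sketch. It assumes the generic Storey and Tibshirani estimate with the null proportion fixed at 1.0, which coincides with the Benjamini-Hochberg adjustment; it is not the PhyloScan equation from the Methods, and the function names `q_values` and `report` and the example p-values are hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """Generic Storey-Tibshirani style q-values.

    pi0 is the assumed proportion of truly null tests. With pi0 = 1.0
    (the rare-event assumption) the term drops out of the estimate.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that each
    # q-value is the minimum estimate over all equal or larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        estimate = pi0 * p_values[i] * m / rank
        running_min = min(running_min, estimate)
        q[i] = running_min
    return q


def report(p_values, q_cutoff=0.001):
    """Return indices of the tests whose q-value passes the cutoff."""
    q = q_values(p_values)
    return [i for i, qi in enumerate(q) if qi <= q_cutoff]


if __name__ == "__main__":
    # Hypothetical per-intergenic-region p-values, for illustration only.
    example = [0.0001, 0.5, 0.00002, 0.9]
    print(q_values(example))
    print(report(example))
```

Passing a value of `pi0` below 1.0 scales every estimate down, which is why q-values computed with the null proportion fixed at 1.0 are conservative when true sites are not rare.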