2.3. Analysis of Enriched Transcription Factor Binding Sites
Transcription factor binding sites in promoters of differentially expressed genes were analyzed using known DNA-binding motifs described in the TRANSFAC® library, release 2014.4 (BIOBASE, Wolfenbüttel, Germany) [23]. The geneXplain platform provides tools that first identify a set of important motifs whose occurrences are enriched in the study promoters compared to a suitable background sequence set, e.g., one composed of promoters of genes that were not differentially regulated under the experimental condition. In the following, we refer to the study and background sets as the Yes and No sets, respectively. The algorithm for transcription factor binding site (TFBS) enrichment analysis has been described by Kel et al. [9]. For each library motif, the procedure finds the score threshold that optimizes the Yes/No ratio $R_{YN}$, as defined in Equation (1), under the constraint of statistical significance.
(1)  $R_{YN} = \dfrac{\#Sites_{Yes} / \#Seq_{Yes}}{\#Sites_{No} / \#Seq_{No}}$
In Equation (1), #Sites and #Seq are the numbers of sites and sequences counted in the Yes and No sets. A higher Yes/No ratio indicates stronger enrichment of binding sites for a motif in the Yes sequences. One may count all binding sites that occur at a certain threshold and calculate the statistical significance using the one-tailed binomial test (Equation (2)), or one can count each sequence at most once, i.e., record only whether it contains at least one site, and apply the one-tailed Fisher test (Equation (3)), where K denotes the number of sequences with at least one site, k the number of Yes sequences with a site, and M = #SeqYes. To statistically correct the Yes/No ratio and thereby achieve a better ranking of motifs according to their importance, Stegmaier et al. [11] described an extension that makes use of the Beta ratio distribution. For improved computational speed, the algorithm incorporated in the geneXplain platform corrects the Yes/No ratio to the lower bound of a chosen confidence interval, assuming that the log-Yes/No ratio is approximately normally distributed [24].
(4)  $R_{YN}^{99\%} = \exp\left(\log(R_{YN}) - \alpha_{99\%} \cdot SE\right)$
(5)  $SE = \sqrt{\dfrac{1}{\#Sites_{Yes}} + \dfrac{1}{\#Sites_{No}} + \dfrac{1}{\#Seq_{Yes}} + \dfrac{1}{\#Seq_{No}}}$
For the 99% confidence interval, the geneXplain platform uses an α-value of ~2.576. As an alternative to this approximation and to the Beta ratio approach [11], one can calculate
(6)  $R_{YN}^{99\%,Beta} = \dfrac{\#Seq_{No}}{\#Seq_{Yes} + \#Seq_{No}} \Big/ Q_{Beta}\left(0.99;\ \alpha = \#Sites_{No} + 1,\ \beta = \#Sites_{Yes} + 1\right)$
where $Q_{Beta}$ is the quantile function of the Beta distribution. This formula makes use of the Beta distribution for the site proportions, whereas the sequence proportion is treated as constant. To our knowledge, these are currently the only described methods that provide a correction for the Yes/No ratio. The speed gain of Equations (4) and (5) over numerical calculation of the quantile of the Beta ratio distribution as described in [11] is substantial. We randomly sampled 1000 parameter sets, each with two values in the interval [1,200] representing binding site counts and two values, 500 and 1000, representing the numbers of Yes and No sequences. Correcting the Yes/No ratio for the 1000 parameter sets required 0.1 ± 0.008 ms with the log-Yes/No ratio correction (Equation (4)), 10.02 ± 0.14 ms with the Beta distribution quantile (Equation (6)), and 19813.5 ± 263.75 ms with the ratio of Beta distributions [10]. Equations (4) and (6) have the additional advantage that their values are not bounded by the relative proportion of Yes sequences. Figure 1A–C compare the values returned by the methods for the same parameter sets. The plots show that the corrected ratios of all three methods are correlated, although the log-Yes/No ratio correction shows some dispersion relative to the methods involving the Beta distribution (Figure 1A,C). This is likely caused by the regularization with a uniform Beta(1,1) distribution.
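The quantities discussed above can be illustrated with a minimal Python sketch. It is not the geneXplain implementation. The function names, the symbol N (total number of sequences), the null model of the binomial test (each site falls into a Yes sequence with probability #SeqYes / (#SeqYes + #SeqNo)), and the example counts are assumptions made for illustration. The Beta quantile is obtained by bisection on the closed-form Beta CDF for integer parameters, which is adequate for site counts up to a few hundred.

```python
from math import comb, exp, lgamma, log, sqrt


def yes_no_ratio(sites_yes, seq_yes, sites_no, seq_no):
    """Equation (1): site density in Yes sequences over site density in No sequences."""
    return (sites_yes / seq_yes) / (sites_no / seq_no)


def binomial_upper_tail(sites_yes, sites_no, seq_yes, seq_no):
    """One-tailed binomial p-value P(X >= sites_yes) under the assumed null model."""
    n = sites_yes + sites_no
    p = seq_yes / (seq_yes + seq_no)  # assumed success probability
    total = 0.0
    for i in range(sites_yes, n + 1):
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        total += exp(log_choose + i * log(p) + (n - i) * log(1 - p))
    return min(1.0, total)


def fisher_upper_tail(k, K, M, N):
    """One-tailed Fisher (hypergeometric) p-value P(X >= k).

    K: sequences with at least one site, k: Yes sequences with a site,
    M: number of Yes sequences, N: all sequences (N is an assumed symbol).
    """
    favourable = sum(comb(M, i) * comb(N - M, K - i) for i in range(k, min(K, M) + 1))
    return favourable / comb(N, K)


def corrected_ratio_ci(sites_yes, seq_yes, sites_no, seq_no, z=2.576):
    """Equations (4) and (5): lower confidence bound of the Yes/No ratio."""
    ratio = yes_no_ratio(sites_yes, seq_yes, sites_no, seq_no)
    se = sqrt(1 / sites_yes + 1 / sites_no + 1 / seq_yes + 1 / seq_no)
    return exp(log(ratio) - z * se)


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b via the binomial-sum identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(q, a, b, iterations=60):
    """Quantile of Beta(a, b) by bisection on the monotone CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def corrected_ratio_beta(sites_yes, seq_yes, sites_no, seq_no, q=0.99):
    """Equation (6): constant No-sequence proportion over a Beta quantile."""
    seq_fraction_no = seq_no / (seq_yes + seq_no)
    return seq_fraction_no / beta_quantile_int(q, sites_no + 1, sites_yes + 1)


# Hypothetical counts for one motif at one score threshold.
counts = dict(sites_yes=120, seq_yes=500, sites_no=90, seq_no=1000)
print("Yes/No ratio:", yes_no_ratio(**counts))
print("binomial p-value:", binomial_upper_tail(120, 90, 500, 1000))
print("Fisher p-value:", fisher_upper_tail(k=100, K=170, M=500, N=1500))
print("CI-corrected ratio:", corrected_ratio_ci(**counts))
print("Beta-corrected ratio:", corrected_ratio_beta(**counts))
```

Both corrections penalize motifs supported by few sites: the smaller the counts, the wider the interval and the lower the corrected ratio.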
Figure 1D compares Beta ratio quantile values computed numerically for the random parameter sets to sample quantiles obtained by drawing 10,000 samples from corresponding Beta distributions and demonstrates the accuracy of the numerical implementation.
Figure 1. Comparison of different methods for Yes/No ratio correction. (A) Beta ratio correction [10] versus log-Yes/No ratio correction Equation (4). (B) Beta ratio correction versus Beta quantile correction Equation (6). (C) Beta quantile correction versus log-Yes/No ratio correction. (D) Comparison of numerical calculation of Beta ratio quantiles to sampling-based quantile estimates.
In the following, we briefly describe how we validated the performance of this method on the basis of experimentally determined transcription factor binding sites. In over 200 ChIP-seq datasets from the ENCODE project [25], we determined the ranks of the TRANSFAC® motifs corresponding to the respective precipitated transcription factors, using different methods to calculate Yes/No ratios as well as binding site scores. A method ought to rank the true motifs highly among all motifs of a library. Figure 2 shows that Yes/No ratio correction led to improved or comparable ranking of the best-performing motif of a factor in at least 80% of the datasets (Figure 2A,B), with the corrections based on Equations (4) and (6) giving similar results. When no method was able to rank the best motif among the first 10 matrices (Figure 2A,B, 90th percentile), Yes/No ratio correction could worsen the rank of the best motif by about 2–3 positions for Log-odds scores, or more strongly for MATCH scores [26]. The poor best ranks at the 90th percentile suggest that in these experiments, binding sites of TFs other than the target factor dominated the bound regions, and that the target TF may have been associated with these regions mainly, or in some cases only, through protein-protein interactions. In a comparison of the median ranks of motifs for those TFs that are represented by several motifs in the TRANSFAC® database (Figure 2C,D), the corrected Yes/No ratios clearly outperformed the uncorrected ratios in at least 90% of the datasets.
The median rank comparison gives insight into how a method may perform for patterns that do not optimally describe the target TF’s specificity. A database may comprise only the motif of a related TF, or of a more general family or subfamily to which the factor belongs, and such a motif may differ somewhat from the binding properties of the factor of interest. Hence, the Yes/No ratio correction provides an improved ranking of motifs for the vast majority of datasets, with regard to both the best-ranking motif and the entire set of motifs known for a given TF.
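The rank-based evaluation described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the study: the motif identifiers and scores are hypothetical, ties are broken by input order, and the nearest-rank percentile definition is an assumption.

```python
from statistics import median


def rank_motifs(scores):
    """Rank motifs by descending score; rank 1 is the top motif.

    Ties keep dictionary order, an arbitrary choice for this sketch.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {motif: rank for rank, motif in enumerate(ordered, start=1)}


def best_and_median_rank(scores, known_motifs):
    """Best (smallest) and median rank of the motifs known for the target TF."""
    ranks = rank_motifs(scores)
    known = [ranks[m] for m in known_motifs if m in ranks]
    return min(known), median(known)


def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile for an integer percent in 1..100 (assumed definition)."""
    ordered = sorted(values)
    position = (percent * len(ordered) + 99) // 100  # ceil(percent * n / 100)
    return ordered[position - 1]


# Hypothetical corrected Yes/No ratios of library motifs in two ChIP-seq datasets.
datasets = {
    "dataset_1": ({"M1": 2.4, "M2": 1.1, "M3": 1.9, "M4": 0.8}, ["M1", "M3"]),
    "dataset_2": ({"M1": 0.9, "M2": 1.7, "M3": 1.2, "M4": 1.5}, ["M3"]),
}
best_ranks = []
for name, (scores, known) in datasets.items():
    best, med = best_and_median_rank(scores, known)
    best_ranks.append(best)
    print(name, "best rank:", best, "median rank:", med)
print("90th percentile of best ranks:", percentile_nearest_rank(best_ranks, 90))
```

Running the same loop once with uncorrected and once with corrected ratios as scores gives the kind of per-percentile comparison summarized in Figure 2.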
Figure 2. Best and median ranks of known motifs at the 70th, 80th and 90th percentiles. ChIP-seq datasets were ordered by the observed best or median ranks of the motifs known for the respective target TFs. Log-odds: binding sites scored using Log-odds scores; Match: binding sites scored using MATCH [26] scores; CI99: correction with a 99% confidence interval as in Equation (4); BI99: correction based on the Beta quantile function as in Equation (6); Site: enrichment computed over all binding sites; Seq: enrichment computed over sequences with at least one site. (A) Best ranks for site enrichment. (B) Best ranks for sequence enrichment. (C) Median ranks for site enrichment. (D) Median ranks for sequence enrichment.
In the geneXplain platform, binding site enrichment analysis was carried out as part of a dedicated workflow. The background set consisted of 300 house-keeping genes. The workflow extracted promoters with a length of 1100 bp (−1000 to 100).
Motifs with a corrected Yes/No ratio > 1 were considered for further analysis. The workflow further predicts binding sites in the promoters of target genes using the filtered matrices at their best enrichment cut-offs, maps the matrices to potential transcription factors, and generates visualizations of all results.
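The promoter window and the motif filter can be sketched in a few lines of Python. The coordinate convention (0-based, half-open intervals, with the first transcribed base counted as the first downstream base), the strand handling, and all names and values are assumptions for illustration. The geneXplain workflow may define the window differently.

```python
def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Return a 0-based, half-open (start, end) interval of upstream + downstream bp.

    Assumed convention: `tss` is the 0-based index of the first transcribed base,
    which is counted as the first of the `downstream` bases.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream + 1, tss + upstream + 1
    raise ValueError(f"unknown strand: {strand!r}")


def filter_enriched(corrected_ratios, threshold=1.0):
    """Keep motifs whose corrected Yes/No ratio exceeds the threshold."""
    return {motif: r for motif, r in corrected_ratios.items() if r > threshold}


# Hypothetical gene and motif values.
start, end = promoter_window(tss=5000, strand="-")
print("window:", (start, end), "length:", end - start)
print(filter_enriched({"motif_a": 1.4, "motif_b": 1.0, "motif_c": 0.7}))
```

With the default arguments, every window spans 1100 bp on either strand, matching the promoter length stated above.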