Results
Our methodology requires a preliminary TRN which is used as the training set in all three methodologies presented below. We gathered this training set from EcoCyc [14]. EcoCyc describes E. coli operons, promoters, TFs, and TF binding sites. The database describes the mechanisms of transcriptional regulation of E. coli genes, and contains the most complete description of the genetic network of any organism. EcoCyc and RegulonDB [36] are curated to ensure that their data content is the same. The preliminary TRN used in this study included 984 genes, 144 TFs, and 2007 gene/TF interactions. Out of 2007 gene/TF interactions, 1124 were up regulation, 766 were down regulation, 5 were uncertain, and 112 were dual regulation (both up/down). Basic properties of the preliminary E. coli TRN are illustrated in Fig. 5.
Figure 5  a) Probability distribution for the number of genes regulated by a given TF, b) probability distribution for the number of gene/TF interactions per gene. These graphs are based on the preliminary TRN taken from [14]. We applied the FTF methodology to E. coli using expression datasets obtained from the NIH omnibus service: GSE7 (physiological and genetic changes that affect tryptophan metabolism), GSE8 (chromosomal replication forks in synchronized cells) and GSE9 (UV exposure). These 65 sets were chosen as the experiments were performed on the same platform. One single run of FTF on a PC (Xeon 2.4 GHz) takes about 15 minutes and requires 700 MB memory. The probability distributions for the absolute value of the Pearson correlation coefficient between the constructed TF activities (using equation 2) and expression data are shown in Fig. 6 for both the training and random sets. A comparison of Fig. 6 and Fig. 1 shows that by constructing TF activities using a preliminary TRN, we significantly increase the amount of information extracted from expression data.
Figure 6  Probability distribution of FTF similarity scores of the training set (dashed) with respect the random set (solid). x-axis refers to FTF similarity score while y-axis refers to its probability distribution. A comparison with Fig. 1 (diamond markers) shows that our approach is superior to the gene-gene correlation approach. Using the biological process ontology developed by the Gene Ontology Consortium, we calculated GO similarity scores. We then calculated gene/TF scores using the approach described in From Gene-Gene Scores to Gene-TF Scores Section. Fig. 7 shows the probability distributions for the training (gene/TF interactions in the preliminary TRN) and complete (all possible gene/TF interactions) sets. The significant variation between the training and random sets provides evidence that the likelihood for a gene pair to be regulated in the same manner increases with the similarity of their GO description. A comparison of Fig. 7 and Fig. 2 of Wu et al. (2005) shows that our approach is more successful in distinguishing the training and random sets (Note that [20] included pathway data in their training set whereas we only used the E. coli TRN).
Figure 7  Comparison of the probability distributions of GO similarity scores of the training set (square markers) and the random set (diamond markers). The training set consists of all known E. coli gene/TF interactions for those genes with GO terms assigned. The random set consists of all possible gene/TF interactions for those genes with GO terms assigned. It is seen that higher GO similarity score implies higher likelihood of a gene/TF interaction, particularly when the GO similarity score is larger than 8. We extended the number of genomes used in the phylogenic similarity analysis from 134 to 229 and used the E. coli TRN as the training set in contrast to the gene-gene pair training set suggested by [20]. Fig. 8 shows the probability distributions for the training (gene/TF interactions in the preliminary TRN) and complete (all possible gene/TF interactions) sets. Phylogenic similarity outperforms the GO and FTF methodologies. As in the case for GO similarity, the results are better than those obtained earlier (Fig. 4 of [20]) due to the gene-TF versus the gene-gene based approach.
Figure 8  Comparison of the probability distributions of Phylogenic Similarity scores of the training set (dashed) and the random set (solid). x-axis refers to Phylogenic Similarity Score while y-axis refers to its probability distribution. The training set is based on all known gene/TF interactions from [14]. The random set consists of all possible gene/TF interactions. It is seen that higher score implies higher likelihood of a gene/TF interaction, particularly when the similarity score is greater than 500. The probability distributions of the integrated confidence score for the training and complete gene/TF sets are shown in Fig. 9. We applied a threshold of 1.3 to this score to find the most likely gene/TF interactions. To facilitate the use of our results by the research community, they are posted at [37] where users can view/download the results and perform search queries. As our procedure is automated, when new information and microarray or other data become available, the entire procedure can be readily repeated.
Figure 9  Probability distribution of combined scores for the training set (dashed) and the random set (solid). The training set is based on all known gene/TF interactions from [14]. The random set consists of all possible gene/TF interactions. It is seen that higher combined score implies higher likelihood of a gene/TF interaction. To provide an objective measure of deviations between two probability distributions, we calculated the chi-square scores for GO, phylogenic, and FTF analysis as well as the final integrated probability distributions (Figs. 6, 7, 8, 9). We created 4 bins for all distributions and calculated the number of gene/TF scores in each bin. Note that a chi-square score of 16.27 gives a p-value of 0.001 for a system with three degrees of freedom (number of bins minus one). We found the chi-square scores to be 49667 (phylogenic similarity), 13005 (GO), 579 (FTF), and 79584 (integrated). These scores indicate and GO and phylogenic similarity measures provide better predictions than expression analysis. Higher chi-square score for the integrated probability distributions justifies the integration scheme. A cross examination of scores from different methodologies has shown that if a gene/TF interaction scores high for one of the three methodologies, this doesn't imply that the remaining two methods will support this prediction. For example, out of the 1000 highest phylogenic similarity scores, only 48 and 3 of them were found in the top 1000 GO and FTF scores.
The suggested TRN includes 3694 new gene/TF interactions. If the training TRN is a random sampling of the actual TRN, then, for a sufficiently large training TRN, it is expected to exhibit the basic functional properties of the actual TRN. The suggested TRN is denser than the training TRN. However, as illustrated in Fig. 10, probability distributions for the number of gene/TF interactions per gene for both the training and suggested TRNs show a high degree of similarity. Clearly, our training set is vastly incomplete. Not only we don't have any regulatory information for over 3,000 genes, but we likely know only a fraction of the number of TFs regulating those 984 genes for which at least one regulating TF is known. Therefore, the true E. coli TRN is likely to be denser, as predicted here.
Figure 10  Probability distributions for the number of gene/TF interactions per gene. Although the suggested TRN is denser, the overall shape of the probability distribution remains the same. After we performed the calculations we found 206 more gene/TF interactions in the RegulonDB and EcoCyc databases that were not included in the training set. 44 out of 206 regulatory interactions were predicted by our methodology. Out of 44 interactions, the nature of regulation was correctly predicted for 33 of them. Regulation type couldn't be obtained for 7 interactions. Regulation nature was incorrectly predicted for the remaining 4 interactions (Table 2). We obtained the p-value for predicting at least 44 out of 206 gene/TF interactions to be less than 1.0e-50 (expected proportion = 3.5e-04, number observed = 44, sample size = 3694).
Table 2  Out of 206 gene/TF interactions found in the RegulonDB (Salgado et al. 2004) and EcoCyc databases, 44 scored higher than the imposed threshold.
TF  Gene  Final Score  predicted sign  actual sign
1  ArcA-Phosphorylated  yfiD  1.670768591  up  up
2  ArgR-L-arginine  argG  2.624262246  down  down
3  CRP-cAMP  ugpQ  2.085693237  up  up
4  CRP-cAMP  ugpC  1.438960585  up  up
5  CRP-cAMP  ugpB  1.527292985  up  up
6  CRP-cAMP  rhaB  2.909109432  up  up
7  CRP-cAMP  cytR  2.119101207  up  up
8  CRP-cAMP  fis  1.887330412  up/down  up/down
9  FhlA-Formate  hyfA  2.509458668  up  up
10  Fis  relA  1.595459732  up  up
11  GntR  idnK  1.591861262  down  down
12  Lrp  livJ  1.380451883  down  down
13  MalT-Maltotriose-ATP  malZ  1.44543234  up  up
14  NarL-Phosphorylated  fdhF  2.059069186  up  up
15  NtrC-Phosphorylated  astC  2.256990592  up  up
16  ArgR-L-arginine  astC  1.505034  down  up
17  CRP-cAMP  nagE  1.501987378  up/down  up
18  CRP-cAMP  rpoS  2.319114034  up/down  down
19  OmpR-Phosphorylated  nmpC  1.572242248  up  down
20  ArcA-Phosphorylated  aceE  1.998375233  down  down
21  ArcA-Phosphorylated  appC  2.066734924  up  up
22  CRP-cAMP  entC  2.93996178  up  up
23  CRP-cAMP  fepA  3.821927836  up  up
24  CRP-cAMP  fumA  3.496049117  up  up
25  CRP-cAMP  gadB  2.197333426  down  down
26  CRP-cAMP  galP  2.37334941  up  up
27  CRP-cAMP  gapA  1.811986697  up  up
28  CRP-cAMP  ompF  1.315706561  up  up
29  FruR  acnA  1.633652931  up  up
30  FruR  glk  2.112996214  down  down
31  GadE  gadB  1.973757857  up  up
32  Hns  gadB  1.78680429  down  down
33  LexA  uvrC  1.590391147  down  down
34  MetJ-S-adenosylmethionine  metE  3.264214502  down  down
35  MetJ-S-adenosylmethionine  metR  1.611151228  down  down
36  MetR-Homocysteine  metE  3.45265202  up  up
37  PhoP-Phosphorylated  rstA  1.6475787  up  up
38  CRP-cAMP  prpR  3.147648347  up/down  up
39  Fnr  fdhF  1.985483486  up/down  up
40  CRP-cAMP  nagE  1.501987378  up/down  up
41  NarP-Phosphorylated  fdhFp  1.645007039  up/down  up
42  NarP-Phosphorylated  fdnG  2.778075473  up/down  down
43  Fnr  dcuS  1.359810967  down  up
44  Fnr  purM  1.427076396  up  down
The p-value (using binary distribution) is found to be less than 1.0e-50. We also used the gene expression data (described above in the microarray analysis section) to further test the suggested TRN as follows. We obtained approximate TF activities for both the training and suggested TRNs. Then, for each gene we calculated the linear correlation coefficient between the expression data and the sum of TF activity profiles (accounting separately up versus down regulation). Higher scores indicate better consistency between expression data and TRN. The average scores for the training and suggested TRNs were calculated to be 0.47 and 0.54, respectively, showing an improvement in the overall consistency of the TRN with gene expression profiles.