Results Our methodology requires a preliminary TRN which is used as the training set in all three methodologies presented below. We gathered this training set from EcoCyc [14]. EcoCyc describes E. coli operons, promoters, TFs, and TF binding sites. The database describes the mechanisms of transcriptional regulation of E. coli genes, and contains the most complete description of the genetic network of any organism. EcoCyc and RegulonDB [36] are curated to ensure that their data content is the same. The preliminary TRN used in this study included 984 genes, 144 TFs, and 2007 gene/TF interactions. Out of 2007 gene/TF interactions, 1124 were up regulation, 766 were down regulation, 5 were uncertain, and 112 were dual regulation (both up/down). Basic properties of the preliminary E. coli TRN are illustrated in Fig. 5. Figure 5 a) Probability distribution for the number of genes regulated by a given TF, b) probability distribution for the number of gene/TF interactions per gene. These graphs are based on the preliminary TRN taken from [14]. We applied the FTF methodology to E. coli using expression datasets obtained from the NIH omnibus service: GSE7 (physiological and genetic changes that affect tryptophan metabolism), GSE8 (chromosomal replication forks in synchronized cells) and GSE9 (UV exposure). These 65 sets were chosen as the experiments were performed on the same platform. One single run of FTF on a PC (Xeon 2.4 GHz) takes about 15 minutes and requires 700 MB memory. The probability distributions for the absolute value of the Pearson correlation coefficient between the constructed TF activities (using equation 2) and expression data are shown in Fig. 6 for both the training and random sets. A comparison of Fig. 6 and Fig. 1 shows that by constructing TF activities using a preliminary TRN, we significantly increase the amount of information extracted from expression data. Figure 6 Probability distribution of FTF similarity scores of the training set (dashed) with respect the random set (solid). x-axis refers to FTF similarity score while y-axis refers to its probability distribution. A comparison with Fig. 1 (diamond markers) shows that our approach is superior to the gene-gene correlation approach. Using the biological process ontology developed by the Gene Ontology Consortium, we calculated GO similarity scores. We then calculated gene/TF scores using the approach described in From Gene-Gene Scores to Gene-TF Scores Section. Fig. 7 shows the probability distributions for the training (gene/TF interactions in the preliminary TRN) and complete (all possible gene/TF interactions) sets. The significant variation between the training and random sets provides evidence that the likelihood for a gene pair to be regulated in the same manner increases with the similarity of their GO description. A comparison of Fig. 7 and Fig. 2 of Wu et al. (2005) shows that our approach is more successful in distinguishing the training and random sets (Note that [20] included pathway data in their training set whereas we only used the E. coli TRN). Figure 7 Comparison of the probability distributions of GO similarity scores of the training set (square markers) and the random set (diamond markers). The training set consists of all known E. coli gene/TF interactions for those genes with GO terms assigned. The random set consists of all possible gene/TF interactions for those genes with GO terms assigned. It is seen that higher GO similarity score implies higher likelihood of a gene/TF interaction, particularly when the GO similarity score is larger than 8. We extended the number of genomes used in the phylogenic similarity analysis from 134 to 229 and used the E. coli TRN as the training set in contrast to the gene-gene pair training set suggested by [20]. Fig. 8 shows the probability distributions for the training (gene/TF interactions in the preliminary TRN) and complete (all possible gene/TF interactions) sets. Phylogenic similarity outperforms the GO and FTF methodologies. As in the case for GO similarity, the results are better than those obtained earlier (Fig. 4 of [20]) due to the gene-TF versus the gene-gene based approach. Figure 8 Comparison of the probability distributions of Phylogenic Similarity scores of the training set (dashed) and the random set (solid). x-axis refers to Phylogenic Similarity Score while y-axis refers to its probability distribution. The training set is based on all known gene/TF interactions from [14]. The random set consists of all possible gene/TF interactions. It is seen that higher score implies higher likelihood of a gene/TF interaction, particularly when the similarity score is greater than 500. The probability distributions of the integrated confidence score for the training and complete gene/TF sets are shown in Fig. 9. We applied a threshold of 1.3 to this score to find the most likely gene/TF interactions. To facilitate the use of our results by the research community, they are posted at [37] where users can view/download the results and perform search queries. As our procedure is automated, when new information and microarray or other data become available, the entire procedure can be readily repeated. Figure 9 Probability distribution of combined scores for the training set (dashed) and the random set (solid). The training set is based on all known gene/TF interactions from [14]. The random set consists of all possible gene/TF interactions. It is seen that higher combined score implies higher likelihood of a gene/TF interaction. To provide an objective measure of deviations between two probability distributions, we calculated the chi-square scores for GO, phylogenic, and FTF analysis as well as the final integrated probability distributions (Figs. 6, 7, 8, 9). We created 4 bins for all distributions and calculated the number of gene/TF scores in each bin. Note that a chi-square score of 16.27 gives a p-value of 0.001 for a system with three degrees of freedom (number of bins minus one). We found the chi-square scores to be 49667 (phylogenic similarity), 13005 (GO), 579 (FTF), and 79584 (integrated). These scores indicate and GO and phylogenic similarity measures provide better predictions than expression analysis. Higher chi-square score for the integrated probability distributions justifies the integration scheme. A cross examination of scores from different methodologies has shown that if a gene/TF interaction scores high for one of the three methodologies, this doesn't imply that the remaining two methods will support this prediction. For example, out of the 1000 highest phylogenic similarity scores, only 48 and 3 of them were found in the top 1000 GO and FTF scores. The suggested TRN includes 3694 new gene/TF interactions. If the training TRN is a random sampling of the actual TRN, then, for a sufficiently large training TRN, it is expected to exhibit the basic functional properties of the actual TRN. The suggested TRN is denser than the training TRN. However, as illustrated in Fig. 10, probability distributions for the number of gene/TF interactions per gene for both the training and suggested TRNs show a high degree of similarity. Clearly, our training set is vastly incomplete. Not only we don't have any regulatory information for over 3,000 genes, but we likely know only a fraction of the number of TFs regulating those 984 genes for which at least one regulating TF is known. Therefore, the true E. coli TRN is likely to be denser, as predicted here. Figure 10 Probability distributions for the number of gene/TF interactions per gene. Although the suggested TRN is denser, the overall shape of the probability distribution remains the same. After we performed the calculations we found 206 more gene/TF interactions in the RegulonDB and EcoCyc databases that were not included in the training set. 44 out of 206 regulatory interactions were predicted by our methodology. Out of 44 interactions, the nature of regulation was correctly predicted for 33 of them. Regulation type couldn't be obtained for 7 interactions. Regulation nature was incorrectly predicted for the remaining 4 interactions (Table 2). We obtained the p-value for predicting at least 44 out of 206 gene/TF interactions to be less than 1.0e-50 (expected proportion = 3.5e-04, number observed = 44, sample size = 3694). Table 2 Out of 206 gene/TF interactions found in the RegulonDB (Salgado et al. 2004) and EcoCyc databases, 44 scored higher than the imposed threshold. TF Gene Final Score predicted sign actual sign 1 ArcA-Phosphorylated yfiD 1.670768591 up up 2 ArgR-L-arginine argG 2.624262246 down down 3 CRP-cAMP ugpQ 2.085693237 up up 4 CRP-cAMP ugpC 1.438960585 up up 5 CRP-cAMP ugpB 1.527292985 up up 6 CRP-cAMP rhaB 2.909109432 up up 7 CRP-cAMP cytR 2.119101207 up up 8 CRP-cAMP fis 1.887330412 up/down up/down 9 FhlA-Formate hyfA 2.509458668 up up 10 Fis relA 1.595459732 up up 11 GntR idnK 1.591861262 down down 12 Lrp livJ 1.380451883 down down 13 MalT-Maltotriose-ATP malZ 1.44543234 up up 14 NarL-Phosphorylated fdhF 2.059069186 up up 15 NtrC-Phosphorylated astC 2.256990592 up up 16 ArgR-L-arginine astC 1.505034 down up 17 CRP-cAMP nagE 1.501987378 up/down up 18 CRP-cAMP rpoS 2.319114034 up/down down 19 OmpR-Phosphorylated nmpC 1.572242248 up down 20 ArcA-Phosphorylated aceE 1.998375233 down down 21 ArcA-Phosphorylated appC 2.066734924 up up 22 CRP-cAMP entC 2.93996178 up up 23 CRP-cAMP fepA 3.821927836 up up 24 CRP-cAMP fumA 3.496049117 up up 25 CRP-cAMP gadB 2.197333426 down down 26 CRP-cAMP galP 2.37334941 up up 27 CRP-cAMP gapA 1.811986697 up up 28 CRP-cAMP ompF 1.315706561 up up 29 FruR acnA 1.633652931 up up 30 FruR glk 2.112996214 down down 31 GadE gadB 1.973757857 up up 32 Hns gadB 1.78680429 down down 33 LexA uvrC 1.590391147 down down 34 MetJ-S-adenosylmethionine metE 3.264214502 down down 35 MetJ-S-adenosylmethionine metR 1.611151228 down down 36 MetR-Homocysteine metE 3.45265202 up up 37 PhoP-Phosphorylated rstA 1.6475787 up up 38 CRP-cAMP prpR 3.147648347 up/down up 39 Fnr fdhF 1.985483486 up/down up 40 CRP-cAMP nagE 1.501987378 up/down up 41 NarP-Phosphorylated fdhFp 1.645007039 up/down up 42 NarP-Phosphorylated fdnG 2.778075473 up/down down 43 Fnr dcuS 1.359810967 down up 44 Fnr purM 1.427076396 up down The p-value (using binary distribution) is found to be less than 1.0e-50. We also used the gene expression data (described above in the microarray analysis section) to further test the suggested TRN as follows. We obtained approximate TF activities for both the training and suggested TRNs. Then, for each gene we calculated the linear correlation coefficient between the expression data and the sum of TF activity profiles (accounting separately up versus down regulation). Higher scores indicate better consistency between expression data and TRN. The average scores for the training and suggested TRNs were calculated to be 0.47 and 0.54, respectively, showing an improvement in the overall consistency of the TRN with gene expression profiles.