Background The growing number of gene expression datasets and availability of hundreds of bacterial genomes accelerated the quest for the construction of bacterial transcriptional regulatory networks (TRNs). In most prokaryotic genes, transcription initiation is controlled by DNA sequence elements recognized by RNA polymerase. The activity of RNA polymerase (RP) is regulated through interaction with transcription factors (TFs) which alter the binding affinity of RP. Discovery of TRNs advances our understanding of mechanisms of cellular processes and responses, and is of particular importance in biotechnical applications and identifying the nature of diseases from a genome-wide perspective. Our objective in this work is to develop a robust methodology to use known TRN information as a training set and augment it by discovering new gene/TF interactions using a variety of approaches integrated via an objective Bayesian scheme. We apply the methodology to E. coli as it is believed to have the most well understood TRN; therefore it serves as an excellent test case. However, out of roughly 4300 genes and around 300 predicted TFs [1], the current E. coli TRN includes only 984 genes and 144 TFs. Hence, it is clear that we only know a fraction of the network. According to Babu and Teichmann three-quarters of the TFs are two-domain proteins, i.e., DNA-binding domain and regulatory domain (mostly for small molecules), showing the importance of TFs in adapting to environmental conditions [1]. Like most biological interaction networks, the E. coli network seems to follow a power law (scale free) distribution, suggesting that TRNs tend to be connected among high-degree nodes and low-degree ones [2]. Another important property of TRNs is the statistically overrepresented network motifs. Shen-Orr et al. showed that the feed forward motif (two TFs co-regulating one gene and one TF regulating the other) is overrepresented by a factor of 8 in the known E. coli TRN [3]. These studies advance our understanding of design principles in bacterial TRNs. However, they do not have a direct impact on the construction of TRNs. There have been numerous approaches to TRN inference from gene expression data. Most studies considered gene-gene networks rather than gene-TF networks. Among them are principal component analysis [4] and independent component analysis [5]. Network component analysis (NCA) is a TF-based methodology which differs from other techniques in that the structure of the gene regulatory network is assumed to be known [6]. Therefore, NCA's use is limited to cases in which the network is fairly well known and has strong structural limitations. In reality, only an incomplete and possibly biased TRN is available due to the limited spectrum of experimental conditions imposed. Gardner et al. proposed a methodology to construct the gene-gene control network structure for small networks using microarray data, limiting the number of interactions per gene [7]. We tested a similar approach for large networks and showed that even when there are just a few interactions per gene, there can be thousands of networks that can explain the same microarray data with essentially the same accuracy. Kyoda et al. developed a methodology that employs mutation experiments to arrive at the TRN [8]. However, it is questionable whether their approach can be applied to large TRNs. Liang et al. presented a methodology for Boolean networks and applied it to a small 50 gene system with at most 3 interactions per gene [9]. Boolean networks are an oversimplification of gene expression as they use a binary approximation (fully on or off) [10]. Cluster analysis is based on statistical techniques wherein correlations are sought between the responses of genes [11,12]. However the coordination can be extremely complex and circuitous, i.e. genes may be involved in a multi-branch feedback loop with several TFs made or activated/deactivated by the proteins they encode. These time-delayed, complex relationships are revealed by our methodology as it discovers and quantifies many of these feedback relationships. Although cluster analysis might suggest groups of genes that may be involved in related pathways, it is not an accurate methodology to suggest gene/TF interactions. D'haeseleer et al. applied clustering based on the correlation of microarray data [13]. To assess the feasibility of inferring gene-gene networks from expression data only, we used two independent gene expression data sets and a TRN for E. coli [14]. We calculated the linear correlation of genes that encode a TF and genes that are known to be regulated by the same TF. We also obtained correlation coefficients for all gene-gene pairs. Fig. 1 shows the probability of correlation between two randomly chosen genes and that for known pairs with similar known gene/TF interactions. Throughout the manuscript we compute probability densities. These probability density functions are normalized to have unit area although their value at any score can exceed unity (∫−∞∞p(x′)dx′=1 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWdXbqaaiabdchaWjabcIcaOiqbdIha4zaafaGaeiykaKIaemizaqMafmiEaGNbauaaaSqaaiabgkHiTiabg6HiLcqaaiabg6HiLcqdcqGHRiI8aOGaeyypa0JaeGymaedaaa@3C5A@). The actual probability can then be calculated by taking the integral of the function p(x) by the integration interval of the input variable x. The similarity of these distributions demonstrates that successful reconstruction of the network using expression data alone does not seem likely. Mutual information seems to have similar limitations [15]. However, this does not mean that correlation and mutual information-based methods are not able to discover interesting gene-gene relationships; rather their potential to infer gene/TF interactions is very limited. Therefore, the main assumption in constructing gene-gene networks, i.e. that the TF activity follows the expression of the encoding gene seems to be unreliable. We address this problem by constructing approximate TF activity profiles using a preliminary TRN as discussed below. Figure 1 Probability distribution for correlation (Pearson) between a random pair and known gene/TF regulatory interaction for E. coli. Square markers refer to the dataset obtained from the U. of Oklahoma E. coli database. Diamond markers refer to the datasets obtained from the NIH omnibus service (GSE7, GSE8, GSE9; 65 datasets). The solid and hollow markers show the probability distribution for correlation between a random gene pair and known gene/TF regulatory interaction, respectively. As these probability distributions are indistinguishable, it does not seem feasible to construct the TRN using expression data alone. We also calculated probability distributions for mutual information which yielded similar findings. The difficulty with the above studies is the gap between the complexity of the network and the quantity of information in just one methodology. The solution is to use as much information as possible to rule out spurious networks. Segal et al. assumed that genes in the same pathway are activated together and their protein products often interact [16]. This led them to the use of protein-protein interaction information in their predictions. Brazma et al. studied the similarities of the upstream regions of genes that have a similar expression pattern [17]. A similar study was presented by Haverty et al. who used statistical methods for identifying overabundant TF binding motifs (from TRANSFAC and JASPER) and microarray data to infer the TRN [18]. Lee et al. presented a conceptual framework to integrate diverse functional genomics data (including expression data, gene-fusions, phylogenetic profiles, co-citation, and protein interaction data) and applied it to investigate gene-gene network in Saccharomyces cerevisiae [19]. The major difference between [19] and this work is that we are interested in constructing gene/TF networks rather than gene-gene networks. Gene ontology (GO) and phylogenic similarity as approaches to functional module prediction have been explored by [20]. This work is based on the hypothesis that a pair of genes with high GO or phylogenic similarity score is likely in the same functional module (operon or regulon). In this study, we extend their work to include gene expression analysis, and focus on TRN construction. We show that GO and phylogenic similarity can be used to greatest advantage if they are based on a gene/TF interaction model.