Microarray analysis
Kinetic cell models hold great promise for predicting cell behavior [28-32]. Unfortunately there is a lack of information about many of the rate and equilibrium constants for the reaction and transport processes involved [33,34]. Simultaneously calibrating all the reaction/transport rate parameters and discovering the gene/TF interaction network structure from available data does not appear to be feasible. Therefore, instead of using a kinetic approach as a basis of TRN construction, we have developed FTF (Fast Transcription Factor analyzer) for network construction via (1) TF activity estimation, (2) statistical arguments, and (3) a preliminary TRN. Once a reliable TRN is obtained using FTF, it can then be used to calibrate the rate and equilibrium constants that appear in transcription/translation kinetic models. An example of such an approach is available at [35].
FTF was designed based on the following notions:
• a method based on TFs has the advantage that microarray noise, and errors in the preliminary TRN, can be overcome by statistics, i.e. the regulation of many genes by a given TF;
• due to data uncertainty, there is not usually enough information content in many single-gene responses to unambiguously determine the effect of all TFs regulating it; and
• TRN discovery requires many automated trials of possible networks, so the algorithm must be efficient.

Calculation of TF activities using FTF
The essential equation on which FTF is based was arrived at empirically after extensive numerical experimentation with synthetic data. In this way we actually know the TRN, TF activities, and the nature of noise added to the expression data, and thereby could quantitatively assess the accuracy of FTF predictions. FTF is based on the following ansatz:
$$T_n^r - T_n^s = \sum_{i=1}^{N_{gene}} H(m_i^r - m_i^s)\, b_{in}\, \Psi_{in}, \qquad (3)$$
where $T_n^r$ = activity of TF n at condition or time r, $m_i^r$ = microarray response of gene i at condition r, $b_{in}$ = TRN ($b_{in}$ = +1/-1 for gene i up/down regulated by TF n, $b_{in}$ = 0 for no regulation), $H(x)$ = +1 for x > 0, -1 for x < 0, and 0 for x = 0, and $\Psi_{in} = 2^{L_i}/(M_n(2^{L_i} - 1))$ for $L_i$ = number of TFs controlling gene i and $M_n$ = number of genes TF n regulates. If there are $N_{expression}$ times or conditions, then eq. (3) constitutes $N_{expression}(N_{expression} - 1)/2$ equations for the $N_{expression}$ activities $T_n^r$ of each TF. Therefore, the problem is overdetermined. In our approach the problem is solved via normal equations, i.e. using a least-squares approach, so that all the expression data are utilized and statistics can help to overcome data uncertainty.
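The least-squares step can be illustrated with a minimal sketch. It is not the authors' implementation. For each TF, the pairwise equations fix the activities only up to an additive constant, so the sketch assumes a zero-mean convention for each activity profile (the text does not state how this constant is chosen). Under that assumption the normal equations reduce to $T_n^r = (1/N_{expression}) \sum_s d_n^{rs}$, where $d_n^{rs}$ is the right-hand side of the ansatz for the pair (r, s). The function and variable names are hypothetical.

```python
def sign(x):
    """H(x): +1 for x > 0, -1 for x < 0, 0 for x == 0."""
    return (x > 0) - (x < 0)


def ftf_activities(m, b):
    """Least-squares TF activities from the pairwise ansatz (a sketch).

    m[i][r] : expression of gene i at condition r
    b[i][n] : +1 / -1 / 0 (TF n up-regulates / down-regulates / does not
              regulate gene i)
    Returns T[n][r]. Each profile has zero mean; this gauge choice is an
    assumption of the sketch, not something stated in the text.
    """
    n_genes, n_cond, n_tf = len(m), len(m[0]), len(b[0])
    # L[i]: number of TFs controlling gene i
    L = [sum(1 for x in row if x != 0) for row in b]
    # M[n]: number of genes regulated by TF n
    M = [sum(1 for i in range(n_genes) if b[i][n] != 0) for n in range(n_tf)]

    T = []
    for n in range(n_tf):
        targets = [i for i in range(n_genes) if b[i][n] != 0]
        profile = []
        for r in range(n_cond):
            total = 0.0
            for s in range(n_cond):
                for i in targets:
                    psi = 2 ** L[i] / (M[n] * (2 ** L[i] - 1))
                    total += sign(m[i][r] - m[i][s]) * b[i][n] * psi
            # Closed-form solution of the normal equations with sum_r T = 0
            profile.append(total / n_cond)
        T.append(profile)
    return T


# Toy example: one TF that activates gene 0 and represses gene 1.
m = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
b = [[1], [-1]]
T = ftf_activities(m, b)
```

In the toy example the recovered profile increases monotonically across the three conditions, as expected for a TF whose activated target rises while its repressed target falls.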
Once TF activities are calculated in this manner, the linear (Pearson) correlation is calculated for all possible gene-TF pairs. This serves as a score used to construct probability distributions for the training set (known gene/TF interactions) and the random set (all possible gene/TF pairs). Comparison of these probability distributions indicates how well the preliminary TRN and the expression data fit together, and the degree to which we can rely on the predictions of FTF. If the preliminary TRN is too small or of poor quality, or if there are too few expression datasets, the training and random set probability distributions are difficult to distinguish. The scores can also be used to rank the genes that are most likely to have expression data inconsistent with the preliminary TRN.
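The scoring step can be sketched as follows, assuming TF activity profiles are already available. The sketch only collects the raw Pearson scores for the two sets. How the scores are binned into probability distributions is not specified in the text and is left out. All names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_pairs(m, T, b):
    """Score every gene/TF pair by the correlation of expression with activity.

    Returns (training, random_set):
      training   : scores of pairs present in the preliminary TRN (b != 0)
      random_set : scores of all possible gene/TF pairs
    """
    training, random_set = [], []
    for i, expr in enumerate(m):
        for n, activity in enumerate(T):
            score = pearson(expr, activity)
            random_set.append(score)
            if b[i][n] != 0:
                training.append(score)
    return training, random_set


# Toy data: gene 0 is activated, gene 1 repressed, gene 2 not regulated.
m = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 1.0]]
T = [[-1.0, 0.0, 1.0]]
b = [[1], [-1], [0]]
training, random_set = score_pairs(m, T, b)
```

Histograms of the absolute values of the two score lists can then be compared. If they overlap strongly, the preliminary TRN and the data carry little usable signal.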
To test FTF we generated a TRN that consists of 1000 genes and 100 TFs. The properties of the TRN are shown in Fig. 2. The synthetic expression data were generated from randomly assigned TF activities. Expression data for gene i were generated using $m_i^r = \sum_{n=1}^{N_{TF}} Q_{in} b_{in} T_n^r$. Here, $m_i^r$ is the expression level of gene i at experiment r, $T_n^r$ is the activity of TF n at experiment r, $N_{TF}$ is the number of TFs, and $Q_{in}$ is a measure of the binding affinity of TF n and gene i.
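A minimal sketch of this data-generation step is given below. The text says only that the TF activities are random and (in the Figure 2 caption) that the $Q_{in}$ span a 20-fold range on a logarithmic scale. The uniform range [-1, 1] for activities and the interval [1, 20] for $Q_{in}$ are assumptions made for illustration, and the function name is hypothetical.

```python
import math
import random


def synthetic_expression(b, n_cond, rng, fold=20.0):
    """Generate m[i][r] = sum_n Q[i][n] * b[i][n] * T[n][r].

    b      : TRN matrix, b[i][n] in {-1, 0, +1}
    n_cond : number of experiments (conditions or times)
    rng    : random.Random instance, for reproducibility
    fold   : ratio between the largest and smallest possible Q
    Assumed distributions (not given in the text): T uniform in [-1, 1],
    Q log-uniform in [1, fold].
    """
    n_genes, n_tf = len(b), len(b[0])
    T = [[rng.uniform(-1.0, 1.0) for _ in range(n_cond)] for _ in range(n_tf)]
    Q = [[math.exp(rng.uniform(0.0, math.log(fold))) for _ in range(n_tf)]
         for _ in range(n_genes)]
    m = [[sum(Q[i][n] * b[i][n] * T[n][r] for n in range(n_tf))
          for r in range(n_cond)]
         for i in range(n_genes)]
    return m, T, Q


# Toy TRN: gene 0 activated by TF 0, gene 1 repressed by TF 1, gene 2 unregulated.
b = [[1, 0], [0, -1], [0, 0]]
m, T, Q = synthetic_expression(b, n_cond=5, rng=random.Random(0))
```

No measurement noise is added here. A noise term would be added to m when testing robustness.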
Figure 2  Properties of TRNs used in the synthetic examples. Networks that consist of 1000 genes and 100 TFs are generated using the probability distribution for the number of genes regulated by a given TF shown in (a). The corresponding probability distribution for the number of regulators per gene is shown in (b). The average number of regulators per gene is 3.62, 5.22, and 7.02 for Networks 1, 2 and 3, respectively. Equal likelihood is chosen for up versus down regulation. To construct a synthetic TRN, for each TF we assigned $u_n = c_1 + c_2 \exp(-c_3 z)$, where $c_1$, $c_2$, $c_3$ are constants (taken to be 0.02, 0.15, and 5, respectively) and z is a random number (between 0 and 1). Then for each gene/TF pair, we assigned a random number $h_{in}$ (between 0 and 1). For a parameter e, which determines how dense the synthetic TRN is, if $h_{in}u_n < e$ we set $b_{in}$ = -1 (down regulation), and if $e \le h_{in}u_n < 2e$ we set $b_{in}$ = 1 (up regulation), assuming the probability of up and down regulation is the same; otherwise $b_{in}$ = 0. The $Q_{in}$ were allowed to vary over a 20-fold range and were generated randomly (on a logarithmic scale). TF activities were assumed to be random as well.
Our synthetic examples with large TRNs show that, despite the simplicity of the FTF approach, the constructed TF activity profiles are reliable. To test the approach, one can compare the constructed TF activities with those used to generate the synthetic expression data. For example, for a TRN that has the properties shown in Fig. 2, even when we eliminate 50% of the TRN to create a "preliminary TRN", 90% of the constructed TF activities have a Pearson correlation coefficient of at least 0.70 with the TF activities used to generate the synthetic expression data (when 20 or more microarray experimental conditions were used). Fig. 3 shows the dependence of the results on the number of experiments. This graph shows that, for practical reasons, it is not feasible to recover the full network. Fig. 4a shows the effect of network structure on the results. As the network gets denser, the percentage of the network that can be recovered decreases. Fig. 4b illustrates the dependence of the percentage of recovery on the degree of incompleteness in the preliminary TRN. As anticipated, more complete preliminary TRNs allow a higher percentage of the unknown part of the network to be recovered using expression data. These results suggest that in a real-world application such as E. coli (for which probably less than 40% of the TRN is known, based on the number of known gene/TF interactions and the expected number of TFs), one cannot expect to construct the full TRN using expression data alone, regardless of the number of expression datasets available.
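The synthetic TRN construction described in the Figure 2 caption can be sketched as follows. The sketch reads the condition in the caption as a test on the product $h_{in} u_n$, which is an interpretation of the extracted text. The caption gives no value for the density parameter e, so any value passed in is illustrative. The function name is hypothetical.

```python
import math
import random


def synthetic_trn(n_genes, n_tf, e, rng, c1=0.02, c2=0.15, c3=5.0):
    """Build a random TRN matrix b[i][n] in {-1, 0, +1}.

    For each TF n:         u_n = c1 + c2 * exp(-c3 * z), z uniform in [0, 1)
    For each gene/TF pair: h_in uniform in [0, 1)
      h_in * u_n < e          -> b = -1 (down regulation)
      e <= h_in * u_n < 2e    -> b = +1 (up regulation)
      otherwise               -> b =  0 (no regulation)
    The product reading of the threshold test is an assumption of this sketch.
    """
    u = [c1 + c2 * math.exp(-c3 * rng.random()) for _ in range(n_tf)]
    b = [[0] * n_tf for _ in range(n_genes)]
    for i in range(n_genes):
        for n in range(n_tf):
            x = rng.random() * u[n]
            if x < e:
                b[i][n] = -1
            elif x < 2 * e:
                b[i][n] = 1
    return b, u


# Illustrative call; e = 0.002 is a made-up density value.
b, u = synthetic_trn(n_genes=50, n_tf=10, e=0.002, rng=random.Random(1))
regulators_per_gene = sum(abs(x) for row in b for x in row) / len(b)
```

Because the two threshold bands have equal width, up and down regulation are equally likely, and increasing e makes the network denser.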
Figure 3  Reconstruction of TRNs. We used Network 1 of Fig. 2 to generate synthetic expression data. Then, we eliminated 50% of the network (randomly), and used FTF to reconstruct the deleted network. a) The percentage of the deleted network recovered as a function of success rate, a measure of the likelihood that an interaction is correct, as estimated from the training set (known interactions). As the number of microarray experiments increases, a higher percentage of the network can be reconstructed. However, full reconstruction requires too many experiments. b) Success rate as a function of the absolute value of the linear correlation between the constructed TF activity profiles and gene expression data.
Figure 4  Effect of TRN properties. a) We used Networks 1, 2 and 3 of Fig. 2 to generate 100 synthetic expression data sets, and eliminated 50% of the gene/TF interactions in the TRN. Shown is the percentage of the deleted network recovered as a function of success rate. As the number of interactions increases, the percentage of the network that can be recovered decreases. b) Same as a) except we used Network 1 and eliminated 25%, 50%, and 75% of the network. As expected, a higher percentage of the deleted network is recoverable when a more complete network is known.