Methods

Network definition

The TRN we seek to discover is a list of genes, for each of which a set of TFs with up or down regulation is provided ($b_{in}$ = +1/-1 for gene i up/down regulated by TF n). The gene-gene regulation network that is often considered is implied by this definition, because the components of each TF and the genes that encode them are also included in our TRNs. This TRN definition provides a unifying framework for all the individual TRN discovery methods we developed, as well as a methodology for the integration of multiple methods. We use multiple methodologies to suggest enhanced TRNs based on three hypotheses, and a training set TRN to test them. The result of each methodology is weighted in proportion to its success rate on the training set. This approach goes beyond studies that focus on gene-gene networks, as it provides more detailed information (such as gene A is up regulated by TF B) that can be tested experimentally and used in medical and biotechnical applications. We demonstrate that methodologies such as gene ontology and phylogenic similarity provide better results when a preliminary set of gene/TF interactions is used instead of a training set of gene-gene data. A simple algorithm, described below, is used to calculate gene-TF scores from gene-gene similarity scores and a preliminary TRN. In addition, we use a novel approach that first approximates TF activity profiles using the preliminary TRN and gene expression data, and then uses these TF activities to suggest additional gene/TF interactions via a gene-TF correlation scheme.

From gene-gene scores to gene-TF scores

Two of the methodologies (GO and phylogeny) used in this study generate gene-gene similarity scores. As our interest is the discovery of TRNs as defined above, the question is how one can use the gene-gene similarity scores and the preliminary TRN to score gene/TF interactions. For a system of $N_{gene}$ genes, there are $N_{gene} \times (N_{gene} - 1)/2$ gene-gene pairs.
To find the score for gene A and TF B, we first identify all genes regulated by TF B in the preliminary TRN. Then we calculate the gene-gene similarity score between the gene of interest and each gene regulated by TF B. We assign the maximum of these scores to the gene A/TF B interaction. Although this appears to be a rough estimate of the gene-TF score, our computational experiments with gene-gene similarity based on gene ontology and phylogeny have shown that this score clearly distinguishes the probability distributions of the training and random sets of gene/TF interactions.

Gene ontology analysis

In this analysis we use the biological process ontology developed by the Gene Ontology (GO) consortium [21,22] and the GO annotations from EMBL-EBI [23], and hypothesize that the likelihood for a gene pair to be regulated in the same manner increases with the similarity of their GO descriptions. GO analysis was proposed by [20], who applied it to find functional modules in E. coli. However, here a training set of gene/TF interactions is used instead of a gene-gene pair-based one. In particular, we use a preliminary E. coli TRN and transform the gene-gene scores to gene-TF scores. Each GO ontology is structured as a directed acyclic graph. The GO similarity score between two gene products is based on the number of shared ancestors. As a gene product might be assigned multiple GO terms, we seek the maximum similarity score over all possible combinations. Let gene i and gene j be assigned $h_i$ and $h_j$ GO terms, respectively. Then the GO similarity for the gene (i, j) pair is taken to be the maximum number of shared ancestors over all combinations of the $h_i$ and $h_j$ terms.

Phylogenic similarity analysis

Phylogenic similarity analysis, also proposed by [20], is based on the hypothesis that a pair of genes with a large phylogenic similarity score is likely to be in the same functional operon, regulon or pathway.
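Both the GO analysis and the phylogenic analysis rely on the same step: a gene-gene similarity is turned into a gene-TF score by taking the maximum over the genes that the TF regulates in the preliminary TRN. The following Python sketch illustrates that step together with a shared-ancestor GO similarity. It is an illustration under stated assumptions, not the authors' code: the function names, the dictionary-based data layout, the choice to count a term as its own ancestor, and the exclusion of the query gene from its own comparison set are all hypothetical.

```python
from itertools import product


def ancestors(term, parents):
    """Return all ancestors of a GO term in a DAG given as {term: [parent terms]}.

    Assumption: a term is counted as its own ancestor.
    """
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def go_similarity(terms_i, terms_j, parents):
    """Maximum number of shared ancestors over all term combinations."""
    return max(
        (len(ancestors(a, parents) & ancestors(b, parents))
         for a, b in product(terms_i, terms_j)),
        default=0,
    )


def gene_tf_score(gene, tf, prelim_trn, similarity):
    """Score a gene/TF pair as the maximum gene-gene similarity between
    `gene` and the genes regulated by `tf` in the preliminary TRN.

    prelim_trn: {tf: set of regulated genes}
    similarity: callable (gene_a, gene_b) -> number
    Assumption: the query gene itself is left out of the comparison set.
    Returns None when the TF has no other known targets.
    """
    targets = [g for g in prelim_trn.get(tf, ()) if g != gene]
    if not targets:
        return None
    return max(similarity(gene, g) for g in targets)


# Toy data (entirely made up for illustration).
toy_parents = {"A1": ["A"], "A2": ["A"], "A": ["root"], "B": ["root"]}
toy_annotation = {"g1": ["A1"], "g2": ["A2"], "g3": ["B"], "g4": ["A1", "B"]}
toy_trn = {"TF1": {"g2", "g3"}}


def toy_similarity(gene_a, gene_b):
    return go_similarity(toy_annotation[gene_a], toy_annotation[gene_b], toy_parents)
```

With the toy data, `gene_tf_score("g1", "TF1", toy_trn, toy_similarity)` compares g1 with g2 and g3 and keeps the larger of the two similarities. Passing the similarity as a callable lets the same rule be reused with a phylogenic score.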
Our implementation differs in that we suggest that if two genes have a high phylogenic similarity score, then they would be regulated in the same manner by the same set of TFs. Based on this hypothesis we extend the preliminary TRN. Our approach to calculating phylogenic similarity for gene-gene pairs follows the methodology proposed by [20] (referred to as 'likelihood of neighboring profiles' in their work). In this analysis all bacterial sequence information was downloaded from [24] and all preliminary gene/TF interactions are from [14]. Once we have phylogenic similarity scores for all gene pairs, we calculate the gene/TF scores using the methodology described in the section 'From gene-gene scores to gene-TF scores'.

Calculation of the phylogenic similarity

We first construct a vector for each gene in E. coli, the dimension of the vector being the number of genomes used in the analysis (in this study 229). We applied BLASTP to identify probable orthologous genes of a target genome in 229 reference genomes. The most significant BLASTP hit from each reference species was considered the true ortholog of the target gene if the expectation value was less than 1.0e-10 [25]. If there is an orthologous gene in the ith genome, then the ith entry in this vector is assigned the order of the orthologous gene in the ith genome. If an orthologous gene does not exist in the ith genome, then this entry is taken to be 0. Once such a vector is constructed for each E. coli gene, we compute a phylogenic similarity measure for each gene pair. Given two vectors $X_i = [x_{i1}, x_{i2}, \ldots, x_{i229}]$ for gene i and similarly $X_j$ for gene j, we use the following phylogenic similarity measure for a gene pair:

$$
S_{ij}^{PHY} = -\sum_{k=1}^{229} \log\left[ P(x_{ik}, x_{jk}) \right]. \tag{1}
$$

Here $P(x_{ik}, x_{jk})$, the likelihood of genes i and j, is calculated from

$$
P(x_{ik}, x_{jk}) =
\begin{cases}
(1 - p_{ik})(1 - p_{jk}) & \text{if } x_{ik} = 0 \text{ and } x_{jk} = 0 \\
p_{ik}(1 - p_{jk}) & \text{if } x_{ik} \neq 0 \text{ and } x_{jk} = 0 \\
(1 - p_{ik})\,p_{jk} & \text{if } x_{ik} = 0 \text{ and } x_{jk} \neq 0 \\
p_{ik}\,p_{jk}\,\dfrac{d(x_{ik}, x_{jk})\,\left(2N_k - d(x_{ik}, x_{jk}) - 1\right)}{N_k (N_k - 1)} & \text{if } x_{ik} \neq 0 \text{ and } x_{jk} \neq 0
\end{cases}
\tag{2}
$$

where $p_{ik}$ is the probability that gene i is present in genome k, $N_k$ is the total number of genes in reference genome k, and $d(x_{ik}, x_{jk}) = |x_{ik} - x_{jk}|$. To calculate $p_{ik}$, we grouped the 229 reference genomes into subgroups based on information gathered from [26,27] (see Table 1). It is assumed that $p_{ik}$ is identical within each subgroup for each gene. Then $p_{ik}$ is taken to be the ratio of the number of genomes in the subgroup that have an ortholog of gene i to the total number of genomes in the subgroup.

Table 1. The list of bacteria used in the phylogenic similarity analysis.

Subgroup | Bacteria
Actinobacteria | Bifidobacterium longum NCC2705, Corynebacterium diphtheriae NCTC 13129, Corynebacterium efficiens YS-314, Corynebacterium glutamicum ATCC13032, Corynebacterium glutamicum ATCC 13032, Leifsonia xyli subsp. xyli str. CTCB07, Mycobacterium avium subsp. paratuberculosis str. k10, Mycobacterium bovis AF2122/97, Mycobacterium leprae TN, Mycobacterium tuberculosis H37Rv, Mycobacterium tuberculosis CDC1551, Nocardia farcinica IFM 10152, Propionibacterium acnes KPA171202, Streptomyces avermitilis MA-4680, Streptomyces coelicolor A3(2), Symbiobacterium thermophilum IAM 14863, Tropheryma whipplei TW08/27, Tropheryma whipplei str. Twist
Aquificae | Aquifex aeolicus VF5
Bacteroidetes | Bacteroides fragilis YCH46, Bacteroides fragilis NCTC 9343, Bacteroides thetaiotaomicron VPI-5482, Porphyromonas gingivalis W83
Cyanobacteria | Prochlorococcus marinus subsp. marinus str. CCMP1375, Prochlorococcus marinus str. MIT 9313
Chlamydiae | Chlamydophila abortus S26/3, Chlamydia muridarum Nigg, Chlamydia trachomatis D/UW-3/CX, Chlamydophila caviae GPIC, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Chlamydophila pneumoniae TW-183, Parachlamydia sp.
UWE25
Chlorobi | Chlorobium tepidum TLS
Chloroflexi | Dehalococcoides ethenogenes 195
Crenarchaeota | Aeropyrum pernix K1, Pyrobaculum aerophilum str. IM2, Sulfolobus solfataricus P2, Sulfolobus tokodaii str. 7
Cyanobacteria | Gloeobacter violaceus PCC 7421, Nostoc sp. PCC 7120, Prochlorococcus marinus subsp. pastoris str. CCMP1986, Synechococcus elongatus PCC 6301, Synechococcus sp. WH 8102, Synechocystis sp. PCC 6803, Thermosynechococcus elongatus BP-1
Deinococcus-Thermus | Deinococcus radiodurans R1, Thermus thermophilus HB27, Thermus thermophilus HB8
Euryarchaeota | Archaeoglobus fulgidus DSM 4304, Haloarcula marismortui ATCC 43049, Halobacterium sp. NRC-1, Methanothermobacter thermautotrophicus str. Delta H, Methanocaldococcus jannaschii DSM 2661, Methanococcus maripaludis S2, Methanopyrus kandleri AV19, Methanosarcina acetivorans C2A, Methanosarcina mazei Go1, Picrophilus torridus DSM 9790, Pyrococcus abyssi GE5, Pyrococcus furiosus DSM 3638, Pyrococcus horikoshii OT3, Thermococcus kodakaraensis KOD1, Thermoplasma acidophilum DSM 1728, Thermoplasma volcanium GSS1
Firmicutes | Bacillus anthracis str. Ames, Bacillus anthracis str. 'Ames Ancestor', Bacillus anthracis str. Sterne, Bacillus cereus ATCC 14579, Bacillus cereus ATCC 10987, Bacillus cereus ZK, Bacillus clausii KSM-K16, Bacillus halodurans C-125, Bacillus licheniformis ATCC 14580, Bacillus subtilis subsp. subtilis str. 168, Bacillus thuringiensis serovar konkukian str. 97-27, Clostridium acetobutylicum ATCC 824, Clostridium perfringens str. 13, Clostridium tetani E88, Enterococcus faecalis V583, Geobacillus kaustophilus HTA426, Lactobacillus acidophilus NCFM, Lactobacillus johnsonii NCC 533, Lactobacillus plantarum WCFS1, Lactococcus lactis subsp. lactis Il1403, Listeria innocua Clip11262, Listeria monocytogenes EGD-e, Listeria monocytogenes str. 4b F2365, Mesoplasma florum L1, Mycoplasma gallisepticum R, Mycoplasma genitalium G-37, Mycoplasma hyopneumoniae 232, Mycoplasma mobile 163K, Mycoplasma mycoides subsp.
mycoides SC str. PG1, Mycoplasma penetrans HF-2, Mycoplasma pneumoniae M129, Mycoplasma pulmonis UAB CTIP, Oceanobacillus iheyensis HTE831, Onion yellows phytoplasma OY-M, Staphylococcus aureus subsp. aureus COL, Staphylococcus aureus subsp. aureus MW2, Staphylococcus aureus subsp. aureus Mu50, Staphylococcus aureus subsp. aureus N315, Staphylococcus aureus subsp. aureus MRSA252, Staphylococcus aureus subsp. aureus MSSA476, Staphylococcus epidermidis ATCC 12228, Staphylococcus epidermidis RP62A, Streptococcus agalactiae 2603V/R, Streptococcus agalactiae NEM316, Streptococcus mutans UA159, Streptococcus pneumoniae R6, Streptococcus pneumoniae TIGR4, Streptococcus pyogenes M1 GAS, Streptococcus pyogenes MGAS10394, Streptococcus pyogenes MGAS315, Streptococcus pyogenes MGAS8232, Streptococcus pyogenes SSI-1, Streptococcus thermophilus CNRZ1066, Streptococcus thermophilus LMG 18311, Thermoanaerobacter tengcongensis MB4, Ureaplasma parvum serovar 3 str. ATCC 700970
Fusobacteria | Fusobacterium nucleatum subsp. nucleatum ATCC 25586
Nanoarchaeota | Nanoarchaeum equitans Kin4-M
Planctomycetes | Rhodopirellula baltica SH 1
Proteobacteria | Acinetobacter sp. ADP1, Agrobacterium tumefaciens str. C58, Agrobacterium tumefaciens str. C58, Anaplasma marginale str. St. Maries, Azoarcus sp. EbN1, Bartonella henselae str. Houston-1, Bartonella quintana str. Toulouse, Bdellovibrio bacteriovorus HD100, Candidatus Blochmannia floridanus, Bordetella bronchiseptica RB50, Bordetella parapertussis 12822, Bordetella pertussis Tohama I, Bradyrhizobium japonicum USDA 110, Brucella abortus biovar 1 str. 9–941, Brucella melitensis 16M, Brucella suis 1330, Buchnera aphidicola str. Bp (Baizongia pistaciae), Buchnera aphidicola str. Sg (Schizaphis graminum), Buchnera aphidicola str. APS (Acyrthosiphon pisum), Burkholderia mallei ATCC 23344, Burkholderia pseudomallei K96243, Campylobacter jejuni subsp.
jejuni NCTC 11168, Campylobacter jejuni RM1221, Caulobacter crescentus CB15, Chromobacterium violaceum ATCC 12472, Coxiella burnetii RSA 493, Desulfotalea psychrophila LSv54, Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough, Ehrlichia ruminantium str. Gardel, Ehrlichia ruminantium str. Welgevonden, Ehrlichia ruminantium str. Welgevonden, Erwinia carotovora subsp. atroseptica SCRI1043, Escherichia coli CFT073, Escherichia coli K12, Escherichia coli O157:H7 EDL933, Escherichia coli O157:H7, Francisella tularensis subsp. tularensis Schu 4, Gluconobacter oxydans 621H, Geobacter sulfurreducens PCA, Haemophilus ducreyi 35000HP, Haemophilus influenzae Rd KW20, Helicobacter hepaticus ATCC 51449, Helicobacter pylori 26695, Helicobacter pylori J99, Idiomarina loihiensis L2TR, Legionella pneumophila str. Lens, Legionella pneumophila str. Paris, Legionella pneumophila subsp. pneumophila str. Philadelphia 1, Mannheimia succiniciproducens MBEL55E, Mesorhizobium loti MAFF303099, Methylococcus capsulatus str. Bath, Neisseria gonorrhoeae FA 1090, Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Nitrosomonas europaea ATCC 19718, Pasteurella multocida subsp. multocida str. Pm70, Photobacterium profundum SS9, Photorhabdus luminescens subsp. laumondii TTO1, Pseudomonas aeruginosa PAO1, Pseudomonas putida KT2440, Pseudomonas syringae pv. syringae B728a, Pseudomonas syringae pv. tomato str. DC3000, Ralstonia solanacearum GMI1000, Rhodopseudomonas palustris CGA009, Rickettsia conorii str. Malish 7, Rickettsia prowazekii str. Madrid E, Rickettsia typhi str. Wilmington, Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67, Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150, Salmonella enterica subsp. enterica serovar Typhi str. CT18, Salmonella enterica subsp. enterica serovar Typhi Ty2, Salmonella typhimurium LT2, Shewanella oneidensis MR-1, Shigella flexneri 2a str.
301, Silicibacter pomeroyi DSS-3, Sinorhizobium meliloti 1021, Shigella flexneri 2a str. 2457T, Vibrio cholerae O1 biovar eltor str. N16961, Vibrio fischeri ES114, Vibrio parahaemolyticus RIMD 2210633, Vibrio vulnificus CMCP6, Vibrio vulnificus YJ016, Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis, Wolbachia endosymbiont strain TRS of Brugia malayi, Wolbachia endosymbiont of Drosophila melanogaster, Wolinella succinogenes DSM 1740, Xanthomonas campestris pv. campestris str. ATCC 33913, Xylella fastidiosa 9a5c, Xanthomonas axonopodis pv. citri str. 306, Xanthomonas campestris pv. campestris str. 8004, Xanthomonas oryzae pv. oryzae KACC10331, Xylella fastidiosa Temecula1, Yersinia pestis biovar Medievalis str. 91001, Yersinia pestis CO92, Yersinia pestis KIM, Yersinia pseudotuberculosis IP 32953, Zymomonas mobilis subsp. mobilis ZM4
Spirochaetes | Borrelia burgdorferi B31, Borrelia garinii PBi chromosome linear, Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130, Leptospira interrogans serovar Lai str. 56601, Treponema denticola ATCC 35405, Treponema pallidum subsp. pallidum str. Nichols
Thermotogae | Thermotoga maritima MSB8

Microarray analysis

Kinetic cell models hold great promise for predicting cell behavior [28-32]. Unfortunately there is a lack of information about many of the rate and equilibrium constants for the reaction and transport processes involved [33,34]. Simultaneously calibrating all the reaction/transport rate parameters and discovering the gene/TF interaction network structure from available data does not appear to be feasible. Therefore, instead of using a kinetic approach as a basis of TRN construction, we have developed FTF (Fast Transcription Factor analyzer) for network construction via (1) TF activity estimation, (2) statistical arguments, and (3) a preliminary TRN.
Once a reliable TRN is obtained using FTF, it can then be used to calibrate the rate and equilibrium constants that appear in transcription/translation kinetic models. An example of such an approach is available at [35]. FTF was designed based on the following notions:

• a method based on TFs has the advantage that microarray noise, and errors in the preliminary TRN, can be overcome by statistics, i.e. the regulation of many genes by a given TF;
• due to data uncertainty, there is usually not enough information content in a single-gene response to unambiguously determine the effect of all TFs regulating that gene; and
• TRN discovery requires many automated trials of possible networks, so the algorithm must be efficient.

Calculation of TF activities using FTF

The essential equation on which FTF is based was arrived at empirically after extensive numerical experimentation with synthetic data. Because the data are synthetic, we know the true TRN, the TF activities, and the nature of the noise added to the expression data, and could thereby quantitatively assess the accuracy of FTF predictions.
FTF is based on the following ansatz:

$$
T_n^r - T_n^s = \sum_{i=1}^{N_{gene}} H(m_i^r - m_i^s)\, b_{in}\, \Psi_{in}, \tag{3}
$$

where $T_n^r$ = activity of TF n at condition or time r, $m_i^r$ = microarray response of gene i at condition r, $b_{in}$ = TRN ($b_{in}$ = +1/-1 for gene i up/down regulated by TF n, $b_{in}$ = 0 for no regulation), $H(x)$ = +1 for x > 0, -1 for x < 0, and 0 for x = 0, and $\Psi_{in} = 2^{L_i}/(M_n(2^{L_i} - 1))$ for $L_i$ = number of TFs controlling gene i and $M_n$ = number of genes TF n regulates. If there are $N_{expression}$ times or conditions, then eq. (3) constitutes $N_{expression} \times (N_{expression} - 1)/2$ equations for the $N_{expression}$ activities $T_n^r$ for each of the TFs. Therefore, the problem is overdetermined. In our approach the problem is solved via normal equations, i.e. using a least squares approach, so that all the expression data are utilized and statistics can thereby help to overcome data uncertainty.

Once TF activities are calculated in this manner, the linear (Pearson) correlation is calculated for all possible gene-TF pairs. This serves as a score used to construct probability distributions for the training set (known gene/TF interactions) and the random set (all possible gene/TF pairs). Comparison of these probability distributions indicates how consistent the preliminary TRN and the expression data are, and to what degree we can rely on the predictions of FTF. If the preliminary TRN is too small or of poor quality, or if there are too few expression datasets, the training and random set probability distributions are difficult to distinguish. The scores can also be used to rank genes by how likely their expression data are to be inconsistent with the preliminary TRN.

To test FTF we generated a TRN that consists of 1000 genes and 100 TFs. The properties of the TRN are shown in Fig. 2. The synthetic expression data were generated by assuming random TF activities. Expression data for gene i were generated using

$$
m_i^r = \sum_{n=1}^{N_{TF}} Q_{in}\, b_{in}\, T_n^r .
$$

Here, $m_i^r$ is the expression level of gene i at experiment r, $T_n^r$ is the activity of TF n at experiment r, $N_{TF}$ is the number of TFs, and $Q_{in}$ is a measure of the binding affinity of TF n and gene i.

Figure 2. Properties of TRNs used in the synthetic examples. Networks that consist of 1000 genes and 100 TFs are generated using the probability distribution for the number of genes regulated by a given TF shown in (a). The corresponding probability distribution for the number of regulators per gene is shown in (b). The average number of regulators per gene is 3.62, 5.22, and 7.02 for Networks 1, 2 and 3, respectively. Equal likelihood is chosen for up versus down regulation.
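The FTF steps described above (eq. (3) solved in the least squares sense, followed by gene-TF Pearson correlation) and the noise-free synthetic expression model can be sketched in Python with NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, `numpy.linalg.lstsq` is used in place of explicitly formed normal equations, and because eq. (3) only constrains activity differences, the returned activities are the minimum-norm solution, which fixes each TF's activities to sum to zero.

```python
import numpy as np
from itertools import combinations


def synthetic_expression(b, T, Q):
    """Noise-free synthetic data: m_i^r = sum_n Q_in * b_in * T_n^r."""
    return (Q * b) @ T


def ftf_activities(m, b):
    """Estimate TF activities from eq. (3) by least squares.

    m: (n_genes, n_cond) expression matrix
    b: (n_genes, n_tfs) preliminary TRN with entries in {-1, 0, +1}
    Returns an (n_tfs, n_cond) array, defined up to an additive constant
    per TF (lstsq returns the zero-sum, minimum-norm choice).
    """
    n_cond = m.shape[1]
    L = np.count_nonzero(b, axis=1)  # number of TFs controlling each gene
    M = np.count_nonzero(b, axis=0)  # number of genes each TF regulates
    with np.errstate(divide="ignore", invalid="ignore"):
        gene_factor = np.where(L > 0, 2.0**L / (2.0**L - 1.0), 0.0)
        tf_factor = np.where(M > 0, 1.0 / M, 0.0)
    w = b * gene_factor[:, None] * tf_factor[None, :]  # b_in * Psi_in

    # One equation per unordered pair of conditions (r, s).
    pairs = list(combinations(range(n_cond), 2))
    A = np.zeros((len(pairs), n_cond))
    rhs = np.zeros((len(pairs), b.shape[1]))
    for row, (r, s) in enumerate(pairs):
        A[row, r], A[row, s] = 1.0, -1.0
        rhs[row] = np.sign(m[:, r] - m[:, s]) @ w
    T_hat, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return T_hat.T


def gene_tf_correlation(m, T):
    """Pearson correlation for every gene/TF pair, shape (n_genes, n_tfs).

    Assumes no row of m or T is constant (zero variance).
    """
    mz = (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)
    Tz = (T - T.mean(axis=1, keepdims=True)) / T.std(axis=1, keepdims=True)
    return mz @ Tz.T / m.shape[1]
```

A typical use would generate `m = synthetic_expression(b, T, Q)` for a random network, call `ftf_activities(m, b)`, and then compare the correlation scores of known interactions with those of all gene/TF pairs.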
To construct a synthetic TRN, for each TF we assigned $u_n = c_1 + c_2 e^{-c_3 z}$, where $c_1$, $c_2$, $c_3$ are constants (taken to be 0.02, 0.15, and 5, respectively) and z is a random number (between 0 and 1). Then for each gene/TF pair, we assigned a random number $h_{in}$ (between 0 and 1). For parameter e, which determines how dense the synthetic TRN is, if $h_{in} u_n$ [...] > 1, an interaction with a score R for a given method is highly likely to be correct. These Bayesian ratios are computed for each method and gene/TF interaction. The weighted sum of the log10 of these ratios is taken to be the multi-method confidence measure $K_{in}$:

$$
K_{in} = \sum_{k=1}^{N_{meth}} w_k \log_{10}\left( \frac{f_{tr}^k(R_{in}^k)}{f_{rand}^k(R_{in}^k)} \right) \tag{4}
$$

where $w_k$ is a weighting factor, $N_{meth}$ is the number of TRN construction methodologies, $R_{in}^k$ is the score for TF n and gene i using methodology k, and $f_{tr}^k$ and $f_{rand}^k$ are the probability distributions for the training set and random set, respectively. If a methodology fails to provide a prediction for a gene-TF pair, it is excluded from the above calculation. The weighting factors are taken to be unity in this study.
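The multi-method confidence measure of eq. (4) can be illustrated with a short Python sketch. The text does not specify how the probability distributions are estimated, so the histogram-based density estimate, the small pseudo-value `eps` that avoids division by zero, and the function names below are all assumptions made for illustration only. Methods without a prediction are represented by `None` and skipped, as described above.

```python
import numpy as np


def log10_density_ratio(train_scores, random_scores, query, bins=20, eps=1e-6):
    """log10 of f_tr(query) / f_rand(query) from histogram density estimates.

    Assumptions: both densities share one set of equal-width bins spanning
    all scores, and eps is added to each density to avoid dividing by zero.
    """
    train_scores = np.asarray(train_scores, dtype=float)
    random_scores = np.asarray(random_scores, dtype=float)
    lo = min(train_scores.min(), random_scores.min())
    hi = max(train_scores.max(), random_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    f_tr, _ = np.histogram(train_scores, bins=edges, density=True)
    f_rand, _ = np.histogram(random_scores, bins=edges, density=True)
    # Locate the bin containing the query score (clipped to the valid range).
    idx = int(np.clip(np.searchsorted(edges, query, side="right") - 1, 0, bins - 1))
    return float(np.log10((f_tr[idx] + eps) / (f_rand[idx] + eps)))


def combined_confidence(scores, reference_sets, weights=None):
    """K_in of eq. (4) for one gene/TF pair.

    scores: one score per method, or None where a method has no prediction
    reference_sets: one (train_scores, random_scores) pair per method
    weights: one w_k per method (unity if omitted, as in the study)
    Returns None if no method provides a prediction.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    terms = [
        w * log10_density_ratio(train, rand, score)
        for score, (train, rand), w in zip(scores, reference_sets, weights)
        if score is not None
    ]
    return sum(terms) if terms else None
```

A positive value indicates that the training-set density exceeds the random-set density at the observed scores, so the interaction is more likely to be correct than a randomly chosen gene/TF pair.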