2. Materials and Methods 2.1. Data Drosophila melanogaster (fruit fly) data have been employed in our integrative analysis, with several types of data retrieved from publicly available databases. These include time series data from two platforms (retrieved from the Gene Expression Omnibus (GEO) database [31]), a set of knock-out (KO) microarray experiments, position-specific weight matrices (PSWMs; [32]), known cis-regulatory modules and Gene Ontology (GO) annotations. For model validation, a set of previously known transcriptional interactions extracted from the DROID database (Drosophila Interactions Database; [33]) has been used. A subnetwork of 27 genes involved in embryo development, listed in Table 1, has been analysed. These have been chosen by starting from the main genes involved in segmentation [17], followed by the addition of several genes known to interact with this main set (based on the FlyBase interactions browser [34]). microarrays-04-00255-t001_Table 1 Table 1 Set of 27 genes selected for network analysis for the Drosophila melanogaster dataset. Dual-channel (DC) dataset. This time-course dataset analyses gene expression during Fly embryo development, using dual-channel microarrays (GEO Accession GSE14086 [35]). The dataset contains seven time points sampled at 1- and 2-h intervals, up to 10 h after egg laying. Three biological replicates are available, resulting in three time series in total. Single-channel (SC) dataset. The single-channel dataset [36], measured with Affymetrix arrays, contains gene expression measurements for 12 time points during Drosophila melanogaster embryo development. Samples have been taken every hour up to 12 and a half hours after egg laying. Three biological replicates are included. Both the SC and DC datasets were normalised using cross-platform normalisation [37], which was shown previously to be a valid option for time series data integration [29]. Previously known transcriptional interactions (DROID dataset). For validation purposes, a set of known transcriptional interactions has been retrieved from DROID (Drosophila Interactions Database; [33]), Version 2010_10. This consists of 16 pair-wise interactions between transcription factors and their target genes, for the 27-gene network under analysis. This gold standard was used due to the fact that these interactions are confirmed experimentally. Although it is not the complete network for the 27 genes and it does not include PPIs, it does help to indicate the quality of our models in terms of underlining transcriptional regulation, which is the interest of this study. The exact interactions are included as Supplementary Material. KO dataset. Five KO microarray datasets have been retrieved form the GEO database. These contain knock-out experiments for 8 genes and the corresponding wild-type measurements. The accession numbers for the datasets are GSE23346 ([38], Affymetrix Drosophila Genome 2.0 Array, 6 samples), GSE9889 ([39], Affymetrix Drosophila Genome Array, 20 samples), GSE7772 ([40], Affymetrix Drosophila Genome Array, 4 samples), GSE3854 ([41], Affymetrix Drosophila Genome Array, 54 samples) and GSE14086 ([35], dual-channel array, 63 samples). For these, the log-ratios between knock-out and wild-type expression values have been used within our framework. Binding site affinities (BSAs). A set of PSWMs for 11 transcription factors have been retrieved from [42]. These matrices have been computed using DNA foot-printing data from [43]. In order to compute BSAs using PSWMs, the promoter sequence for each gene is required. For the Drosophila genome, the RedFly database [44] provides a set of known cis-regulatory modules, which have been used here for this purpose. Cis-regulatory modules for 16 genes have been retrieved, while for the other genes, the upstream 2-Kbp sequence has been used to assess BSA. Using both information types, BSAs were computed for use in our algorithm. GO annotations. GO [45] is a database of genes, which have been annotated to have a specific function or to be involved in specific processes. These annotations come from various sources and have been determined using technologies ranging from those in wet-lab experiments to computational methods. The database is a valuable source of meta-information that can be used in different ways. Here, we have used the GO platform to identify which of the gene products involved in the network analysed have been previously shown to display transcription factor activity. Correlations (CORR). All gene expression data related to the genes of interest were combined, and the Pearson product moment correlation coefficient was computed between gene pairs. These data were fed into the EGIA framework. 2.2. GRN Inference Reverse engineering is performed using the platform EGIA (evolutionary computation for GRNs, an integrative algorithm) [30]. This uses evolutionary optimisation to infer artificial neural network (ANN) models of regulation. The model describes the gene expression level of each gene i at a certain time t as the output of a sigmoid unit, with input given by the expression values of the gene’s regulators at time t−1: (1) gi(t)=S(∑j∈Riwijgj(t−1)) where S is the logistic function and Ri is the set of regulators of gene i, while wij are the strengths of the effect of gene j on gene i. The inference problem is divided into two parts: finding the set of regulators Ri for each gene and finding the strength of the regulation (parameters wij). The first is solved using a genetic algorithm, which evolves the topology of the GRN. Each topology is evaluated by training the corresponding ANN using time series gene expression data. This training procedure also solves the second problem of finding interaction weights. The training error (assessed by root mean squared error between the real and simulated expression levels) reflects the fitness of the original topology. The only mandatory type of data for the EGIA platform is time series gene expression data, which give the base fitness of the models. Several measuring platforms and time series can be used, provided these are properly scaled. However, data other than gene expression levels can also be integrated into the optimisation process. The algorithm employs two mechanisms to integrate other data. These accept all types described above or any subset (except for DROID interactions, which are used for model evaluation only). Network structure exploration (NSEx). The EGIA algorithm starts with a set of topologies and permits these to evolve by changing links between genes until better solutions are found. The topologies are obtained by selecting, for each gene, a set of regulators. In a basic genetic algorithm, the selection of the regulators for each gene is random, both when initialising the topologies and when evolving them (initialisation and mutation). EGIA, however, uses non-uniform probability distributions to select regulators, which are based on the additional data. For instance, if a KO experiment shows large log-ratios for a specific gene, this indicates a higher probability link to the silenced gene and increases the probability that this gene is selected as a regulator. Similar mechanisms apply for all data types included, with further details given in [30]. Network structure evaluation (NSEv). It is important to include additional data in the exploration of the space of possible structures, as it speeds up the search for models with more realistic interactions. However, it is the final evaluation of the topologies, during the evolutionary process, that decides which solutions are taken to the next generation. A basic algorithm for this would use only the ability of the model to reproduce time series data. However, it may be the case that models with well-established interactions are not complete enough to simulate the data well, so these would be discarded. In order to reduce this effect, when determining the fitness of the topologies, EGIA uses an additional term, which measures how likely a given topology is, based on the NSEx process probabilities, described previously. In this way, all data types available can be integrated. More detail on the implementation can be found in [30]. Both mechanisms above attempt to reduce the under-determination problem for large GRNs. NSEx drives the algorithm more quickly towards useful areas of the search space, while NSEv ensures the longevity of `partially good’ topologies. Hence, the first mechanism can be seen as a guideline only (a weaker integration criterion), while the second is a stronger integration criterion, since it determines the best model. This means that, in order to augment performance, the first mechanism accepts a wider range of data compared to the second. 2.3. Analysis Given the different integration mechanisms available in EGIA, we performed an analysis of possibilities for data integration. The aim here was to evaluate at which stage each dataset is most useful, in order to provide an optimised strategy for integration. Models obtained were evaluated both qualitatively (the topological structure of the GRN, i.e., pairwise transcriptional interactions) and quantitatively (the ability to reproduce the evolution of gene expression levels over time). Qualitatively, the AUROC (area under the ROC curve) and AUPR (area under the precision-recall curve) [46] are computed, using the set of known transcriptional interactions from the DROID data as the gold standard. Given that our algorithm is stochastic in nature and the model quantitative, predictions of interactions have been performed by using multiple models (obtained in different runs) and employing a `voting procedure’ for possible interactions. In this way, an interaction that appears in more models is considered to be more plausible (this method of voting has been previously used to extract qualitative information in similar problems [47,48]). The set of possible interactions is ranked from the highest to lowest number of votes and used for AUROC/AUPR computation. To evaluate the variability of results for the models, we also computed AUROC/AUPR values for subsets of 9 models at a time and reported the standard deviation over all values. This was to enable significant comparison among data integration strategies. Quantitative evaluation of our results refers to the ability of the models to reproduce continuous levels of gene expression over time, through time series simulation. The model starts from the values of gene expression for the first time point in a time series, then evolves independently to generate a simulated series. This is compared with the original data, by computing the RMSE, which gives a quantitative evaluation of the model. In this paper, we employed a cross-validation approach, where the SC time series dataset was used for training and the DC time series for testing. Thus, we started by extracting models from the SC time series dataset, evaluating their ability to simulate quantitative gene expression levels from the DC dataset. The analysis presented here follows three stages. First, NSEx is employed alone using the available datasets to evaluate whether these are useful for this weaker integration mechanism (Section 3.1). Secondly, NSEv is added to analyse all datasets and to identify the integration scheme with the best results (Section 3.2). Using this scheme, the DC dataset, previously used for quantitative evaluation only, was included in the model extraction phase. The final model thus obtained was qualitatively evaluated only (Section 3.3), since all time series data were used as training data. Table 2 summarises, for the three different analysis stages, how each dataset was used. microarrays-04-00255-t002_Table 2 Table 2 Usage of each dataset at the different analysis stages. 3.