2.2. GRN Inference
Reverse engineering is performed using the platform EGIA (evolutionary computation for GRNs, an integrative algorithm) [30]. This uses evolutionary optimisation to infer artificial neural network (ANN) models of regulation. The model describes the expression level of each gene $i$ at a certain time $t$ as the output of a sigmoid unit, with input given by the expression values of the gene’s regulators at time $t-1$:

$$g_i(t) = S\left(\sum_{j \in R_i} w_{ij}\, g_j(t-1)\right) \tag{1}$$

where $S$ is the logistic function, $R_i$ is the set of regulators of gene $i$, and $w_{ij}$ is the strength of the effect of gene $j$ on gene $i$. The inference problem is divided into two parts: finding the set of regulators $R_i$ for each gene and finding the strength of the regulation (parameters $w_{ij}$). The first is solved using a genetic algorithm, which evolves the topology of the GRN. Each topology is evaluated by training the corresponding ANN using time series gene expression data. This training procedure also solves the second problem of finding interaction weights. The training error (assessed by root mean squared error between the real and simulated expression levels) reflects the fitness of the original topology.
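As an illustration of Equation (1) and the error measure, the following is a minimal sketch and not the EGIA implementation. It assumes that the weights have already been trained, that expression values are scaled to the output range of the logistic function, and that the error is computed one step ahead from the observed values at t−1 (the text does not say whether simulation is one-step or free-running). The names `simulate_step` and `rmse` and the dictionary-based data layout are hypothetical.

```python
import math

def logistic(x):
    """Logistic function S used as the sigmoid unit."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_step(prev, regulators, weights):
    """Apply Equation (1) once.

    prev       -- expression values of all genes at time t-1
    regulators -- dict: gene index i -> list of regulator indices (R_i)
    weights    -- dict: (i, j) -> w_ij, effect of gene j on gene i
    """
    return [
        logistic(sum(weights[(i, j)] * prev[j] for j in regulators[i]))
        for i in range(len(prev))
    ]

def rmse(observed, regulators, weights):
    """Root mean squared error between observed and simulated levels.

    Assumption: each time point is predicted from the observed
    values at the previous time point (one-step-ahead).
    """
    squared_error, count = 0.0, 0
    for t in range(1, len(observed)):
        predicted = simulate_step(observed[t - 1], regulators, weights)
        for p, o in zip(predicted, observed[t]):
            squared_error += (p - o) ** 2
            count += 1
    return math.sqrt(squared_error / count)
```

In this sketch a gene with no regulators receives zero input, so its simulated level is S(0) = 0.5; how EGIA treats such genes is not stated in the text.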
The only mandatory type of data for the EGIA platform is time series gene expression data, which give the base fitness of the models. Several measuring platforms and time series can be used, provided these are properly scaled. However, data other than gene expression levels can also be integrated into the optimisation process. The algorithm employs two mechanisms to integrate other data. These accept all types described above or any subset (except for DROID interactions, which are used for model evaluation only).
Network structure exploration (NSEx). The EGIA algorithm starts with a set of topologies and permits these to evolve by changing links between genes until better solutions are found. The topologies are obtained by selecting, for each gene, a set of regulators. In a basic genetic algorithm, the regulators for each gene are selected uniformly at random, both when initialising the topologies and when mutating them. EGIA, however, selects regulators using non-uniform probability distributions derived from the additional data. For instance, if a KO experiment shows a large log-ratio for a specific gene, this suggests a link from the silenced gene to that gene, so the probability that the silenced gene is selected as one of its regulators is increased. Similar mechanisms apply for all data types included, with further details given in [30].
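The idea of biased regulator selection can be sketched as follows. This is an illustrative assumption and not the scheme defined in [30]: evidence scores (for example, absolute KO log-ratios) are added to a uniform baseline and normalised, and regulators are then drawn without replacement in proportion to the result. The function names, the `baseline` parameter and the scoring rule are all hypothetical.

```python
import random

def regulator_probabilities(evidence, baseline=1.0):
    """Turn non-negative evidence scores into selection probabilities.

    evidence -- dict: candidate regulator -> score, e.g. the absolute
                log-ratio of the target gene when that candidate is
                knocked out (hypothetical scoring rule)
    baseline -- constant added to every score so that candidates
                without evidence can still be selected
    """
    raw = {gene: baseline + score for gene, score in evidence.items()}
    total = sum(raw.values())
    return {gene: value / total for gene, value in raw.items()}

def sample_regulators(probs, k, rng):
    """Draw up to k distinct regulators, proportionally to probs."""
    remaining = dict(probs)
    chosen = []
    for _ in range(min(k, len(remaining))):
        genes = list(remaining)
        pick = rng.choices(genes, weights=[remaining[g] for g in genes], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement
    return chosen

# Usage: gene "a" has strong KO evidence, gene "b" has none.
probs = regulator_probabilities({"a": 3.0, "b": 0.0})
regulators = sample_regulators(probs, k=1, rng=random.Random(0))
```

With `baseline=0` a candidate lacking evidence could never be chosen; keeping it positive preserves the exploratory character of the genetic algorithm, which matches the description of this mechanism as a guideline only.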
Network structure evaluation (NSEv). It is important to include additional data in the exploration of the space of possible structures, as it speeds up the search for models with more realistic interactions. However, it is the final evaluation of the topologies, during the evolutionary process, that decides which solutions are taken to the next generation. A basic algorithm for this would use only the ability of the model to reproduce time series data. However, it may be the case that models with well-established interactions are not complete enough to simulate the data well, so these would be discarded. In order to reduce this effect, when determining the fitness of the topologies, EGIA uses an additional term, which measures how likely a given topology is, based on the NSEx process probabilities, described previously. In this way, all data types available can be integrated. More detail on the implementation can be found in [30].
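One way to picture such a fitness function is sketched below. The exact form of the additional term is given in [30] and is not reproduced here; this sketch assumes, purely for illustration, that the term is the mean log-probability of the links in the topology under the NSEx distributions, combined with the training error through a hypothetical weight `alpha`. Lower values are better.

```python
import math

def topology_log_likelihood(regulators, link_probs, eps=1e-9):
    """Mean log-probability of the links present in a topology.

    regulators -- dict: gene i -> list of its regulators j
    link_probs -- dict: (i, j) -> NSEx selection probability of j
                  as a regulator of i (assumed to be available)
    eps        -- floor to avoid log(0) for links with no support
    """
    total, n_links = 0.0, 0
    for i, regs in regulators.items():
        for j in regs:
            total += math.log(max(link_probs.get((i, j), 0.0), eps))
            n_links += 1
    return total / n_links if n_links else 0.0

def combined_fitness(training_rmse, regulators, link_probs, alpha=0.1):
    """Training error plus a penalty for implausible topologies.

    The log-likelihood is non-positive, so subtracting it adds a
    penalty that shrinks as the links become better supported.
    """
    return training_rmse - alpha * topology_log_likelihood(regulators, link_probs)
```

Under this assumption, two topologies with the same training error are ranked by how well their links are supported, so a topology built from well-established interactions can survive selection even when its fit to the time series is not yet the best in the population.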
Both mechanisms above attempt to reduce the under-determination problem for large GRNs. NSEx drives the algorithm more quickly towards useful areas of the search space, while NSEv ensures the longevity of ‘partially good’ topologies. Hence, the first mechanism can be seen as a guideline only (a weaker integration criterion), while the second is a stronger integration criterion, since it determines the best model. This means that, in order to augment performance, the first mechanism accepts a wider range of data compared to the second.