3. Results and Discussion

3.1. Integration for NSEx
The first data integration experiment analysed the effects of the NSEx mechanism, which integrates the additional datasets for exploring the structure of the interaction network. To identify which of the datasets described in Section 2.1 is most useful, different variants of NSEx were employed to assess the contribution of each data type separately, followed by the integration of all types. Hence, five different variants of NSEx were derived and compared to the algorithm using the SC time series data only: SC+NSEx.KO (using knock-out experiments for NSEx), SC+NSEx.GO (using GO annotations for NSEx), SC+NSEx.BSA (using binding site affinities for NSEx), SC+NSEx.CORR (using correlation among genes for NSEx) and SC+NSEx.ALL (using all data for NSEx).
Table 3 displays AUROC and AUPR values for the five NSEx variants compared to the SC algorithm (using time-series data only). Standard deviations for all values are also included, computed from nine out of ten runs at a time (bootstrapping) and showing very low variability of results. In terms of individual datasets, the set of predicted transcriptional interactions improves when including BSA and KO data, while GO and CORR data seem to have no effect or impact negatively on the interaction quality. However, the combined effect of all datasets does appear to achieve significant improvement in network topology, indicating that even weak data integration has some value, and that collectively, the dataset types can offer enhanced insight.
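As a reminder of what the two qualitative measures capture, the following minimal sketch computes AUROC and AUPR for a ranked list of candidate interactions against a gold standard. It is illustrative only: the function names `auroc` and `aupr`, the toy scores and the toy labels are assumptions made here, AUPR is approximated by average precision, and the evaluation procedure actually used is the one described in Section 2.3.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels[i] is 1 for a true interaction and 0 otherwise.
    Tied scores receive the average rank of their tie block.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the block of scores tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative edge")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def aupr(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the precision-recall curve, estimated as average precision.

    With tied scores the result depends on the (stable) sort order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive edge")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # descending
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at each recovered true edge
    return total / n_pos


# Toy example (hypothetical values): five candidate edges, two of them true.
edge_scores = [0.9, 0.8, 0.35, 0.3, 0.1]
gold_standard = [1, 0, 1, 0, 0]
toy_auroc = auroc(edge_scores, gold_standard)
toy_aupr = aupr(edge_scores, gold_standard)
```

Because true interactions are typically a small minority of all possible gene pairs, AUPR is expected to be much lower in absolute terms than AUROC, which is consistent with the magnitudes quoted later in this section.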
Table 3  Algorithm incorporating NSEx. Qualitative results: AUROC and AUPR values obtained after 10 runs with each algorithm and, in parentheses, standard deviations for subsets of 9 runs (see Section 2.3 for details on how these were computed). Variants: SC (SC time series only, without integration of additional data), SC+NSEx.KO (using knock-out experiments for NSEx), SC+NSEx.GO (using GO annotations for NSEx), SC+NSEx.BSA (using binding site affinities for NSEx), SC+NSEx.CORR (using gene-correlations for NSEx) and SC+NSEx.ALL (using all data for NSEx). For additional datasets, BSA followed by KO lead to improved sets of interactions, while CORR affects selection adversely. However, the combined effect of all data types provides optimal inference of the interaction set.
Quantitative analysis was also performed, with RMSE values shown in Figure 1. No improvement in terms of the simulation capability of the models was obtained, although if descriptions of interactions can be improved, so too should gene expression pattern simulation. The current limitations may be due to persistent distortion of the fitness landscape by noise or may be inherently linked to under-determination. The evaluation criterion based on SC data only is crude, so that reverse engineering becomes increasingly fuzzy. This argues strongly for inclusion of further data types in model selection, through the NSEv mechanism.
Figure 1  Algorithm enhanced with NSEx: quantitative results. The graph shows the distribution (over 10 runs) of RMSE on test data (DC dataset) for models obtained with algorithm variants (as for Table 3). A t-test performed for each enhanced version to compare performance to that of the basic SC variant gave p-values as shown. No significant change was observed in RMSE values after integration.
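The quantitative comparison can be sketched as follows: an RMSE is computed per run on the test data, and the per-run RMSE values of two algorithm variants are then compared with a t-test. The text does not state which t-test variant was used, so the sketch below assumes Welch's unequal-variance form; the names `rmse` and `welch_t` are hypothetical, and converting the statistic into a p-value (not shown) requires the Student t distribution from a statistics library.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean squared error between simulated and measured expression."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a and b hold per-run RMSE values for two algorithm variants
    (assumption: runs are independent; each sample has >= 2 values
    and the two samples are not both constant).
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A t statistic close to zero relative to its degrees of freedom corresponds to the situation reported in Figure 1, where integration for NSEx does not change the RMSE significantly.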

3.2. Integration for NSEv
Having observed an increase in the qualitative value of models for NSEx using all data, we then included the NSEv mechanism (SC+NSEx.ALL+NSEv.ALL) in the algorithm, again using all data available to evaluate model performance (i.e., not based on reproduction of SC expression patterns alone). The aim was to decrease quantitative simulation error and provide further improvement in terms of the identification of transcriptional interactions. Figure 2 shows the simulation performance for the models obtained, compared to the SC and SC+NSEx versions of the algorithm. Simulation performance improves markedly when all data are included in the evaluation, which provides some support for our hypothesis that extending the fitness landscape with topological data can improve simulation performance.
Figure 2  Algorithm enhanced with NSEx and NSEv: quantitative results compared to SC+NSEx and SC only. The variants are: SC (time series only, without integration of additional data), SC+NSEx.ALL (using all data for NSEx), SC+NSEx.ALL+NSEv.ALL (using all data for both NSEx and NSEv) and SC+NSEx.ALL+NSEv.BSA (using all data for NSEx, but BSA only for NSEv). RMSE values show improvement compared to the previous integration strategy; small differences between NSEv.ALL and NSEv.BSA are observed. This suggests that including all data in NSEx scoping with BSA data for refinement in NSEv is optimal.
Table 4 shows the quality of interactions obtained after integration. In contrast to the error improvement, this appears to decrease compared to the SC+NSEx case, which is surprising considering that the quantitative behaviour improves. This could be due to the fact that the models contain indirect interactions (which might include the PPIs mentioned earlier), enabling good simulation of gene expression levels. However, we are interested in uncovering direct transcriptional interactions. The presence of indirect interactions may be more prominent for certain data types, such as correlation patterns (CORR), KO experiments and GO annotations. While these were filtered out in the case of NSEx (a weak integration criterion), they were forcibly included for the more stringent integration criterion of the NSEv mechanism. Hence, more accurate interactions and maintenance of good simulation performance might be obtained through a hybrid approach using all data for NSEx (i.e., the landscaping step) and only BSA data for NSEv (SC+NSEx.ALL+NSEv.BSA). This refinement is suggested by the fact that BSA data usually indicate direct interactions (the ability of the protein transcription factor to physically bind to the target gene).
We tested this hypothesis and indeed found that the best compromise for qualitative and quantitative performance is obtained by using this integration approach, as Table 4 and Figure 2 show.
Table 4  Algorithm enhanced with NSEx and NSEv: qualitative results compared to SC+NSEx and SC only. AUROC and AUPR values obtained after 10 runs with each algorithm are shown, together with standard deviations for subsets of 9 runs in parentheses (see Section 2.3). Variants are SC (time series only, without integration of additional data), SC+NSEx.ALL (using all data for NSEx), SC+NSEx.ALL+NSEv.ALL (using all data for both NSEx and NSEv) and SC+NSEx.ALL+NSEv.BSA (using all data for NSEx, but BSA only for NSEv). Integrating all data at the evaluation stage decreases the quality of interactions compared to those obtained with NSEx. Use of BSA alone for evaluation yields better results.
We also investigated the possibility of eliminating the CORR data from the NSEx mechanism and combining with NSEv.ALL or NSEv.BSA. This is because Table 3 suggests that CORR data are least important for the identification of correct transcriptional interactions. However, the resulting AUROC and AUPR values were not better than the results included in Table 4 for the variants employing CORR data. Specifically, the new AUROC/AUPR values were 0.663/0.054 for NSEv.ALL and 0.763/0.085 for NSEv.BSA. We can thus conclude that although CORR data by themselves do not appear to bring improvement, they are still useful to complement the other datasets. This behaviour was also observed for synthetic data in previous work [30].

3.3. Including All Time Series
The analysis presented in the previous section provides a strategy for data integration that optimises both qualitative and quantitative model behaviour. For this, the DC (dual channel) time series data were used only at the ‘model testing’ stage, to enable quantitative evaluation. However, once the integration strategy is chosen, models can be further refined by integrating both SC and DC datasets in the reverse engineering procedure, which can also reduce noise overfitting [28].
Finally, therefore, we used the best-performing algorithm variant (NSEx.ALL+NSEv.BSA) to integrate the two microarray datasets (SC and DC) available for Drosophila melanogaster embryo development. The aim was to obtain better prediction of transcriptional interactions between genes. Figure 3 graphically displays AUROC and AUPR obtained by time-series integration, compared to those found using only the SC dataset for training, with exact values shown in Table 5. The increase in AUROC and AUPR values suggests that combining the two datasets, rather than using only one, improves the prediction of Drosophila melanogaster gene interactions. Equally, applying the same integration strategy as before maintained model simulation performance.
Figure 3  Combining the two time course datasets, SC and DC. AUROC and AUPR values (and standard deviations for subsets of models) for gene connections obtained through integration scheme NSEx.ALL+NSEv.BSA are displayed (see Section 2.3). Shown are the SC dataset alone, SC integration (SC+NSEx.ALL+NSEv.BSA) and SC+DC integration (SC+DC+NSEx.ALL+NSEv.BSA). Overall improvement is ∼20% with the combined data and integration scheme specified.
Table 5  Combining the two time course datasets, SC and DC. AUROC and AUPR values (and standard deviations) for gene connections obtained through integration scheme NSEx.ALL+NSEv.BSA are displayed. Shown are the SC dataset alone, SC integration (SC+NSEx.ALL+NSEv.BSA) and SC+DC integration (SC+DC+NSEx.ALL+NSEv.BSA).
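The standard deviations quoted with the tables in this section are taken over subsets of nine of the ten runs. One plausible reading of that procedure, sketched below, is to leave out each run in turn, recompute the measure of interest on the remaining runs, and report the spread of the resulting values. This is an assumption made for illustration: the helper name `leave_one_out_std`, the generic `metric` callback and the use of the sample (n - 1) standard deviation are all hypothetical, and Section 2.3 remains the authoritative description.

```python
from statistics import stdev
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def leave_one_out_std(
    runs: Sequence[T],
    metric: Callable[[List[T]], float],
) -> float:
    """Spread of a metric over all subsets that omit exactly one run.

    runs   : per-run results (e.g. one inferred network per run).
    metric : maps a list of runs to a single value (e.g. the AUROC of
             the interaction set derived from those runs).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs")
    values = []
    for i in range(len(runs)):
        subset = list(runs[:i]) + list(runs[i + 1:])  # drop run i
        values.append(metric(subset))
    return stdev(values)  # sample standard deviation (assumption)


# Toy usage with hypothetical per-run scores and the mean as the metric.
per_run_scores = [0.71, 0.74, 0.69, 0.73, 0.72, 0.70, 0.75, 0.72, 0.71, 0.73]
spread = leave_one_out_std(per_run_scores, lambda s: sum(s) / len(s))
```

Because each subset shares most of its runs with every other subset, the values are strongly correlated and the resulting spread is small by construction, which should be kept in mind when reading the low variability reported here.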

4.