PMC:5369021 / 39204-60375

2_test

{"project":"2_test","denotations":[{"id":"28347313-24071849-14906490","span":{"begin":1775,"end":1777},"obj":"24071849"},{"id":"28347313-25340776-14906491","span":{"begin":5039,"end":5041},"obj":"25340776"},{"id":"28347313-26112290-14906492","span":{"begin":5453,"end":5455},"obj":"26112290"},{"id":"28347313-26112290-14906493","span":{"begin":7615,"end":7617},"obj":"26112290"},{"id":"28347313-24071849-14906494","span":{"begin":8298,"end":8300},"obj":"24071849"},{"id":"28347313-26687719-14906495","span":{"begin":8970,"end":8972},"obj":"26687719"},{"id":"28347313-19617889-14906496","span":{"begin":9692,"end":9694},"obj":"19617889"},{"id":"28347313-25433699-14906497","span":{"begin":10538,"end":10540},"obj":"25433699"},{"id":"28347313-8594589-14906498","span":{"begin":10658,"end":10660},"obj":"8594589"},{"id":"28347313-16723010-14906499","span":{"begin":11162,"end":11164},"obj":"16723010"},{"id":"28347313-23867710-14906500","span":{"begin":14908,"end":14910},"obj":"23867710"},{"id":"28347313-18954561-14906501","span":{"begin":15123,"end":15125},"obj":"18954561"},{"id":"28347313-17075975-14906502","span":{"begin":15127,"end":15129},"obj":"17075975"},{"id":"28347313-17301269-14906503","span":{"begin":15337,"end":15339},"obj":"17301269"},{"id":"28347313-16973716-14906504","span":{"begin":15341,"end":15343},"obj":"16973716"},{"id":"28347313-24763074-14906505","span":{"begin":15452,"end":15454},"obj":"24763074"},{"id":"28347313-22447046-14906506","span":{"begin":17232,"end":17234},"obj":"22447046"},{"id":"28347313-24419005-14906507","span":{"begin":17325,"end":17327},"obj":"24419005"},{"id":"28347313-14760085-14906508","span":{"begin":17427,"end":17429},"obj":"14760085"},{"id":"28347313-16391877-14906509","span":{"begin":17580,"end":17582},"obj":"16391877"}],"text":"Results\nWe described five recent methods for the genome-wide inference of regulatory activity, namely the approach by Schacht et al., RACER, RABIT, ISMARA, and biRte. 
They all assume the topology of the regulatory network to be known, cast activity estimation as an optimization problem regarding the difference between predicted and measured values, take different types of sample specific omics data into account, and eventually produce a list of regulators like transcription factors or miRNAs, ranked by their estimated activities in the samples under study. We also included ARACNE which is background knowledge-free and uses only local dependency measures to reconstruct a regulatory network and indirectly infer activities. All of the presented methods essentially follow the same goal, i.e., accurate ranking of regulatory activity, but differ in the types of measurements being integrated, the background knowledge necessary for their application, the complexity and refinement of the underlying model of gene regulation, and the concrete paradigm used for solving the optimization problem. Most of the methods, except for the approach by Schacht et al., are available online via a downloadable implementation, a web service, or an R package providing an operable solution for the interested user. Whereas an overview of the main features of each method is provided in Table 1, we now first compare the algorithms regarding their general properties in a descriptive way.\nThe data sets used for evaluation vary between all methods. Therefore, we further implemented an evaluation framework to compare the method by Schacht et al., RACER, RABIT and biRte in an objective and quantitative way. We used experimental data of three publicly available data sets from TCGA [64] and a regulatory network as background knowledge. We first used only mRNA expression data as input to the four methods to ensure the results’ comparability, whereas in a second evaluation step, also other omics data sets were included where possible. 
We further analyzed the relevance of regulators found by different methods using a literature search.\n\nGeneral properties\n\nExperimental data types included\nThe methods differ in the types of measurements being integrated, which corresponds to the level of detail of their model of gene regulations. All six methods use mRNA as input. RACER, RABIT and biRte can also integrate CNV, DNA methylation, TF/miRNA expression data, or somatic mutations. ISMARA calculates an input signal from microarray, RNA-seq, or ChIP-seq data.\nAdditionally, all presented methods use prior knowledge about the underlying regulatory network. These networks are extracted from different data sources and pre-processed in different manners. All methods require at least knowledge about TF – gene relationships, yet RACER, biRte and ISMARA also incorporate information about miRNAs. When using RABIT, the user can choose whether to provide TF or RNA binding protein information. The approach of Schacht et al. and biRte extract regulatory information partly from the commercial MetaCore™ database, whereas the other methods use only publicly available databases, like ENCODE, JASPAR or TRANSFAC. The networks which are used for the evaluations published in the respective papers are publicly available for the case of RACER (network for 16653 genes, 97 TFs and 470 miRNAs), RABIT (predicted binding scores of 63 RBP motifs and 17463 genes) and biRte (network for E.coli including 160 TFs). Neither Schacht et al. nor ISMARA make this data available.\n\nMathematical models of regulatory activity\nThe methods use different mathematical models to infer regulatory activity. The approach by Schacht et al., RACER, RABIT and ISMARA use linear regression whereas biRte applies a probabilistic framework. ARACNE, as a local method, is based on mutual information. RACER and RABIT can be seen as extensions of the approach by Schacht et al. 
since they essentially use the same model structure but incorporate more input data types and more classes of regulatory information. Further, RACER applies a two-stage regression to infer regulatory activity.\n\nOptimization frameworks\nFor assessing regulator activities, Schacht et al., RACER, RABIT and ISMARA minimize the sum of error terms between measured and predicted gene expression. However, the methods use rather different algorithms for solving the resulting optimization problem, and also apply different constraints to achieve model sparsity, robustness of inference, and feature selection. In the approach by Schacht et al., the regression model is computed for each gene separately and allows only a maximum number of six regulating TFs. RACER uses a LASSO approach, while ISMARA follows a Bayesian model that infers regulator activities as posterior distributions. LASSO can be interpreted as a Bayesian model using Laplacian priors instead of Gaussian priors in the regression framework obtaining point estimates of the regulatory activities and enforcing sparseness of the solution [32]. In contrast, biRte uses a likelihood model with a spike and slab prior to induce model sparsity. This approach implements a selective shrinkage of model coefficients such that estimates are less biased compared to a LASSO prior [65]. With the help of the spike and slab prior, sparsity can be controlled in a variable dependent manner allowing the inclusion of prior belief in the activity of each regulator [35].\n\nComputed outputs\nSchacht et al. and biRte determine activity of regulators over all samples at once, whereas RACER and RABIT first infer sample-specific activities which are combined to cross-tumor activities only in a second optimization step. In contrast, ISMARA initially infers motif-level activity; these activities are used to deduce the effects of TFs and miRNAs by their motif binding profiles. 
ISMARA primarily provides sample specific TF and miRNA activity but also offers an option to group samples and compare average regulatory activity between different conditions. Like biRte and ARACNE, it also infers the network of the regulators themselves.\n\nMethods and data sets used for evaluation\nThe type and extent of evaluation performed for the different methods vary greatly. They range from direct application to biological problems over the comparison of results to the biological literature to simulation studies. All methods published evaluation results on publicly available datasets, e.g., from the National Cancer Institute, TCGA or GEO, but unfortunately address different tissues and cancer types. Sample-based cross-validation is applied in the work by Schacht et al., RACER, RABIT and ISMARA. The first two of these methods use correlation coefficients between measured and predicted gene expression for assessing prediction quality. RACER, RABIT and biRte compare their results to the outcome of other algorithms and to those of restricted models, for example excluding one type of the input variables. All methods search the literature to compare their predictions to previously published studies on the respective biological question. Overall, ISMARA provides the most extensive biological evaluation using a battery of relevant use cases, whereas biRte excels in systematic simulation studies. 
Sadly, there are very few works which compare any of the methods presented on the same problem; the only result we are aware of compared ARACNE and biRte regarding their performance in network reconstruction on simulated data, in which biRte attained higher robustness against false positive and false negative target gene predictions [35].\n\nQuantitative comparison\nAlthough certain evaluation steps were carried out for all methods, results in the original papers are not comparable as they used different input datasets, different background regulatory networks, and different evaluation metrics. Therefore, in addition to the comparison of general properties of the methods, we implemented an evaluation framework using three independent and publicly available test data sets to compare the method by Schacht et al., RACER, RABIT and biRte in an objective and quantitative way. All evaluated methods were given the same regulatory network as input.\n\nData sets\nFor the evaluation we used experimental data from TCGA [64] for three cancer types: Colon adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC) and pancreatic adenocarcinoma (PAAD). For all three cancer types, mRNA expression, CNV, DNA methylation and miRNA expression data is available for primary tumor and normal tissue samples. These data sets are openly accessible via the NCI Genomic Data Commons Data Portal3 or the NCI Genomic Data Commons Legacy Archive4 (DNA methylation data).\nFor mRNA gene expression we used processed RNA-seq data in the form of FPKM (fragment per kilobase of exon per million mapped reads) values. The files included Ensembl Gene IDs which were converted to HGNC symbols using the Ensembl [66] BioMart tool5 to match the IDs of the TF – gene network. In two cases, when multiple Ensembl Gene IDs mapped to one HGNC symbol, we chose the gene with highest log2 fold change between case and control group. miRNA expression was given as RPM (reads per million miRNA mapped) measurements. 
Both mRNA and miRNA data were centered using a weighted mean such that the mean of the case group equaled the negative mean of the control group, and normalized via a weighted standard deviation. CNV data was retrieved as masked copy number segment where the Y chromosome and probe sets with frequent germline copy-number variation had already been removed. Chromosomal regions were mapped to genes using the R package biomaRt [67]. If multiple records mapped to one gene, the median of the segment mean values was calculated. For DNA Methylation data we used the beta-values of Illumina Human Methylation 450 arrays as methylation scores. Multiple scores for the same gene were averaged within a sample.\nWe restricted our analyses to the samples for which all four input data types were available. When multiple measurements for one sample and data type were available, we used only the first one in alphabetical order of the file name. After this selection procedure, 165 samples remained for COAD, 404 for LIHC and 180 for PAAD. A list including sample and file information is available in Additional file 1.\nTogether with the experimental data, all evaluated methods were given the same regulatory network as input. We used a publicly available human TF – gene network [28] based on a text-mining approach and complemented it with TF – gene interactions from the public TRANSFAC6 database [19]. This network included 2894 interactions between 429 TFs and 1218 genes. The network is provided in Additional file 2.\n\nEvaluated methods\nWe conducted the quantitative comparison for the method proposed by Schacht et al., RACER, RABIT and biRte. ISMARA was not included since it is (a) only available as a web service, (b) can only be used with its own, proprietary underlying regulatory network model, and (c) requires the upload of raw data which is prohibited by TCGA’s terms of use. 
Also ARACNE [30] was not included in the quantitative evaluation since it does not use background knowledge and we therefore consider its results as incomparable to the other methods. For the approach by Schacht et al. we re-implemented their method as closely as possible to the original design using Python and the Cuneiform workflow language [68, 69]. Due to the high number of integer parameters in the original method, the complexity of optimizing the whole network at once would have by far exceeded computational measures. Therefore, as in the original paper, we computed the model for each gene separately and restricted the number of regulating TFs per gene to six. We added a second step where we used these TF – gene interactions building a sub-network to optimize TF activity globally to describe the interplay of the TFs’ effects on their target genes. As in the implementation of Schacht et al., we used the Gurobi Optimizer.7 For RACER we used the available R scripts8 and extracted the resulting sample-specific regulatory activities. RABIT published a C++ implementation which they provide on their website9 and which we used with the FDR option set to 1. As RABIT takes differential expression into account, we used the difference of expression values between case and control group as input and ordered the TFs by t-value as proposed in the RABIT paper. BiRte is available as a Bioconductor R package. We used R version 3.3.2 with biRte version 1.10.0 and applied the method “birteLimma” to estimate regulatory activities with the options niter and nburnin set to 10000. As biRte has a randomized component, the resulting TF activities are not exactly the same for different runs. We averaged the final activity scores over 1000 iterations of birteLimma.\nFor our re-implemented method by Schacht et al. 
and RACER we computed separate models for case and control group and ranked the TFs by their activity difference between the two groups.\nTo ensure the result’s comparability, we first used only mRNA expression data as input to the four methods. In a second evaluation, we included also other omics data sets where possible. BiRte was evaluated on mRNA and CNV data, RABIT on mRNA, CNV and DNA methylation data, and RACER additionally used miRNA expression as input. We obtained lists with the regulators ranked according to the absolute value of their computed activity for each cancer type and method, with and without the use of additional inputs. For each cancer type we calculated the size of the overlaps in the four different results using the top 10 and top 100 regulators. The results for the top 10 regulators using either only mRNA or multiple omics data sets as input are shown in Table 2.\nTable 2 HGNC Symbols of the top 10 regulators found by each method for COAD (using 165 samples), LIHC (404 samples) and PAAD (180 samples) and the use of only mRNA data as input (left panel) and multiple input data sets (RACER: mRNA, miRNA, CNV and DNA methylation; RABIT: mRNA, CNV and DNA methylation; biRte: mRNA and CNV; right panel). TFs with equal activity values are marked with*. TFs found by several method’s top 10 are marked in bold (when found by RACER, RABIT and biRte), blue (RACER and RABIT), red (RABIT and biRte) or yellow (RACER and biRte)\n\nOnly mRNA as input\nWhen only mRNA is used as input, one TF is commonly found by the three methods RACER, RABIT and biRte in each data set, respectively: PHOX2B for COAD, EPAS1 for LIHC and ELF1 for PAAD. A literature search of these TFs and their targets revealed clear associations to the respective cancer type. The TF obtained commonly for COAD, PHOX2B, is related to TLX2, a gene which has been shown to play a role in the tumorigenesis of gastrointestinal stromal tumors [70]. 
EPAS1, which was found in the LIHC top 10 TFs of three methods, is linked to CXCL12, which plays an important role in metastasis formation of hepatocellular carcinoma by promoting the migration of tumor cells [71, 72]. For PAAD, three methods ranked TF ELF1 high, which is related to 14 genes in our network, inter alia to BRCA2 and LYN. Mutations in the BRCA2 gene have been implicated in pancreatic cancer susceptibility [73, 74], whereas the knockdown of LYN reduced human pancreatic cancer cell proliferation, migration, and invasion [75]. These results underline that the methods are able to find biologically relevant information about regulation processes in cancer.\nSeveral TFs in the top 10 are found by two of four methods. For instance, RACER and RABIT have four common top 10 TFs (CDX2, NRF1 and MYC next to PHOX2B) in the COAD data set. However, the top 10 TFs found by the method by Schacht et al. do not overlap with any top 10 TFs of the other methods in any data set. The agreement of RACER, RABIT and biRte in the top 10 TFs points to the biological importance of the found TFs since this overlap is statistically significant as the probability of finding common TFs in three sets of ten randomly chosen ones out of 429 TFs (p-value) is below 0.006. Additionally, the methods do identify different TFs for different data sets, indicating the importance of the actual cancer specific mRNA expression values and that results are not dictated by the background network.\nThe results for the number of overlapping regulators in the top 100 between the four methods and the three different data sets are shown in Fig. 9. For RABIT, only 76 TFs for COAD (resp. 67 for LIHC and 57 for PAAD) could be ranked since all other TFs had an activity value equal to zero.\nFig. 
9 Number of overlapping TFs in the top 100 of ranked TFs per method (for RABIT the overlap with the top 76/67/57 TFs (having activity \u003e 0) in COAD/LIHC/PAAD is shown) \nWhen looking at the overlap of three of the four methods, the number of overlapping TFs is still the highest for the triplet RACER, RABIT and biRte. For the LIHC dataset two TFs are found in the top 100 of all four methods (E2F4 and SOX10). E2F4 is a downstream target of ZBTB7, which was associated to the expression of cell cycle-associated genes in liver cancer cells [76]. Two target genes of E2F4, CDK1 and TP73 were also involved in liver cancer development [77] and proposed as prognostic marker of poor patient survival prognosis in hepatocellular carcinoma [78]. Further, epigenetic alterations of the EDNRB gene, a target of SOX10, might play an important role in the pathogenesis of hepatocellular carcinoma [79]. Even if the result of four methods finding two common TFs is not statistically significant (p-value = 0.36), their association to liver hepatocellular carcinoma shows that the methods reach their goal of identifying relevant TFs.\nHowever, when comparing different data sets, the methods tend to rank the same TFs under the top 100 to a greater or lesser extent. For example, the overlap of all top 100 TFs of the three cancer types is only one TF for RABIT and nine TFs for biRte, but 16 TFs for the method by Schacht et al. and even 32 TFs for RACER. Therefore, the results from RABIT and biRte seem to be more cancer type specific and less dependent on the regulatory network than the results from RACER. 
However, we did not specifically investigate the influence of the underlying network and its topology on the results, which would be an interesting point for further research.\n\nMulti-omics data as input\nWhen miRNA, CNV and DNA methylation data are taken into account in addition to mRNA, the results are more difficult to compare between the methods, since each method combines the data types differently according to its model and implementation.\nWe are aware that the multi-omics results are less comparable than they would be in a scenario where all methods are evaluated on the same set of input data. However, we intended to use the maximum set of input data for each method to capture the effect of using multiple omics data sets compared to only mRNA as input.\nBiRte was evaluated on mRNA and CNV data, RABIT on mRNA, CNV and DNA methylation data, and RACER additionally used miRNA expression as input. Whereas RACER and RABIT considered CNV or DNA methylation data as one background factor and computed only one activity value, biRte evaluated the influence of each CNV separately.\nThe results (see Table 2, right panel) show that RACER exclusively ranks miRNAs high; not a single TF is found among the top 10 regulators. Also, the influence of CNVs was high in LIHC and PAAD. However, the TFs that RACER found in the top 10 when using only mRNA data as input are still ranked high in the multi-omics scenario, e.g., the COAD top three TFs of the mRNA results are ranked 13th, 16th and 14th in the results of the multi-omics input. The results of the two input types differ less for RABIT: seven TFs are still in the top 10 for COAD (8 for LIHC and 6 for PAAD) when using CNV and DNA methylation in addition to mRNA data. Therefore, the contribution of additional input data does not seem to be crucial for the performance of RABIT. 
BiRte considers each CNV as a potential regulator, which increases the total number of regulators enormously. Still, two TFs in the top 10 of the COAD data set (six for LIHC and one for PAAD) are found with both the mRNA-only input and the multi-omics approach.\nThe overlap of the top 10 of RABIT and biRte in the multi-omics case is considerable, with three TFs in LIHC (HNF4A, EGR1 and MTF1; p-value = 0.001) and one TF in PAAD (SPI1; p-value = 0.21). Three of them (HNF4A, MTF1 and SPI1) were already found when using only mRNA data as input.\nThe results for the different input data sets show that the top-ranked regulators change drastically when miRNA data are additionally used in RACER, but change less when only CNV or DNA methylation data are provided in RABIT and biRte. However, the results from the multi-omics analyses are difficult to compare since the combination of input data sets is not consistent across the three methods.\n\n"}
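The overlap p-values quoted in the text (below 0.006 for a TF shared by three top 10 lists; 0.001 and 0.21 for the RABIT and biRte top 10 overlaps) can be approximated with a simple random-draw model. The text does not state which test was used, so the following is only a sketch under assumptions: every ranked list is treated as a uniformly random sample without replacement from the same pool of 429 TFs, and all function and variable names are illustrative.

```python
from math import comb


def hypergeom_pmf(k, pool, marked, drawn):
    """P(exactly k marked items) when `drawn` items are sampled without
    replacement from `pool` items, `marked` of which are marked."""
    if k < 0 or k > min(marked, drawn):
        return 0.0
    return comb(marked, k) * comb(pool - marked, drawn - k) / comb(pool, drawn)


def p_overlap_two(k_obs, pool, size_a, size_b):
    """P(two random lists of size_a and size_b share at least k_obs items)."""
    upper = min(size_a, size_b)
    return sum(hypergeom_pmf(k, pool, size_a, size_b)
               for k in range(k_obs, upper + 1))


def p_common_three(pool, size):
    """P(at least one item occurs in all three random lists of `size` items)."""
    total = 0.0
    for k in range(1, size + 1):
        p_k = hypergeom_pmf(k, pool, size, size)         # lists 1 and 2 share exactly k
        p_miss = comb(pool - k, size) / comb(pool, size)  # list 3 avoids all k of them
        total += p_k * (1.0 - p_miss)
    return total


N_TFS = 429  # number of TFs in the shared background network (from the text)

p_three = p_common_three(N_TFS, 10)       # one TF common to three top 10 lists
p_lihc = p_overlap_two(3, N_TFS, 10, 10)  # three shared TFs in two top 10 lists
p_paad = p_overlap_two(1, N_TFS, 10, 10)  # one shared TF in two top 10 lists

print(f"three lists, >=1 common TF: {p_three:.4f}")
print(f"two lists,   >=3 common TFs: {p_lihc:.4f}")
print(f"two lists,   >=1 common TF:  {p_paad:.4f}")
```

Under these assumptions the three values land close to the figures quoted in the text. For the multi-omics case the pool size is itself an assumption, because biRte also ranks CNVs as regulators, so the pool of candidates is larger than the 429 TFs used here.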