PMC:4573573 / 3313-18603
Annnotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/4573573","sourcedb":"PMC","sourceid":"4573573","source_url":"https://www.ncbi.nlm.nih.gov/pmc/4573573","text":"2. Method\nThe proposed analysis pipeline has two sequential stages: (1) Meta-analysis for detection of highly differentially expressed genes and (2) finding the most correlated path. These are explained next.\n\n2.1. Stage 1: Meta-Analysis for Detection of Highly Differentially Expressed Genes\nMeta-analysis involves the joint study of multiple databases to obtain conclusions that apply across all of them. Meta-analysis can help detect potential genetic cancer biomarkers through the study of microarray databases. However, to this end, a series of difficulties are apparent: (a) Microarray experiments that are publicly available use different technologies, platforms and, most of the times, different scales. Incommensurability renders many meta-analyses efforts unfeasible [1] due to the inability to make comparisons across all experiments of interest. Even when the same units are used, often time, data normalization is required for comparability. (b) There is not an efficient, systematic method to carry out meta-analysis. Most of the studies analyze one particular database and try to generalize the results to other databases or analyze several databases separately and try to make sense of all the independent results [2]. (c) The issue of having a large number of measurements and genes generally results in large number of significant genes that must be validated [3]. (d) Meta-analysis of microarrays—and of high throughput biological experiments in general—is a laborious process that is often outpaced by the development of technology to generate ever-larger data sets. That is, data generation capabilities are larger and grow faster than our abilities to make sense and translate these data into usable knowledge. (e) Large repositories of public data generated through costly microarray experiments could go underanalyzed and underutilized in the fight for cancer when the researchers’ attention shifts to the next high-throughput technology. The problem of making sense of large quantities of data, however, will persist.\n\n2.2. Multiple Criteria Optimization\nMultiple Criteria Optimization (MCO) is a field from Engineering Mathematics that deals with making decisions in the presence of multiple performance measures in conflict, i.e., decisions where optimizing one criterion results in moving away from optimality in at least another criterion. Because of the presence of conflict, an MCO problem does not find a single best solution but rather a set of best compromising solutions in light of the performance measures under analysis. The best compromises define solutions called Pareto-Efficient (or simply Efficient, for short) that define the Efficient Frontier of the MCO problem at hand. A typical multiple criteria optimization with two conflicting performance measures (objectives), PMs, can be visualized as in Figure 1. In this figure, a set of seven candidate points, characterized by their values on both performance measures, are shown. The performance measure represented in the x-axis is to be maximized while the performance measure in the y-axis is to be minimized in this example. The problem is to find those candidate points that dominate all of the other points in both performance measures. In the face of conflict, this will result in a group of candidates in the southeast extreme of the set in Figure 1, solutions 3 and 5. These are Pareto-efficient solutions and, when all of them are accounted for, they integrate the Efficient Frontier of the MCO problem. In this example, it can be noted that among efficient solutions, an improvement in one performance measure can only come strictly at the detriment of another one: moving from solution 5 to solution 3 will result in an improvement in the performance measure associated to the vertical direction, but in a loss in the performance measure associated to the horizontal direction. Note that the general problem involves at least two performance measures to be optimized, where only the case with two performance measures has a convenient graphical representation. An MCO problem, however, can include as many dimensions (or performance measures) as necessary.\nThe general mathematical formulation of an unconstrained MCO problem is as follows: (1) Find x toMinimize fj(x) j=1,2,…,J\nThe MCO problem in (1) can be discretized onto a set K with |K| points in the space of the decision variables so as to define particular solutions xk, (k=1,2,…, |K|) which can, in turn, be evaluated in the J performance measures to result in values fj(xk). That is, the kth combination of values for the decision variables evaluated in the jth objective function. The illustrative example in Figure 1 follows this discretization with J=2 performance measures and |K|=7 solutions.\nThe MCO formulation under such discretization is, then as follows: (2) Find xk (k∈K) toMinimize fj(xk) j=1,2,…,J\nThe solutions to (2) are, then, the Pareto-efficient solutions of the discretized MCO problem. Considering formulation (2), a particular combination x0 with evaluations fj(x0) will yield a Pareto-Efficient solution to (2) if and only if no other solution xψ exists that meets two conditions, from this point on called Pareto-optimality conditions: (Condition 1) fj(xψ)≤fj(x0) ∀j (Condition 2) fj(xψ)\u003cfj(x0) in at least one j\nConditions (1) and (2) imply that no other solution xψ dominates the solution under evaluation, x0, in all performance measures simultaneously.\nFigure 1 Representation of a multiple criteria optimization problem with two performance measures. In previous publications [3,4] our group has demonstrated that if a set of candidate solutions evaluated by multiple performance measures is available, it is possible to determine a series of best compromises between all criteria through a technique called Data Envelopment Analysis (DEA). The idea behind DEA is to use an optimization model to compute a relative efficiency score for each particular solution with respect to the rest of the candidate solutions. The resulting best compromises, identified through their efficiency score, form the envelope of the solution set, therefore the name Data Envelopment Analysis. These solutions are indeed Pareto-efficient solutions of the problem under analysis.\nThe DEA linear programming formulations proposed by Banks, Charnes, and Cooper [5] are shown below: (3) Find μ, ν, μ0+,μ0− toMaximize μTY0max+ μ0+−μ0−Subject to νTY0min=1μTYjmax− νTYjmin+μ0+−μ0−≤0 j=1,…,n μT≥ε·1 νT≥ε·1 μ0+,μ0− ≥0 (4) Find μ, ν,ν0+−ν0− toMinimize νTY0max+ ν0+−ν0−Subject to μTY0max=1 νTYjmin−μTYjmax+ν0+−ν0−≥0 j=1,…,n νT≥ε·1 μT≥ε·1 ν0+,ν0− ≥0 where μ and ν are column vectors containing multipliers to be optimally determined together with scalar variables μ0+ and μ0− in the first case and together with ν0+ and ν0− in the second case; Yjmin and Yjmax are column vectors containing the values of the jth combination of performance measures to be minimized and maximized respectively; and ε is a scalar usually set to a value of 1 × 10−6.\nModel (3) is called the BCC Input Oriented Model and Model (4) is called the BCC Output Oriented Model. Both models are applied to each of the n candidate solutions. A particular solution with an objective function score of 1 (i.e., an efficiency score of 1) using both formulations is in the envelope of the set and is considered to be an efficient solution to the MCO problem. The BCC model is just one of many possible DEA formulations, albeit a very powerful one. This model’s mathematical linear nature provides it with the capability of finding efficient solutions associated with the data set under analysis through a series of piece-wise linear segments. Nonlinear behavior is, then, approached with tractability and with the certainty that at least the efficient solutions lying in the convex part of the frontier are being found. Figure 2 shows an MCO problem solved through with DEA, specifically with the BCC model.\nDEA has several advantages including: (i) computational efficiency owing to its linear optimization structure; (ii) objectivity and consistency of results, which follows from not requiring the adjustment of parameters or assigning weights to the different performance measures by the user, and (iii) capability of analyzing several microarray experiments with incommensurate units. Appendix A discusses the volcano plot, a widely used tool to detect differentially expressed genes, to illustrate how the analyst can bias the results. On the other hand, one limitation of DEA is that of depending on a series of local linear approximations, as shown in Figure 2. Every time that a linear segment is superimposed over the set under analysis, there are genes lying in the nonconvex part of the set frontier that escape detection. These genes could be potential biomarkers, however. In order to circumvent the said disadvantage, the authors proposed that DEA be applied successively 10 times, each time removing the genes found in a particular iteration from the set for subsequent analyses. This strategy results in 10 frontiers, as seen in Figure 3. The number of efficient frontiers is, admittedly, an arbitrary number at this point, thus further refinement is necessary in this aspect.\nFigure 2 Multiple Criteria Optimization Problem solved using Data Envelopment Analysis (BCC model). The efficient solutions are identified through the use of piecewise-linear segments.\nFigure 3 A case with genes characterized by two performance measures. Referring to this figure, and following the proposed method, at this point it is recommended to identify the first 10 efficient frontiers. This can be easily done by identifying the genes in the first efficient frontier through DEA, then removing them from the set and continue with a second DEA iteration. At the end of Stage 1, the analyst is left with a set of differentially expressed genes that can be investigated to establish their role in the condition or illness under study, cancer in this case. This set of genes in the proposed method, however, will be used to determine how these are maximally correlated in Stage 2.\n\n2.3. Stage 2: Finding the Most Correlated Path\nIt is proposed that the most correlated path among the list of candidate genes identified in the previous stage can be found optimally. To this end, the optimization problem identified in the literature as the Travelling Salesman Problem (TSP), is introduced here as a viable model.\nThe TSP is generally stated as follows: a salesman needs to visit n cities and needs to minimize the travel distance starting and finishing in the city of origin. Each city must be visited only once. The solution, then, is a tour. In n cities, there is a total of n! tours. If a particular city of origin is selected a priori, then the number of tours is (n-1)!. In our case, the objective is to find the tour among n genes of interest that maximizes the sum of the absolute values of pairwise correlations. This tour would then be interpreted as a surrogate for a biological pathway, defined as “a series of actions among molecules in a cell” [6], and more specifically for a genetic signaling pathway. A biological pathway “can provide clues about what goes wrong when a disease strikes.” [6].\nAs a first approximation, it is proposed that the absolute values of linear correlation coefficients computed among a list of genes of potential biomarkers be used to construct networks such as the one presented in Figure 4, where the TSP can be readily applied. The idea of using a linear statistical correlation is, indeed, widely used in the literature as a means to uncover genetic coexpression. This information, in turn, should help cancer researchers in understanding the disease as well as look for targeted treatments. The paper by Kumari et al. [7] has studied different coexpression measurements, recommending to carry out a preliminary study to determine the most appropriate one for different objectives. It is, then, convenient at this point to resort to the use of the Pearson correlation coefficient as a starting point in this work.\nFigure 4 Representation of the many options for a cyclic path for 5 genes. The TSP can, indeed, be understood as an optimization problem. Consider that cij represents the cost of traveling from city i to city j and let yij be a binary variable, indicating whether or not the salesman travels from city i to city j. Additionally let us define flow variables xij on each arc (i,j) and assume that the salesman has n-1 units available at node 1, which is arbitrarily selected as a “source node”, and he must deliver 1 unit to each of the other nodes [7]. The optimization model is as follows: (5a) Minimize ∑(i,j)∈Acijyij (5b) ∑1≤j≤nyij=1 ∀i=1,2,…,n (5c) ∑1≤i≤nyij=1 ∀ j=1,2,…,n (5d) Nx=b (5e) xij≤(n−1)yij ∀(i,j)∈A (5f) xij≥0 ∀(i,j)∈A (5g) yij=0 or 1 ∀(i,j)∈A\nFollowing the description in [8], let A’ = {(i,j): yij =1} and let A’’ ={(i,j): xij \u003e0}. The constraints (5b) and (5c) imply that exactly one arc of A’ leaves and enters any node i; therefore, A’ is the union of node disjoint cycles containing all of the nodes of N. In general, any integer solution satisfying (5b) and (5c) will be a union of disjoint cycles; if any such solution contains more than once cycle; they are referred to as subtours, since they pass through only a subset of nodes.\nIn constraint (5d) N is an nxm matrix, called the node-arc incidence matrix of the minimum cost flow problem. Each column Nij in the matrix corresponds to the variable xij. The column Nij has a +1 in the ith row, a −1 in the jth row; the rest of its entries are zero. Constraint (5d) ensures that A” is connected since we need to send 1 unit of flow from node 1 to every other node via arcs in A”. The forcing constraints (5e) imply that A” is a subset A’. These conditions imply that the arc set A’ is connected and thus cannot contain subtours [8].\nThe TSP is known to be a hard problem to solve to optimality; however, with a manageable number of entities (nodes) optimality is well within reach. In our group’s experience it has been possible to obtain the optimal TSP tour with a list of up to 100 genes in less than 1 hour of computing time in a personal computer. The Branch and Bound—an exact algorithm—was used to this end, as coded in Matlab. An exact algorithm is defined as one capable to arrive to a global optimal solution—provided that one exists—with certainty. Although it is also possible to use heuristics to approach the TSP, it must be understood that a heuristic method by definition does not provide certainty on arriving to a global optimal solution.\nReferring back to Figure 4, it should be now apparent that in n genes associated to the nodes in the network, it is possible to obtain pairwise correlations to connect all genes among them resulting in a fully connected network. This network, in turn, can be mathematically translated into formulation (5a)–(5g) to identify the most correlated path. Thus, at the end of this stage, the most correlated path among all candidate genes from Stage 1, will be available as a proxy for a signalling path. The application of this two-stage analysis pipeline is demonstrated next in the context of cervix cancer.\n","divisions":[{"label":"Title","span":{"begin":0,"end":9}},{"label":"Section","span":{"begin":210,"end":2042}},{"label":"Title","span":{"begin":210,"end":292}},{"label":"Section","span":{"begin":2044,"end":10153}},{"label":"Title","span":{"begin":2044,"end":2079}},{"label":"Figure caption","span":{"begin":5456,"end":5557}},{"label":"Figure caption","span":{"begin":9265,"end":9452}},{"label":"Figure caption","span":{"begin":9451,"end":9830}},{"label":"Section","span":{"begin":10155,"end":15289}},{"label":"Title","span":{"begin":10155,"end":10201}},{"label":"Figure caption","span":{"begin":12131,"end":12208}}],"tracks":[{"project":"2_test","denotations":[{"id":"26388997-19516959-69475712","span":{"begin":778,"end":779},"obj":"19516959"},{"id":"26388997-19516959-69475712","span":{"begin":778,"end":779},"obj":"19516959"},{"id":"26388997-22212230-69475713","span":{"begin":1231,"end":1232},"obj":"22212230"},{"id":"26388997-22212230-69475713","span":{"begin":1231,"end":1232},"obj":"22212230"},{"id":"26388997-23634293-69475714","span":{"begin":1379,"end":1380},"obj":"23634293"},{"id":"26388997-23634293-69475714","span":{"begin":1379,"end":1380},"obj":"23634293"},{"id":"26388997-23634293-69475715","span":{"begin":5584,"end":5585},"obj":"23634293"},{"id":"26388997-23634293-69475715","span":{"begin":5584,"end":5585},"obj":"23634293"},{"id":"26388997-22783697-69475716","span":{"begin":5586,"end":5587},"obj":"22783697"},{"id":"26388997-22783697-69475716","span":{"begin":5586,"end":5587},"obj":"22783697"},{"id":"26388997-23226279-69475717","span":{"begin":11837,"end":11838},"obj":"23226279"},{"id":"26388997-23226279-69475717","span":{"begin":11837,"end":11838},"obj":"23226279"},{"id":"26388997-23226279-118161786","span":{"begin":12682,"end":12683},"obj":"23226279"},{"id":"26388997-23226279-118161786","span":{"begin":12682,"end":12683},"obj":"23226279"},{"id":"T89358","span":{"begin":778,"end":779},"obj":"19516959"},{"id":"T37634","span":{"begin":778,"end":779},"obj":"19516959"},{"id":"T98660","span":{"begin":1231,"end":1232},"obj":"22212230"},{"id":"T53795","span":{"begin":1231,"end":1232},"obj":"22212230"},{"id":"T42455","span":{"begin":1379,"end":1380},"obj":"23634293"},{"id":"T84798","span":{"begin":1379,"end":1380},"obj":"23634293"},{"id":"T63194","span":{"begin":5584,"end":5585},"obj":"23634293"},{"id":"T3395","span":{"begin":5584,"end":5585},"obj":"23634293"},{"id":"T42027","span":{"begin":5586,"end":5587},"obj":"22783697"},{"id":"T42178","span":{"begin":5586,"end":5587},"obj":"22783697"},{"id":"T14627","span":{"begin":11837,"end":11838},"obj":"23226279"},{"id":"T41227","span":{"begin":11837,"end":11838},"obj":"23226279"},{"id":"26388997-23226279-69475718","span":{"begin":12682,"end":12683},"obj":"23226279"},{"id":"26388997-23226279-69475718","span":{"begin":12682,"end":12683},"obj":"23226279"}],"attributes":[{"subj":"26388997-19516959-69475712","pred":"source","obj":"2_test"},{"subj":"26388997-19516959-69475712","pred":"source","obj":"2_test"},{"subj":"26388997-22212230-69475713","pred":"source","obj":"2_test"},{"subj":"26388997-22212230-69475713","pred":"source","obj":"2_test"},{"subj":"26388997-23634293-69475714","pred":"source","obj":"2_test"},{"subj":"26388997-23634293-69475714","pred":"source","obj":"2_test"},{"subj":"26388997-23634293-69475715","pred":"source","obj":"2_test"},{"subj":"26388997-23634293-69475715","pred":"source","obj":"2_test"},{"subj":"26388997-22783697-69475716","pred":"source","obj":"2_test"},{"subj":"26388997-22783697-69475716","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-69475717","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-69475717","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-118161786","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-118161786","pred":"source","obj":"2_test"},{"subj":"T89358","pred":"source","obj":"2_test"},{"subj":"T37634","pred":"source","obj":"2_test"},{"subj":"T98660","pred":"source","obj":"2_test"},{"subj":"T53795","pred":"source","obj":"2_test"},{"subj":"T42455","pred":"source","obj":"2_test"},{"subj":"T84798","pred":"source","obj":"2_test"},{"subj":"T63194","pred":"source","obj":"2_test"},{"subj":"T3395","pred":"source","obj":"2_test"},{"subj":"T42027","pred":"source","obj":"2_test"},{"subj":"T42178","pred":"source","obj":"2_test"},{"subj":"T14627","pred":"source","obj":"2_test"},{"subj":"T41227","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-69475718","pred":"source","obj":"2_test"},{"subj":"26388997-23226279-69475718","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#ec93cd","default":true}]}]}}