2.2. Meta-Analysis
In the meta-analysis approach, each experiment is first analyzed separately and the results of the individual studies are then combined. Meta-analysis methods that combine primary statistics (e.g., p-values or effect sizes) require the raw gene expression data, whereas methods that combine secondary statistics rely only on ranked lists of genes. Popular methods for meta-analysis mainly combine one of three types of statistics: p-values [46], effect sizes [47], and ranked gene lists (“rank aggregation”) [27,33,48]. Ranked lists of genes produced for each study (e.g., ranked by the p-value for DE of each gene) have been aggregated into a single “consensus” gene ranking using a number of methods, including the rank product method [48].
A number of methods have been developed to test the statistical significance of results by combining p-values from each study, including Fisher’s method, Stouffer’s method, minP, and maxP. Fisher’s method sums log-transformed p-values, whereas Stouffer’s method sums inverse-normal-transformed p-values, to combine statistical significance across studies. The minP method takes the minimum p-value across the combined studies, whereas the maxP method takes the maximum. Rhodes et al. [49] published one of the first papers to combine p-values from individual studies of DE gene expression using Fisher’s method, and found improved statistical significance in the combined analysis compared with the individual studies.
Combining effect sizes to generate an estimate of the overall effect size and its confidence interval is frequently used in meta-analysis of clinical research data. Choi et al. [47] described one of the first methods to combine effect sizes, using a random-effects model to combine datasets from individual two-group studies into an overall weighted estimate of the effect size.
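As a concrete illustration of random-effects combination of effect sizes for a single gene, the following Python sketch computes a standardized mean difference per study and pools the studies with inverse-variance weights. It assumes the DerSimonian-Laird moment estimator for the between-study variance and omits any small-sample bias correction, so it may differ in detail from the procedure of Choi et al. [47]. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def standardized_mean_difference(treated, control):
    """Return (d, var_d): standardized mean difference and its approximate variance."""
    n_t, n_c = len(treated), len(control)
    # Pooled variance of the two groups (sample variances, n - 1 denominators).
    pooled_var = ((n_t - 1) * variance(treated) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    d = (mean(treated) - mean(control)) / math.sqrt(pooled_var)
    # Large-sample approximation to the variance of d (no bias correction applied).
    var_d = (n_t + n_c) / (n_t * n_c) + d * d / (2 * (n_t + n_c))
    return d, var_d


def random_effects_summary(effects, variances, level=0.95):
    """Pool per-study effects; returns (pooled effect, confidence interval, tau^2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    # DerSimonian-Laird estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, (pooled - z * se, pooled + z * se), tau2


if __name__ == "__main__":
    # Hypothetical expression values for one gene in two small studies.
    studies = [
        ([7.1, 7.8, 8.2, 7.5], [6.0, 6.4, 6.9, 6.2]),
        ([5.9, 6.8, 6.1, 7.0], [5.5, 6.0, 5.2, 6.1]),
    ]
    per_study = [standardized_mean_difference(t, c) for t, c in studies]
    effects = [d for d, _ in per_study]
    variances = [v for _, v in per_study]
    print(random_effects_summary(effects, variances))
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and the two models give the same pooled estimate.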
The effect size was measured by the standardized mean difference, obtained by dividing the difference in average gene expression between the treatment and control groups by a pooled estimate of the standard deviation. The effect size measured the magnitude of the treatment effect in each study, and a random-effects model was used to incorporate inter-study variability.
Meta-analysis methods have been categorized based on the hypothesis settings that gene biomarkers are differentially expressed “in all studies” (HSA), “in the majority of studies” (HSr), or in “one or more studies” (HSB) [33,50]. With Fisher’s, Stouffer’s, and the minP methods, an extremely small p-value in a single study is likely to meet the criterion for statistical significance; these methods therefore detect genes DE in “one or more studies” (HSB), whereas the maxP and rank product methods tend to detect genes DE in “all studies” (HSA).
The choice of statistical meta-analysis method depends on the biological purpose of the analysis. A gene serving as a biomarker from a meta-analysis is expected to show concordant biological effects across all or most experiments for a given condition derived from relatively homogeneous sources (e.g., up-regulation of a gene that predicts the risk of lung cancer, measured in lung epithelium biopsied from a cohort of smokers versus healthy non-smokers) [51]. While detecting biomarkers DE in all studies seems an ideal goal, it can be too stringent when the number of samples is large, which increases heterogeneity in experimental conditions, platforms, or biological samples [50]. Meta-analysis methods that detect DE in the majority of studies (HSr) are generally recommended because they are robust and detect relevant signals across most of the studies [33]. Song and Tseng [52] proposed a robust order statistic, the rth ordered p-value (rOP), which tests the alternative hypothesis that there are significant p-values in at least a given percentage of studies.
This method detects biomarkers DE in the majority of studies (e.g., >70% of studies) based on a user-specified threshold for the proportion of studies.
2.2.1. Comparison of Meta-Analysis Methods
Several studies systematically comparing meta-analysis methods for microarray data have been published [33,53,54]. Chang et al. [33] benchmarked the performance of six p-value combination methods (Fisher, Stouffer, adaptively weighted Fisher, minP, maxP, and rOP), two combined effect size methods (fixed effects and random effects), and four combined rank methods (RankProd, RankSum, product of ranks, and sum of ranks). The 12 meta-analysis methods were categorized into three hypothesis settings (candidate markers DE in “all” [HSA], “most” [HSr], or “one or more” [HSB] studies) based on their strengths for detecting DE genes. They then applied four statistical criteria to assess each meta-analysis method: (1) detection capability (the number of DE genes detected); (2) biological association (the degree of association between the DE list and predefined genes from pathways related to the disease); (3) stability (randomly splitting the data and comparing the results of the two meta-analyses); and (4) robustness (the effect of including an outlying, irrelevant study in the meta-analysis). Among the methods based on the HSA setting, maxP performed the worst on these four criteria, and the investigators recommend that it be avoided. The rank product method performed better but had weaker detection capability. The two methods that tended to detect DE in the majority of studies were the random-effects model (REM) and the rth ordered p-value (rOP). rOP outperformed REM in biological association and detection capability, but at the expense of diminished stability and robustness.
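To make the differences between the hypothesis settings concrete, the p-value combination rules named above (Fisher, Stouffer, minP, maxP, and rOP) can be sketched for a single gene as follows. This is a minimal illustration using only the Python standard library, not the implementation benchmarked in [33]. It assumes independent studies, p-values strictly between 0 and 1, and null distributions obtained analytically rather than by permutation. The function names are hypothetical, and the adaptively weighted Fisher method is not covered.

```python
import math
from statistics import NormalDist


def fisher(pvalues):
    """Fisher: -2 * sum(ln p) is chi-squared with 2k degrees of freedom under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # half of the chi-squared statistic
    # Closed-form chi-squared survival function for an even number (2k) of degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def stouffer(pvalues):
    """Stouffer: sum of inverse-normal-transformed p-values, scaled by sqrt(k)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - nd.cdf(z)


def min_p(pvalues):
    """minP: the smallest of k independent uniform p-values follows Beta(1, k)."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


def max_p(pvalues):
    """maxP: the largest of k independent uniform p-values follows Beta(k, 1)."""
    return max(pvalues) ** len(pvalues)


def r_op(pvalues, r):
    """rOP: the r-th smallest of k uniform p-values follows Beta(r, k - r + 1)."""
    k = len(pvalues)
    x = sorted(pvalues)[r - 1]
    # The Beta(r, k - r + 1) CDF at x equals P(at least r of k uniforms are <= x).
    return sum(math.comb(k, j) * x**j * (1.0 - x) ** (k - j) for j in range(r, k + 1))


if __name__ == "__main__":
    # Hypothetical gene: strongly significant in one study only.
    pvals = [1e-6, 0.4, 0.6, 0.8]
    for name, value in [
        ("Fisher", fisher(pvals)),
        ("Stouffer", stouffer(pvals)),
        ("minP", min_p(pvals)),
        ("maxP", max_p(pvals)),
        ("rOP (r=3)", r_op(pvals, 3)),
    ]:
        print(f"{name}: {value:.3g}")
```

In this hypothetical example, a single extreme p-value is enough to make the Fisher and minP results small, whereas maxP is driven by the least significant study. With r = 1 the rOP rule reduces to minP, and with r = k it reduces to maxP, which is why intermediate values of r target genes DE in the majority of studies.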
It is important to note that differentially expressed genes identified by combining p-values or ranks obtained from two-sided hypothesis tests may include genes whose direction of DE is discordant across studies of a two-class outcome, which can be difficult to interpret [27]. Wang et al. [37] have proposed a one-sided correction of p-values to guarantee identification of DE genes with a concordant direction of DE.
2.2.2. Association of Meta-Analysis Method to Outcome Variable
The objective of the analysis and the type of outcome (e.g., two-class, multi-class, survival) [24] govern the choice of both the test statistic (t-statistic, F-statistic, log-rank statistic) and the meta-analysis method (combining p-values, effect sizes, or ranks). Methods that combine effect sizes (standardized mean differences or odds ratios) are appropriate for two-class outcomes. Meta-analysis of expression studies with continuous outcomes (e.g., using regression or correlation coefficients) and survival outcomes (based on log-rank statistics) has typically been performed by combining p-values [50,55] and can be performed using the MetaDE package [37]. To capture concordant expression patterns for multi-class outcomes, Lu et al. [52] have applied a multi-class correlation measure (min-MCC) because the F-statistic has been found to frequently fail to capture concordant patterns of gene expression.
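The concordance problem noted above for two-sided p-values can be illustrated with a short Python sketch. It shows one way to enforce a concordant direction and is not necessarily the one-sided correction of Wang et al. [37]. Each two-sided p-value is converted to a one-sided p-value using the sign of the study's effect, the one-sided p-values are combined with Fisher's method separately for up- and down-regulation, and the smaller result is doubled to account for testing both directions. The sketch assumes independent studies, a test statistic that is symmetric under the null, and p-values greater than 0. All function names are hypothetical.

```python
import math


def fisher(pvalues):
    """Fisher's combined p-value (chi-squared with 2k degrees of freedom)."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form chi-squared survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def one_sided(p_two_sided, effect_sign, direction):
    """One-sided p-value for `direction` (+1 up, -1 down) from a two-sided p-value."""
    half = p_two_sided / 2.0
    return half if effect_sign == direction else 1.0 - half


def concordant_fisher(two_sided_pvalues, effect_signs):
    """Combine evidence per direction, then correct for having tested two directions."""
    pairs = list(zip(two_sided_pvalues, effect_signs))
    up = fisher([one_sided(p, s, +1) for p, s in pairs])
    down = fisher([one_sided(p, s, -1) for p, s in pairs])
    return min(1.0, 2.0 * min(up, down))


if __name__ == "__main__":
    # Hypothetical gene: up-regulated in one study, down-regulated in the other.
    pvals, signs = [0.04, 0.04], [+1, -1]
    print("two-sided Fisher:", fisher(pvals))
    print("concordant Fisher:", concordant_fisher(pvals, signs))
    # Same p-values, but both studies agree on the direction.
    print("concordant Fisher, same direction:", concordant_fisher(pvals, [+1, +1]))
```

With the two-sided combination, the opposite-direction effects reinforce each other; with the direction-aware combination they do not, so only the gene with agreeing directions remains significant at the 0.05 level in this example.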