2. Integrative Transcriptomic Data Analysis Two fundamental approaches to combine the information of multiple independent microarray studies from different platforms (termed “integrative analysis” [23]) are meta-analysis and cross-platform normalization (also termed “merging”). A conceptual framework by Hamid et al. [22] classifies microarray meta-analysis as “late stage” data integration as it combines the final statistic results from different studies, whereas cross-platform normalization integrates data at the “early stage”. Application of these approaches necessitates that all of the included studies are testing the same hypothesis and/or performed under comparable conditions or treatments [2,30,31]. While the degree of similarity that is required between “suitably similar” datasets still remains to be determined, cross-platform integration for the purpose of biomarker discovery is most appropriate using relatively homogenous datasets selected to answer well-defined questions [32]. Early or late stage integration of data can be used regardless of the biological question (e.g., differential expression analysis or class prediction). The overall principle of these two approaches is summarized in Figure 1. Figure 1 Outline of two microarray integration methods: (a) meta-analysis (“late integration”). Individual case-cohort microarray studies are pre-processed and each study is used to identify ranked gene lists which are then combined in the final step; (b) Cross-platform merging and normalization (“early integration”). After pre-processing of individual studies, a single unified case-cohort dataset is generated (“clustered” into cases and cohorts, indicating removal of batch to batch variation) and in this example, used to identify a ranked gene list. 2.1. Pre-Processing and Quality Control Prior to Integrative Analysis Ramasamy et al. [24] identified key issues and steps for performing a meta-analysis including identifying suitable microarrays, pre-processing and preparing individual datasets, selection of meta-analysis method, and interpretation of results. A systematic review of microarray meta-analysis studies in the literature has found that the criteria to include or exclude microarray studies is mostly subjective and ad hoc and remains an open question in the field [27]. Two critical pre-processing steps we will highlight here are (i) removing arrays with poor quality and (ii) determining the relationships between probes and genes. Identifying microarrays of poor quality is essential prior to integrative analysis because inclusion of poor quality studies may reduce statistical power and adversely affect the outcome of meta-analysis [27,33]. There are a number of quality assessment packages available for Bioconductor, including Simpleaffy [34] and affyPLM [35] for Affymetrix. The MetaQC package provides six quality control measurements to identify problematic studies across multiple platforms for further assessment of causes of lower quality to determine their exclusion from meta-analysis [36,37]. Another important pre-processing step is ascertaining which probes represent a given gene within and across the different microarray platforms. The relationship between probes and genes may be determined by mapping probes to the gene using sequence-matched datasets or using gene-level identifiers such as Entrez Gene ID available in the annotations packages in R/Bioconductor [38] to unify the microarray datasets. Sources of high-quality probe re-annotation include alternative chip definition files (CDFs) for Affymetrix [39] and ReMOAT (Re-annotation and Mapping for Oligonucleotide Array Technologies) and its associated annotation packages in R/Bioconductor for Illumina [40]. Only genes that are present across the different platforms being integrated will remain for further analysis, while those absent in one or more platforms will be “lost”, reflecting the tradeoff between increasing sample size and power versus decreasing the number of genes analyzed [32]. Co-inertia analysis, a multivariate analysis method that describes the common trends or co-relationships between datasets of two conditions, has been applied to determine the loss of information incurred by reducing the number of genes to the subset common to different platforms [41]. Imputation of gene expression present in some datasets, but not others, to allow these genes to be part of predictive models has been proposed [42]. If multiple probes match a single gene, selecting the probe with the highest interquartile range (IQR) has been recommended [43]. Genes with low mean expression across most studies are typically filtered out prior to meta-analysis. Turnbull et al. [32] applied relatively strict filter thresholds for their microarray integration analysis based on a prior study that found genes with low or intermediate expression have poorer inter-platform reproducibility than highly-expressed genes [17,44]. Furthermore, incorporation of a quality measure based on detection p-values estimated from Affymetrix arrays into the study-specific test statistics within a meta-analysis of two Affymetrix array studies using an effect sized model produced more biologically meaningful results than an unweighted model [25,45]. 2.2. Meta-Analysis In the meta-analysis approach, each experiment is first analyzed separately and the results of each study are then combined. Meta-analysis methods that combine primary statistics (e.g., p-values or effect sizes) require the use of raw gene expression data whereas secondary statistics rely only on ranked lists of genes. Popular methods for meta-analysis mainly combine one of three types of statistics: p-value [46], effect size [47], and ranked gene lists (“rank aggregation”) [27,33,48]. Ranked lists of genes produced for each study (e.g., ranked by order of p-value for DE of each gene) have been aggregated into a single gene ranking (“consensus”) using a number of methods including the rank product method [48]. A number of methods have been developed to test the statistical significance of results based on combining p-values from each study including Fisher’s method, Stouffer’s method, minP, and maxP. Fisher’s method sums log-transformed p-values, whereas Stouffer’s method sums inverse-normal-transformed p-values, to combine statistical significance across studies. The minP method takes the minimum p-value from combined studies, whereas the maxP method takes the maximum of the combined p-values. Rhodes et al. [49] published one of the first papers to combine p-values from individual studies of DE gene expression using Fisher’s method which found improved statistical significance using the combed analysis compared to individual studies. Combined effect size to generate an estimate of the overall effect size and its confidence interval is frequently used in meta-analysis of clinical research data. Choi et al. [47] described one of the first methods to combine effect sizes using a random-effects modeling approach for combining datasets from individual studies of two groups to form an overall estimate of the weighted effect size. The effect size was measured by the standardized mean difference obtained by dividing the difference in the average gene expression between the treatment and control groups by a pooled estimate of standard deviation. The effect size was used to measure the magnitude of treatment effect in each study and a random effects model was used to incorporate inter-study variability. Meta-analysis methods have been categorized based on the hypothesis settings that gene biomarkers are differentially expressed “in all studies” (HSA), “in the majority of studies” (HSr), or in “one or more studies” (HSB) [33,50]. In Fisher’s, Stouffer’s, and minP method, an extremely small p-value in one study likely meets criteria for statistical significance; thus, it detects DE in “in one or more studies” (HSB), whereas the maxP or rank product method tends to detect gene biomarkers DE in “all studies” (HSA). The choice of the statistical meta-analysis method is selected based on the biological purpose of the analysis. A gene serving as a biomarker from a meta-analysis is expected to show concordant biological effects across all or most experiments for a given condition derived from relatively homogenous sources (e.g., up-regulation of a gene predicting risk of lung cancer detection from lung epithelium biopsied from a cohort of smokers versus healthy non-smokers) [51]. While detecting biomarkers DE in all studies seems an ideal goal, it can be too stringent when the number of samples is large, increasing the heterogeneity of experimental, platform, or biological samples [50]. Meta-analysis methods detecting DE in the majority of samples (HSr) are generally recommended as they provide robustness and detection of relevant signals across the majority of samples [33]. Song and Tseng [52] proposed a robust order statistic, rth ordered p-value (rOP), which tests the alternative hypothesis that there are significant p-values in at least a given percentage of studies. This method detects biomarkers DE in the majority of studies (e.g., >70% of studies) based on a user-specific threshold of studies. 2.2.1. Comparison of Meta-Analysis Methods Several comparative studies systematically comparing meta-analysis methods for microarray data have been previously published [33,53,54]. Chang et al. [33] benchmarked the performance of six p-value combination methods (Fisher, Stouffer, adaptively weighted Fisher, minP, maxP, and rOP), two combined effect size methods (fixed effects and random effects) and four combined ranks methods (RankProd, RankSum, product of ranks, and sum of ranks). The 12 meta-analysis methods were categorized into three hypothesis settings (candidate markers DE in “all” [HSA], “most” [HSr], or “one or more” [HSB] studies) based on their strengths for detecting DE genes. They then applied four statistical criteria to the assessment of each meta-analysis method: (1) detection capability (the number of DE genes detected); (2) biological association (degree of association between DE list with predefined genes from pathways related to the disease), stability (randomly splitting the data and comparing results of the two-meta-analyses) and robustness (effect of including an outlying irrelevant study to the meta-analysis). Among the methods based on HSA setting, the maxP performed the worst based on their four criteria and the investigators recommend that it be avoided. Rank product method had improved performance but weaker detection capability. The two methods that tended to detect DE in the majority of samples were the Random Effect Model (REM) and the rth order p-value (rOP). rOP outperformed REM based on stronger biological association and detection capabilities, but this was achieved at the expense of diminished stability and robustness. It is important to note that differentially-expressed genes determined by combing p-values or ranks obtained by two-sided hypothesis testing may result in genes with discordant DE across two-class outcomes which can be difficult to interpret [27]. Wang et al. [37] have proposed one-sided correction of p-values to guarantee identification of DE genes with concordant DE direction. 2.2.2. Association of Meta-Analysis Method to Outcome Variable The objective and type of outcome types (e.g., two-class, multi-class, survival) [24] will govern the choice of both the test statistic (t-statistic, F-statistic, log-rank statistic) and the meta-analysis method (combing p-values, effect sizes, or ranks). Methods combing effect sizes (standardized mean differences or odds ratios) are appropriate for combining two-class outcomes. Meta-analysis of expression studies with continuous outcomes (e.g., using regression or correlation coefficients) and survival outcomes (based on log-rank statistics) have typically been performed using combined p-values [50,55] and can be performed using the MetaDE package [37]. To capture concordant expression patterns for multi-class outcomes, Lu et al. [52] have applied multi-class correlation (min-MCC) because the F-statistic has been found to frequently fail to capture concordant patterns of gene expression. 2.3. Cross-Platform Normalization Cross-platform normalization (also termed “data merging” [23]) considers all data from experiments across different microarray platforms as a single data set from the same experiment. Direct integration of data sets performed on different microarray platforms may introduce undesirable batch effects due to systematic multiplicative biases [23,32,56]. The level of difficulty present to combine multiple datasets has been termed “dataset complexity” [53]. For example, integrating different Affymetrix platforms is less complex to analyze by meta-analysis or cross platform normalization than datasets performed across very different platforms. Studies using low complexity datasets, mainly from the Affymetrix platform, have directly merged the studies to construct a gene signature [41,57,58,59]. Cross-platform transformation and normalization methods have been developed with an aim to remove the artifactual differences between data from different microarray platforms while preserving the underlying biological differences between conditions. This step is essential, as non-biological differences (“batch effects”) in the gene signature discovery data can obscure real biological differences found between clinical groups. Early attempts at cross-platform merging applied straightforward transformation methods of location and scale (mean and variance) to process the gene expression data from different studies. Batch mean centering [56] is a simple transformative method that standardizes the expression of each gene to have the same center (mean expression). Probe sets can be further transformed to have the same variance or distributions on different platforms [60,61]. While these methods are relatively easy and intuitive, the batch mean centering method has been shown to have only marginal improvement compared to uncorrected data for cross-platform integration of Illumina and Affymetrix data [32]. Probability of expression (POE), a model-based transformation that is estimated based on a method that adopts an underlying mixture distribution that transforms each data value into range [−1,1] has been used for cross-platform merging based on a unified scale as an alternative to using gene-specific summaries [62,63]. While this transformation has been applied for identifying meta-signatures, it has been found to be difficult to compare to other normalization methods [26]. Over the past decade, a number of more complex cross-platform normalization methods have been published and their performance has been compared in several studies [2,32]. Four cross-platform normalization methods found to be generally effective in a comparative review by Rudy and Palafer [2] are the Empirical Bayes (EB) method, known as Combat [64]; the Cross-Platform Normalization (XPN) method [26], Distance Weighted Discrimination (DWD) [65], and the Gene Quantiles (GQ) method developed as part of the WebArrayDB service [66]. Of these four programs, the authors favour DWD and XPN, while the comparative analysis of cross-platform normalization methods on clinical datasets by Turnbull et al. [32] favoured Combat and XPN. We will discuss the results of these comparative analyses in more detail in the following Section 2.3.1. The Distance Weighted Discrimination method, like Support Vector Machines (SVM), is a margin-based classification method that was developed to improve performance over the latter method. Essentially, SVM finds a hyperplane that separates the two classes (i.e., each systematic bias) to maximize the minimum distance of all the data on the hyperplane (the margin). However SVM has data pile-up problems along the margin which have been improved by modifying the margin to maximize the sum of the inverse distance in DWD [67]. DWD adjusts the microarray data by projecting the different batches onto the hyperplane, finding the batch mean and then subtracting out the plane multiplied by this mean. Combat, an empirical Bayes method, estimates parameters that represent the batch effects by pooling information across genes in each batch to shrink the batch effect parameter toward the overall mean of the batch effect estimates across genes [64]. The data are then transformed to remove the effects of the different batch effect parameters across platforms. Combat is performed using either a parametric prior method or a non-parametric method based on the prior distributions of the estimated parameters [68]. Unlike the gene-wise linear approaches of DWD and Combat, the cross-platform normalization (XPN) method developed by Shablin et al. [26] seeks to borrow information across genes and samples via linked row and column clusters in a two-step procedure. First, K-means clustering is used to find blocks of similar genes and samples across the platforms. This approach is robust to the number of row (K) and column (L) clusters. Then, within each block the data is normalized between platforms within this block. The normalized values obtained over multiple clustering performed over repeated runs is then averaged to better capture the data structure. 2.3.1. Comparison of Cross-Normalization Methods A comparative analysis of cross-platform normalization methods by Rudy and Valafar [2] found the DWD classification method to provide effective batch adjustment for microarray data [67] and to be the most robust to variation in treatment group sizes between the platforms with the least loss of treatment information (lower underdetection), while XPN showed the greatest inter-platform concordance [2]. Turnbull et al. [32] also found that XPN had the highest inter-platform concordance. However, they found that DWD removed not only the platform specific systematic bias, but also relevant biological variability between samples (reduced inter-sample variance), while Combat and XPN preserved this biological signal (slightly increased inter-sample variance) while appropriately correcting the platform specific bias (reduced inter-platform variance). The drawback of DWD to “over-normalize” by removing all systematic expression differences between two datasets, including the relevant biological variability has been cautioned by other authors, prompting development of newer methods [60]. Although Combat and XPN have been found to perform well in previous analyses, the user must be cautious when applying this method to datasets that are unbalanced (e.g., different subtypes within each of the batches) as these methods will not be able to distinguish batch effects from biologically-relevant signals [42]. One limitation of some existing cross-platform normalizing methods is that they can only be applied to two batches at a time. While cross-platform normalization steps can be chained together, the effect of these multiple normalization steps or which chaining method is still unclear [60]. 2.3.2. Software and Websites Implementing Microarray Meta-Analysis and Cross-Platform Merging/Normalization Software, including packages in R/Bioconductor and websites allowing users to implement microarray meta-analysis and cross-platform merging and normalization methods are listed in Table 1. Different experiments from multiple different arrays can be directly merged from the CEL files simultaneously using several packages implemented in R [69] including inSilico Merging [70], the CONOR [2], and virtualArray [71]. The inSilico Merging package implements XPN, DWD, and Combat, and the package CONOR additionally implements the GQ method. The virtualArray package allows cross-platform normalization using empirical Bayes methods (default) or the user may select one quantile discretization, normal discretization normalization, gene quantile normalization, median rank scores, quantile normalization, or mean centering [71]. This batch effect removal step can be supervised allowing the user to specify samples into groups based on platform as well as other attributes (e.g., cell type). Before the combined expression data undergoes cross-platform normalization, the data must be transformed to a common scale (e.g., log2) and resolution (e.g., 12, 14, 16, or 20 bit) [71]. As with meta-analysis, low expression and low variance genes are typically filtered out. microarrays-04-00389-t001_Table 1 Table 1 List of software and websites for performing microarray meta-analysis. 2.4. Comparison of Meta-Analysis vs. Cross-Platform Normalization Directly-merged microarray data (or applying cross-platform normalization) has been argued to have better performance than meta-analysis for the identification of robust biomarkers on the premise that “deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data” [57]. In a comparative study, Taminau et al. [23] found significantly more differentially-expressed genes using cross-platform normalization than meta-analysis. An additional advantage of cross-platform normalization is that it allows prediction models applied to a subset of studies to be applied across additional studies from other platforms [27]. While cross-platform normalization has been applied in multiple studies [72,73,74], it has less frequently been used in the literature compared to meta-analysis [2]. A recent comprehensive systematic literature review of studies applying microarray integration methods found that only 27% of the studies directly merged microarray data and this subset of studies were mostly performed on the same platform [27]. One major limitation of existing cross-platform normalization is that they require that every treatment group or sample type be represented on each platform to allow differentiation of treatment effects from platform effects. Furthermore, cross-platform normalization methods do not guarantee elimination of laboratory or batch effects across experiments and Rung and Brazma [3] have argued that microarray meta-analysis provides better control of between-laboratory heterogeneity, which can be estimated using Cochrane’s Q statistic and be correspondingly adjusted.