2.1. Pre-Processing and Quality Control Prior to Integrative Analysis
Ramasamy et al. [24] identified key issues and steps for performing a meta-analysis, including identifying suitable microarrays, pre-processing and preparing the individual datasets, selecting a meta-analysis method, and interpreting the results. A systematic review of microarray meta-analysis studies found that the criteria for including or excluding microarray studies are mostly subjective and ad hoc, and that the choice of criteria remains an open question in the field [27]. We highlight two critical pre-processing steps here: (i) removing arrays of poor quality and (ii) determining the relationships between probes and genes. Identifying microarrays of poor quality is essential prior to integrative analysis because including poor-quality studies may reduce statistical power and adversely affect the outcome of the meta-analysis [27,33]. A number of quality assessment packages are available in Bioconductor, including Simpleaffy [34] and affyPLM [35] for Affymetrix arrays. The MetaQC package provides six quality control measures that identify problematic studies across multiple platforms; the causes of lower quality in these studies can then be assessed to decide whether to exclude them from the meta-analysis [36,37].
Another important pre-processing step is ascertaining which probes represent a given gene within and across the different microarray platforms. The relationship between probes and genes may be determined either by mapping probes to genes using sequence-matched datasets or by using gene-level identifiers, such as the Entrez Gene ID available in the annotation packages in R/Bioconductor [38], to unify the microarray datasets. Sources of high-quality probe re-annotation include alternative chip definition files (CDFs) for Affymetrix [39], and ReMOAT (Re-annotation and Mapping for Oligonucleotide Array Technologies) with its associated R/Bioconductor annotation packages for Illumina [40]. Only genes that are present on all of the platforms being integrated remain for further analysis; those absent from one or more platforms are “lost”, which reflects the tradeoff between increasing sample size and power and decreasing the number of genes analyzed [32]. Co-inertia analysis, a multivariate analysis method that describes the common trends or co-relationships between datasets of two conditions, has been applied to determine the loss of information incurred by reducing the number of genes to the subset common to the different platforms [41]. Imputing expression values for genes that are present in some datasets but not in others has also been proposed, so that these genes can still be included in predictive models [42].
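The restriction to genes shared across platforms can be illustrated with a minimal Python sketch. It assumes each platform's annotation is already available as a simple probe-to-gene lookup; the platform names, probe IDs, and gene IDs are invented for illustration, and the helper functions are hypothetical rather than part of any of the packages cited above.

```python
from collections import defaultdict


def probes_to_genes(probe_to_gene):
    """Group probe IDs by gene identifier, skipping unannotated probes."""
    genes = defaultdict(list)
    for probe, gene in probe_to_gene.items():
        if gene is not None:
            genes[gene].append(probe)
    return dict(genes)


def common_genes(platform_annotations):
    """Return the gene IDs found on every platform and the genes each platform loses."""
    gene_sets = {
        name: set(probes_to_genes(annotation))
        for name, annotation in platform_annotations.items()
    }
    shared = set.intersection(*gene_sets.values())
    lost = {name: sorted(genes - shared) for name, genes in gene_sets.items()}
    return sorted(shared), lost


# Toy annotation tables (made-up identifiers, not real probe or gene IDs).
annotations = {
    "platform_a": {"a_01": "G1", "a_02": "G1", "a_03": "G2", "a_04": "G3", "a_05": None},
    "platform_b": {"b_01": "G1", "b_02": "G3", "b_03": "G4"},
}

shared, lost = common_genes(annotations)
print("Genes retained:", shared)
print("Genes lost per platform:", lost)
```

Counting the entries in `lost` gives a first, crude view of the tradeoff described above: every additional platform can enlarge the sample size but can only shrink the set of shared genes.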
When multiple probes match a single gene, selecting the probe with the highest interquartile range (IQR) has been recommended [43]. Genes with low mean expression across most studies are typically filtered out prior to meta-analysis. Turnbull et al. [32] applied relatively strict filter thresholds in their microarray integration analysis, based on prior findings that genes with low or intermediate expression have poorer inter-platform reproducibility than highly expressed genes [17,44]. Furthermore, in a meta-analysis of two Affymetrix array studies using an effect size model, incorporating a quality measure based on detection p-values estimated from the Affymetrix arrays into the study-specific test statistics produced more biologically meaningful results than an unweighted model [25,45].
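As a rough sketch of the two filtering steps described above, the following Python example keeps the probe with the largest IQR for each gene and then drops genes whose selected probe has a low mean expression. It is a simplification that works on a single dataset rather than across studies; the expression values, identifiers, and the `min_mean` threshold are all assumptions chosen for illustration, and the quartile definition (the "inclusive" method of the standard library) is only one of several in use.

```python
from statistics import mean, quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of expression values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def select_probe_per_gene(expression, probe_to_gene):
    """For each gene, keep the probe whose expression values have the largest IQR."""
    best = {}  # gene -> (probe, IQR of that probe)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe without a gene annotation
        score = iqr(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probe, score)
    return {gene: probe for gene, (probe, _) in best.items()}


def filter_low_expression(expression, selected, min_mean):
    """Drop genes whose selected probe has a mean expression below min_mean."""
    return {
        gene: probe
        for gene, probe in selected.items()
        if mean(expression[probe]) >= min_mean
    }


# Toy log-scale expression values for four samples (made-up numbers).
expression = {
    "p1": [5.0, 6.0, 7.0, 8.0],
    "p2": [6.0, 6.5, 7.0, 7.5],
    "p3": [1.0, 1.2, 0.8, 1.0],
}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}

selected = select_probe_per_gene(expression, probe_to_gene)
kept = filter_low_expression(expression, selected, min_mean=3.0)  # hypothetical threshold
print("Probe chosen per gene:", selected)
print("Genes kept after filtering:", kept)
```

In this toy data, probe `p1` is preferred over `p2` for gene G1 because its values are more spread out, and gene G2 is removed by the mean-expression filter.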