1. Introduction
The discovery of highly reliable biomarkers from high-dimensional microarray data is an important goal in molecular medicine, with wide-ranging clinical applications. Potential roles for biomarkers include early detection of disease in healthy individuals, disease classification, prognosis, prediction of response to therapy, and use as surrogate outcomes in clinical trials [1]. The ideal biomarker is inexpensive, robust, easily interpretable, well-validated, and clinically useful (e.g., improving prognosis or choice of therapy) compared to current standards of practice, meaning that the result is “actionable, leading to patient benefit” [1]. Publicly available microarray data have vast potential as a source for biomarker discovery because of the enormous quantity of existing gene expression data [2,3]. At the time of writing, the Gene Expression Omnibus, a repository of array- and sequence-based expression data, contains 1,413,278 samples run on 14,346 platforms [4]. The most widely known of these platforms include the Affymetrix GeneChips (in situ synthesized oligonucleotide microarrays) and the Illumina high-density bead arrays [5]. While other types of microarrays exist, such as protein and microRNA arrays [6,7], this review will focus on the integration of gene expression data from multiple cDNA microarray platforms as it relates to the discovery of gene signatures that may serve as biomarkers for clinical applications. The integration of multiple data types (e.g., transcriptomic and proteomic data) has been proposed [8]; however, this is also beyond the scope of our paper. While microarrays measure the expression of thousands of genes simultaneously, it is expected that only a small subset of the genes will be associated with the clinical or biological outcome of interest.
This subset of genes, often termed a “gene signature” or “prognostic signature”, has a collective expression pattern that is unique to the outcome of interest and thus has the potential to function as a biomarker [9]. The gene signature is typically composed of far fewer genes (often fewer than 100) than are present on a microarray chip (often more than 20,000), making it feasible for further study using approaches such as quantitative RT-PCR. Point-of-care (POC) devices that rely on transcriptional signatures are gaining momentum as diagnostic tools for routine use in the clinical setting, because their practical and affordable application makes this approach highly accessible in the form of cheaper diagnostic kits [10,11]. Biomarkers suitable for POC monitoring of disease activity are currently lacking. A number of published gene signatures validated using independent samples have been shown to serve as significant predictors of clinical outcome [12,13,14,15]. However, the development of prognostic signatures that are robust and stable (e.g., the same biomarkers are identified in both discovery and validation sets) [16] has proven challenging [17,18,19]. In Section 3, we will discuss recent examples of promising transcriptomic biomarkers for disease diagnosis and prognosis that have been identified using meta-analysis approaches. Published prognostic gene signatures derived from internal validation often show little overlap with genes identified by other study groups [15]. Potential causes of poor reproducibility include differences in sample collection methods, processing protocols, and microarray platforms, as well as patient heterogeneity and small sample sizes [12]. Because samples, particularly human tissue, are difficult and costly to acquire, microarray experiments from single-institution patient cohorts often have small sample sizes.
Predictive models trained on the gene signatures identified from these smaller individual studies are less robust [15,20]. Michiels et al. [21] re-analyzed data from nine studies predicting cancer prognosis and found that the misclassification rate for the gene signature (defined as the 50 genes for which expression was most highly correlated with outcome) was unstable across training sets derived using a re-sampling approach, with performance improving as the size of the training set increased. Integration of multiple microarray data sets has been advocated to improve gene signature selection [22]. Increasing the sample size through integration increases the statistical power to obtain a more precise estimate of (differential) gene expression and to assess the heterogeneity of the overall estimate, and it reduces the effects of individual study-specific biases [23,24,25,26]. Meta-analysis is most commonly applied for the purpose of detecting differentially expressed (DE) genes [27], which may serve as a candidate gene signature or be used as features in classification models to further refine a clinically useful gene signature [28]. Supervised classification techniques (also known as prediction analysis or supervised machine learning) are the most commonly used methods in microarray analysis that lead to identification of clinically useful biomarkers (i.e., gene signatures providing improved discrimination between two or more patient groups) [27]. Classification methods for gene signature selection are beyond the scope of this article and have been reviewed elsewhere [29].
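To illustrate why combining studies can yield a more precise estimate of differential expression, the following minimal sketch pools per-study estimates for a single gene using inverse-variance (fixed-effect) weighting and reports Cochran's Q as a simple heterogeneity statistic. This is only one of several possible combination approaches and is not necessarily the method used in the studies cited above. The function name `pool_fixed_effect` and the example log2 fold changes and variances are hypothetical and chosen purely for illustration.

```python
import math


def pool_fixed_effect(effects, variances):
    """Pool per-study effect estimates for one gene (fixed-effect model).

    effects   -- per-study estimates, e.g. log2 fold changes (hypothetical input)
    variances -- sampling variance of each estimate (must be positive)

    Returns (pooled_estimate, pooled_standard_error, cochran_q).
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Each study is weighted by the inverse of its sampling variance,
    # so more precise (typically larger) studies contribute more.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)

    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations of study estimates from the
    # pooled estimate; large values suggest between-study heterogeneity.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, pooled_se, q


# Illustrative (made-up) values for one gene measured in three studies.
log2_fold_changes = [0.8, 1.1, 0.5]
sampling_variances = [0.20, 0.25, 0.10]

pooled, pooled_se, q = pool_fixed_effect(log2_fold_changes, sampling_variances)
print(f"pooled log2 fold change: {pooled:.3f}")
print(f"pooled standard error:   {pooled_se:.3f}")
print(f"Cochran's Q:             {q:.3f}")
```

Under this model the pooled standard error is always smaller than that of the most precise individual study, which is the sense in which integration improves precision. When Q indicates substantial heterogeneity, a random-effects model may be more appropriate than the fixed-effect model assumed here.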