PMC:4979052 / 1032-6690
Annotations
2_test
{"project":"2_test","denotations":[{"id":"27600351-17132828-69477546","span":{"begin":510,"end":511},"obj":"17132828"},{"id":"27600351-20838408-69477547","span":{"begin":1366,"end":1367},"obj":"20838408"},{"id":"27600351-17597765-69477548","span":{"begin":1550,"end":1551},"obj":"17597765"},{"id":"27600351-22851511-69477549","span":{"begin":2423,"end":2424},"obj":"22851511"},{"id":"27600351-20379172-69477550","span":{"begin":4959,"end":4960},"obj":"20379172"},{"id":"27600351-19015125-69477551","span":{"begin":4961,"end":4962},"obj":"19015125"}],"text":"1. Introduction\nHigh-throughput transcriptome profiling is an essential technique in biomedical research. Countless gene expression studies have been performed using various available microarray platforms, including large-scale and meta studies comprising hundreds of experiments. The latter are enabled by the large and ever-growing number of samples archived in public data repositories. For example, about 250,000 new microarray samples have been publicly released and indexed by the ArrayExpress database [1] each year since 2010.\nThe general aim of gene expression experiments is to find systematic expression changes relating to variations in biological or environmental conditions. Systematic expression changes due to technical factors, however, constitute a bias that can negatively affect the reliability of the expression results. Technical factors often change between groups of samples (“batches”) where the technical variables, for example, could be that two different lab workers handle the sample groups, or that the experiments are carried out at different dates or locations.\nThese batch effects are a major issue in microarray data analysis, and correlation of such factors with the biological variables of interest can prevent identification of the true biological source of variation and render the results of a microarray experiment worthless [2]. An anecdotal example for this is the following case. Akey et al. 
reanalyzed a large microarray-based study intending to assess gene expression variation between human populations [3]. They found that the largest expression changes were not between the populations but rather between groups of samples processed at different time points—79% of genes were found to be differentially expressed between processing years but within the same population. This large amount of variation can hardly be reasonably explained by biology. Akey et al. concluded that the data possesses a systematic and confounding technical bias, and that the reliability of the obtained results is at least questionable.\nIn face of the widespread occurrence and the negative impact on the reliability of the outcome, it raises the question about the origins of batch effects. Currently they are often handled merely as a statistical issue (i.e., as an unspecified source of variation) that can be detected within experimental data using a hypothesis- and thus model-based approach [4]. To identify and study the various sources of batch effects, together with their prevalence and their impact, is as important as it is challenging. One can assess the presence of a batch effect in a data set by testing for correlations between a potentially confounding factor and the expression measurements, but this requires that one has information on the factors potentially varying between the groups of samples. In practice, only very few of those factors are recorded in the course of an experiment, and one is often left with no more than the experimental date or location. In lack of other meta-data, which ideally should have been stored alongside the experiment, date and location of measurement are frequently used as surrogate variables for the assessment of batch effects.\nThe causal “true” sources of technical variation, however, are more likely to relate to other factors such as the specifics of the used hybridization buffers and instruments, or the quality of the amplified RNA. 
Some of these experimental factors can be assessed by relying on available primary data—raw probe intensities in the case of microarrays. Making use of the particular design of the devices and protocols indicators of technical properties of a specific microarray hybridization can be extracted from the primary data [5].\nWe here aim to investigate the general prevalence and the impact on the expression results for a number of technical factors using a large and representative set of microarray samples. We will rely on the common Affymetrix HG-U133a platform, focusing on variations of RNA quality, RNA quantity and sequence effects, which we suspect to constitute potential sources of batch effects. These technical factors are usually expected to be (a) constant in an ideal experiment and (b) largely independent of the biological variation of interest. Consequently, specific metrics for these factors represent covariates potentially explaining the unwanted technical differences in the expression results. We will not consider general quality-control (qc) metrics such as GNUSE [6] which are well suited to detect low quality microarray samples in general.\nOur analysis uses the HumanArraySet representing a large number of publicly available microarray samples of this platform (see also Methods Section). After strict quality-control, 5372 of these 8131 samples have been selected for the HumanExpressionAtlas, a large collection of human gene expression measurements that can be queried via the ArrayExpress web service [7,8]. This data allows us to investigate technical parameters in sample sets representative for human microarray experiments and in subsets either passing (qc-included) or failing (qc-excluded) quality control. Furthermore, this allows studying the impact of the technical factors on the expression data of the gene expression atlas. 
In particular, we focus on quality metrics characterizing RNA degradation, RNA quantity, and sequence biases, including multiple-guanine effects. We further estimate effect sizes to address the important question of consequences, i.e., how strongly the common variation among these factors affects the results of large-scale gene expression studies."}