3.6. Discussion

We have studied the general prevalence and impact of RNA quality, RNA quantity and sequence effects using a large and representative set of microarray samples from the Affymetrix HG-U133a platform. We defined parameters, or used previously defined ones, that quantify each technical artifact based on systematic changes in the microarray intensity signals. We determined thresholds that flag samples of questionable quality, in which the respective factor is associated with, or may even cause, biased expression estimates. The impact of the technical variables on the expression estimates was analyzed by computing their correlations with the first five principal components of the expression space of the HumanExpressionAtlas. The results are summarized in Table 1.

Table 1. Prevalence and impact of technical factors that constitute potential sources of batch effects in gene expression experiments. Columns two and three show the percentage of samples that are critically affected by the respective technical artifact: column two among all samples in the HumanArraySet, column three among the subset of samples selected after quality control performed independently by [7]. The choice of thresholds is justified in the respective subsections. Column four shows the correlation of the technical variable with the principal components of the expression space, and the last column shows how it changes along with known batches.

We found that a large fraction, 10%, of the 8131 samples are so severely degraded that they should be excluded from further analysis. While most of these samples were indeed excluded from the HumanExpressionAtlas, a further 3% of the low-quality RNA samples passed quality control, highlighting the need for a more rigorous assessment of RNA quality in microarray data analysis.
A fraction of 1.6% of the samples have decreased specific transcript levels λ, which relates to low amounts of hybridized RNA. Samples excluded by quality control generally have smaller λ values. Analysis of the β parameter shows that about 4% of the samples have low measurement ranges and that β has almost no impact on the predominant variation patterns in the expression space.

Importantly, variation in RNA quality and RNA abundance was found to have an unexpectedly high potential impact on the gene expression results. Both technical variables correlate strongly with the most common patterns of expression variation in the HumanExpressionAtlas. The RNA quality measure dk correlates with the second principal component, for which a biological interpretation was found previously. The relative specific transcript level parameter λ is highly correlated with another principal component. Given the high prevalence of these artifacts, these technical factors thus constitute major sources of batch effects.

We found that sequence effects are highly variable on the investigated Affymetrix HG-U133a platform. The sequence effect size is particularly low among low-quality samples, where 4% of the samples are affected. A strong guanine-run (GGG1) effect with critical impact on some transcripts is found in 3% of the samples [25]. This effect constitutes an important technical artifact, which, however, affects the expression estimates of only some genes in some of the microarray samples. It does not affect the majority of features and is consequently not a major determinant of systematic variation in the expression space. The overall impact of the studied sequence effects on the expression space is relatively low.

The term “technical artifact” as used in this publication refers to all factors that cause improperly calibrated expression measures, which then do not properly reflect the amount of mRNA in the object of study [39].
Recall that expression microarrays were designed to measure the relative abundance of mRNA in the samples of interest in a gene-specific (or exon-specific) fashion. Consequently, correct expression measures should by definition be independent of characteristics such as the quality or quantity of the RNA used for hybridization, and also of the sequences of the probes interrogating the respective gene. On the one hand, technical artifacts can be caused by improper sample handling (e.g., by using degraded mRNA or an unsuitable amount of mRNA) or improper instrument settings (e.g., of the scanner or fluidics station). On the other hand, technical artifacts can be inherent in the measuring principle of the microarray technology, owing to surface hybridization effects not related to transcript abundance, even if samples and instruments were handled ideally. This means that “biological” effects, such as massive changes in the transcript composition of the cell and/or in the total transcript abundance (for example, with changing malignancy), can also cause systematic variations of the quality measures. Although they are of biological origin, we also classify such variations as “technical” effects because they are related to unwanted side effects of the microarray technology. We have shown in our previous work (see Methods Section) that these side effects can cause systematic errors in the gene-related expression values, with potential impact on the interpretation of the studies in terms of biological function. Hence, the identification of batches of samples that differ in their technical characteristics potentially reflects improper calibration and, ultimately, biased expression measures. In this publication we used the correlation of selected quality measures with the principal components of the expression data as a surrogate measure of the resulting bias; this correlation reflects associations between the expression measures and the respective technical factor.
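The correlation analysis described above can be illustrated with a minimal sketch. It assumes an expression matrix with samples in rows and features in columns and one technical parameter per sample (for example, a degradation or transcript-level measure). The function name `pc_correlations`, the variable names and the synthetic data are hypothetical and are not taken from the analysis pipeline used in this study. Principal components are obtained here by a singular value decomposition of the column-centered matrix, which is one common way to compute them.

```python
import numpy as np


def pc_correlations(expression, technical, n_components=5):
    """Pearson correlation of one technical variable with the leading PCs.

    expression : array of shape (n_samples, n_features), hypothetical input
    technical  : array of shape (n_samples,), one value per sample
    Returns an array of length n_components.
    """
    X = np.asarray(expression, dtype=float)
    t = np.asarray(technical, dtype=float)

    # Center each feature so that the SVD yields principal components.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)

    # Sample scores on the first n_components principal components.
    scores = U[:, :n_components] * s[:n_components]

    # Pearson correlation between the technical variable and each score column.
    tc = t - t.mean()
    sc = scores - scores.mean(axis=0)
    return (sc.T @ tc) / (np.linalg.norm(sc, axis=0) * np.linalg.norm(tc))


# Synthetic illustration (assumed data): the technical variable drives the
# dominant pattern of variation, with independent noise added on top.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
technical = rng.normal(size=n_samples)
loadings = rng.normal(size=n_features)
expression = 3.0 * np.outer(technical, loadings) + rng.normal(
    size=(n_samples, n_features)
)

r = pc_correlations(expression, technical)
print(np.round(r, 2))
```

A large absolute correlation with one of the leading components, as obtained for the first component of the synthetic data, signals an association between the technical variable and a dominant pattern of expression variation. By itself it does not show which of the two drives the other.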
In the worst case, a correlation between a quality measure and a principal component indicates that the change of expression values described by that component is affected, or even determined, by systematic errors caused by the respective technical factor. In the best case, the gradient of expression values is “correct” in the sense that it is not affected by the technical factors. The associated gradient of quality measures can then reflect an independent variation of the respective technical parameters, which may also originate from biological factors such as the malignancy of the tumor samples. In the former situation the expression values and their interpretation are questionable, whereas the latter situation potentially indicates unexpected and possibly interesting systematic changes of biological factors, such as global changes of the total expression level in the cells, which can, for example, shift the relation between absent and present genes. Unfortunately, in our experience the “worst” case is much more common, and it is hard to explain what, for example, the guanine effect or truncated transcripts have to do with changes of transcriptional regulation. On the other hand, the second situation cannot be excluded, owing to interferences between the different quality measures and/or with other, still unknown biological factors. A more detailed analysis should trace the consequences at the gene level to identify the particular causes of the observed biases in gene expression and in the quality measures of the samples studied. Such future studies are of high interest because they can not only help to improve the calibration of microarrays but also enable new insights into the biological mechanisms underlying the observed biases, such as the systematic variation of the total expression level as a function of biological factors. Methods to gradually correct these biases are required at the level of data analysis.
A natural first choice is the set of existing batch effect removal methods (see [4] for a review), which can significantly reduce technical variation. These methods rely on batch information and thus cannot cope with covariation patterns more complex than the available surrogates or sample groupings. For some covariates, additional parameters can be estimated using physico-chemical measures of the microarray hybridization. These parameters can also be used to correct specific technical effects, as we have shown previously for RNA degradation and sequence effects (see [5,25,40] for discussion and software tools). However, such methods are currently available only for a limited set of expression quantification technologies, such as the GeneChip expression arrays. It is therefore of great importance to consider the downstream effects of the unavoidable technical variation already in the lab. Recording and storing independent measures of technical parameters such as RNA quality and quantity should be mandatory for all processed samples. Only with the help of such meta-data can known technical variation be reliably separated from biological variation in large-scale expression studies and integrative analyses. It is highly probable that these factors affect not only the GeneChip microarrays investigated here but also other gene expression technologies such as RNA-seq, for which quality control measures have been established only recently [41].