PMC:2644708 / 2348-8470
Annotations
2_test
{"project":"2_test","denotations":[{"id":"19055840-16124883-8161589","span":{"begin":296,"end":297},"obj":"16124883"},{"id":"19055840-12925520-8161590","span":{"begin":755,"end":756},"obj":"12925520"},{"id":"19055840-12925520-8161591","span":{"begin":1080,"end":1081},"obj":"12925520"},{"id":"19055840-11532216-8161592","span":{"begin":1130,"end":1131},"obj":"11532216"},{"id":"19055840-12538238-8161593","span":{"begin":1261,"end":1262},"obj":"12538238"},{"id":"19055840-11532216-8161594","span":{"begin":1317,"end":1318},"obj":"11532216"},{"id":"19055840-15283861-8161595","span":{"begin":1500,"end":1501},"obj":"15283861"},{"id":"19055840-17646307-8161596","span":{"begin":1609,"end":1610},"obj":"17646307"},{"id":"19055840-12925520-8161597","span":{"begin":1811,"end":1812},"obj":"12925520"},{"id":"19055840-12538238-8161598","span":{"begin":1813,"end":1814},"obj":"12538238"},{"id":"19055840-12582260-8161599","span":{"begin":1815,"end":1816},"obj":"12582260"},{"id":"19055840-15702196-8161600","span":{"begin":2042,"end":2043},"obj":"15702196"},{"id":"19055840-11532216-8161601","span":{"begin":3900,"end":3901},"obj":"11532216"},{"id":"19055840-15693945-8161602","span":{"begin":4365,"end":4367},"obj":"15693945"},{"id":"19055840-17288599-8161603","span":{"begin":4918,"end":4920},"obj":"17288599"}],"text":"Background\nGiven the large volume of data generated by microarray technology and the many sources of variation involved, including not only biologically relevant variation, but also technical variation that results from sample preparation and labeling, hybridization, and other processing steps [1], methods for analyzing and interpreting the results are very important. An important component of these technologies is redundancy. For example, the Affymetrix® GeneChip® platform employs many features (probes) to interrogate each gene (transcript). 
These redundant probes are called a \"probe set\" and summarizing each probe set to arrive at a robust value representing the abundance of the associated transcript is one of the first steps in any analysis [2]. Another important step is normalization to remove systematic array-to-array variation such as differential hybridization and scanning artifacts. Some popular processing methods for these steps are the MAS 5.0 algorithm [3] developed by Affymetrix®, the Robust Multichip Average (RMA) method developed by Irizarry et al. [2], and the dChip® method developed by Li et al. [4]. Each of these include normalization methods: MAS 5.0 uses a simple global scaling factor [3], RMA uses quantile normalization [5], and dChip® uses a rank-invariant set-based method [4]. By definition, the global scaling method of MAS 5.0 cannot handle non-linear artifacts. Furthermore, it may not even produce the optimal linear scaling factor as suggested by Lu [6]. Still, the MAS 5.0 method is commonly used and even preferred by some researchers for some applications [7]. RMA has a number of advantages over MAS 5.0 including quantile normalization, which is a mathematically elegant solution for setting the intensity distributions equal for all arrays in the dataset [2,5,8]. However this is not always appropriate and may cause problems when the assumption of equal distributions is not met, for example when more probe sets are up-regulated than down-regulated as discussed by Freudenberg et al. [9]. dChip® uses a rank-invariant set-based normalization method on probe-level data; however the dChip® approach selects one reference sample and compares all other samples to it, selecting a different rank-invariant set to normalize each sample. 
Using our proposed method based on a global rank-invariant set to create a robust average reference, which we refer to as Global Rank-invariant Set Normalization (GRSN) and applying it to the summarized probe set data from dChip®, we see a further reduction in specific types of systematic array-to-array variation.\nFor the purpose of this discussion, we will classify the variation of microarray data into three categories: biological, random, and systematic. Biological variation is of interest to the researcher and may contain many different components. The random and systematic categories are both forms of technical variation. Random variation has no biological relevance and is the result of uncharacterizable measurement errors. Systematic variation also has no biological relevance, but is characterizable as a function of expression value. We are interested in systematic variation because it can be \"modeled\" and removed. For example, if any of the microarray processing steps (labeling, hybridization, scanning, etc.) are non-linear functions of transcript abundance, and the conditions affecting these non-linear functions change from array to array or with biologic condition due to unbalanced gene expression, then systematic variation will be introduced between arrays and/or conditions. The goal of this work is to graphically show and mathematically remove this type of variation, which we refer to as \"non-linear artifacts\" or \"skew\". We use rank-invariant transcripts as reference points to detect and remove these non-linear artifacts. This is not a new concept; for example, Li and Wong [4] use rank-invariant probes for normalization in their dChip software. However, we have extended this idea by selecting a global set of endogenous, rank-invariant, transcripts to generate average reference points used to normalize all samples in a dataset. We also apply our normalization as an additional step after using existing methods for summarizing probe sets. 
The efficacy of additional normalization at the \"probe set level\" has been advocated by others [10]. Using a global rank-invariant set (as opposed to selecting a new rank-invariant set for each sample) reduces the risk of introducing noise into the dataset. Applying our method as a post probe set summary method allows us the flexibility of using our favorite probe set summary method. In addition, the summarized probe set values should give the best estimate of true gene expression, so it makes sense to use these values when selecting the rank-invariant set. With the use of optimized probe set definitions as described by Sandberg and Larson [11], this may become even more beneficial.\nIn our experience with microarray datasets, both from human cancer studies and animal models, it is common to have unbalanced up- or down-regulation of gene expression between two sample populations. When using the standard data processing methods discussed above, we often see an intensity-dependent skew when comparing conditions in such data. This skew, in turn, introduces errors in further statistical analysis and in the calculation of fold change. These errors will bias the results of gene selection based on statistics and fold change and can lead to the detection of \"statistically significant\" genes that are not in fact differentially expressed. GRSN is a simple yet robust method for reducing this type of distortion and minimizing the chances of obtaining misleading analysis results. We use simulated data to show that GRSN reliably reduces non-linear skew even when actual gene expression is highly unbalanced. GRSN does not introduce bias into the dataset by trying to balance the number or magnitude of up- and down-regulated genes. As a result, GRSN performs well on a wide range of datasets, including datasets with as few as two samples."}