PMC:2644708 / 8472-42995 JSON TXT


2_test

{"project":"2_test","denotations":[{"id":"19055840-11532216-8161604","span":{"begin":2768,"end":2769},"obj":"11532216"},{"id":"19055840-16646809-8161605","span":{"begin":10353,"end":10355},"obj":"16646809"},{"id":"19055840-16273092-8161606","span":{"begin":16003,"end":16005},"obj":"16273092"},{"id":"19055840-16528362-8161607","span":{"begin":21778,"end":21780},"obj":"16528362"},{"id":"19055840-17312161-8161607","span":{"begin":21778,"end":21780},"obj":"17312161"},{"id":"19055840-17572062-8161607","span":{"begin":21778,"end":21780},"obj":"17572062"},{"id":"19055840-15525513-8161607","span":{"begin":21778,"end":21780},"obj":"15525513"},{"id":"19055840-16363065-8161608","span":{"begin":21954,"end":21956},"obj":"16363065"},{"id":"19055840-11960917-8161609","span":{"begin":21957,"end":21959},"obj":"11960917"},{"id":"19055840-18483557-8161610","span":{"begin":22232,"end":22234},"obj":"18483557"},{"id":"19055840-16199517-8161611","span":{"begin":32536,"end":32538},"obj":"16199517"},{"id":"19055840-18483557-8161612","span":{"begin":33532,"end":33534},"obj":"18483557"}],"text":"Results\n\nVisualizing non-linear technical variation in microarray data\nIn microarray datasets, we expect only a small fraction of the transcripts interrogated by an array to be differentially expressed. Therefore, when utilizing a scatter plot to compare pairs of samples, we expect to see most transcripts centered along the diagonal line. When this is not the case, further normalization may be required. We have examined over 30 publicly available datasets, and found many to contain samples with systematic non-linear distortions apparent in their scatter plots. In this report, we will consider a variety of datasets demonstrating various degrees of non-linear distortions, and the effect of GRSN correction. An example of non-linear distortions between microarray samples within a dataset is shown in Fig. 1A. 
This graph compares two normal samples from a study of the inherited disease, Fanconi Anemia (GB dataset) using patient bone marrow samples run on the Affymetrix® HG-U133A GeneChip®. There is a distinctive curve to the data in the scatter plot (top left panel) when the MAS 5.0 method is used to process the data. This \"frown\" is even more evident when the data is plotted using a standard M vs. A plot (bottom left panel). The M vs. A plot [12] provides an optimal visualization of the ratio of two samples as a function of expression level. In Fig. 1A, columns 2 and 3, although not as pronounced, we also see a systematic skewing of the data when the RMA or dChip methods are used to process this data (most apparent on the M vs. A style plots). Similar distortions can be seen in other samples in this dataset and additional examples from this and other datasets are shown in subsequent figures. We have developed a method called Global Rank-invariant Set Normalization (GRSN) in an effort to reduce this type of non-linear technical variation.\nFigure 1 Visualization of typical non-linear artifacts in microarray data and the GRSN method used to reduce them. A. Visualizing non-linear technical artifacts. Top row – standard log base 2 scatter plots comparing normal sample N3 to normal sample N5 from a clinical study of Fanconi Anemia, (GB dataset). Bottom row – the same data as in the top row, but plotted using M vs. A plots in which M is plotted as a function of A where M = log2(Y) – log2(X) and A = (log2(Y) + log2(X))/2 with X = expression values for sample 1 and Y = expression values for sample 2. The probe set summary methods used are (from left to right): MAS 5.0, RMA, and dChip®. B. A flow chart showing the basic steps followed by the GRSN algorithm to reduce the type of non-linear artifact shown in A.\n\nGlobal Rank-invariant Set Normalization\nGRSN is based on the general idea of rank-invariant genes presented by Li and Wong [4]. 
We extend this idea to select a single, globally rank-invariant set of endogenous genes to be used to normalize all samples in a dataset. These are genes believed to be consistently expressed in all samples within a given dataset and should appear in roughly the same rank order in each sample when sorted by expression level. Importantly, this ordering, or rank, should not be affected by the types of non-linear artifacts that this normalization method is designed to correct.\nAn overview of the GRSN method is shown in Figure 1B. Briefly, all transcripts (representing endogenous genes) are ranked in each sample of a dataset based on expression (as calculated by summarizing probe sets using established methods such as RMA or MAS 5.0). The variance of the rank order for each transcript is then calculated across all of the samples. Transcripts with the highest rank variance are discarded. The remaining transcripts are again ranked and the process is repeated in an iterative fashion. This iteration cycle is important because, for datasets with unbalanced numbers of up and down regulated transcripts, there can be a global shift of transcript rank order caused by the most differentially regulated transcripts. This global shift of the rank order will disappear as the most differentially regulated transcripts (with the highest rank variance) are discarded during the first few iteration cycles. Note that if we require the global rank-invariant set to have rank variance of zero for all transcripts, we will not typically have enough transcripts for an effective calibration curve, i.e. the set of transcripts with exactly the same rank order in all samples is too small. Therefore, the iteration cycle is terminated when a reasonable number of approximately rank-invariant transcripts remain (5000 by default). 
These probe sets are considered the \"Global Rank-invariant Set\" (GRiS).\nA single virtual reference sample is then created by taking the trimmed mean (mean after removing 25% of the values from the top and bottom of the range) expression value (over the entire dataset) for each summarized probe set (transcript), and M vs. A plots are generated comparing each sample to this virtual reference. This provides a visualization of the effect of applying GRSN. Fig. 2, column 1 shows the M vs. A plots comparing sample (N3) to the virtual reference of the GB dataset after the data is summarized using MAS 5.0 (first row), RMA (second row), or dChip (last row). We then generate a M vs. A plot of the identified GRiS transcripts only, comparing expression values from a given sample to expression values from the virtual reference sample (Fig. 2, column 2, blue points). We use lowess [13] to fit a smooth curve through these points (green line). This smoothed curve is used as the calibration curve for this sample. We then calculate an intensity-dependent adjustment for transcripts in each sample which, when applied, will center the sample's GRiS on the horizontal line of the M vs. A plot (at M = 0). Fig. 2, column 2, also shows the GRiS after calibration of this sample (red dots). Fig. 2, column 3, then shows all transcripts after calibration of the sample compared to the virtual reference sample (red dots). This process is repeated for each sample, with a different calibration curve generated each time (using the same GRiS). Using the trimmed mean values of the GRiS as the reference for normalization provides a robust average across all samples so that the linearity of the normalized data is not affected by a few samples with anomalous non-linear artifacts. 
Note that these intensity-dependent adjustments are applied additively to the log scaled data and that this is equivalent to an intensity-dependent scaling of the original, non-log scaled data.\nFigure 2 GRSN corrects non-linear distortions apparent in different summary methods. M vs. A plots demonstrating the GRSN method applied to the MAS 5.0, RMA, and dChip® probe set summary methods. Column 1 shows M vs. A plots comparing one selected sample to the virtual reference sample created by taking the trimmed mean expression value of each probe set in that dataset. Column 2 shows the global rank-invariant set (GRiS) of 5,000 probe sets before GRSN normalization in blue and after normalization in red (note change in y-axis scale). The smoothed curve through the rank-invariant set is shown in green. This is the calibration curve used to normalize the selected sample. Column 3 shows all probe sets after GRSN normalization of the selected sample compared to the virtual reference sample. The sample shown is N3 from the GB dataset. The probe set summary methods used are (from top to bottom): MAS 5.0, RMA, and dChip®.\n\nDetermining global rank-invariant set size\nWhen selecting the GRiS, we aim to minimize the rank order variation among transcripts in the set. We do not attempt to select a set with no rank variation because this would normally result in too few transcripts to define a smooth calibration curve. Therefore, when choosing the size of the global rank-invariant set, we must balance the desire for rank-invariant transcripts with the need for a sufficient number of calibration points. The effect of selecting too few transcripts is demonstrated in Fig. 3A. Here, multiple calibration curves (using different numbers of approximately global rank-invariant transcripts) are graphed for a single sample. The five red curves represent calibration curves generated using GRiS sizes of 100, 200, 300, 400, and 500. At this size range, the curves are erratic and segmented. 
The five green curves represent sizes of 2000, 4000, 6000, 8000, and 10,000. At this size range, the calibration curves smooth out and become more consistent. We conclude that GRiS sizes in the range of 100 to 500 are insufficient, but that sizes in the range of 2000 to 10,000 appear to be adequate.\nFigure 3 Selection of the global rank-invariant set size. A. The effect of selecting different sized Global Rank-invariant Sets (GRiS) on the calibration curves for a given sample. Each curve shows the GRSN calculated adjustment value as a function of expression value (for a given sample). Red curves are from GRiS sizes of 100, 200, 300, 400, and 500. Green curves are from GRiS sizes of 2000, 4000, 6000, 8000, and 10000. The blue curve is from the default GRiS size of 5000. GB dataset comparing Fanconi vs. Normal using MAS 5.0 processed data on left and RMA processed data on the right (notice different y-axis scales). B. The effect of changing the GRiS size on the selection of significantly regulated genes. Each bar represents the difference in the lists of significant genes found due to a change in the size of the GRiS. The red bar represents the default size of 5000 vs. none (not using GRSN) and is meant to show the magnitude of the effect of applying GRSN (a reference point for the other bars). The three blue bars represent, in order from left to right, the difference of 5000 to 10000, 10000 to 15000, and 15000 to 20000. Top Row – GB dataset comparing Fanconi vs. Normal using MAS 5.0 processed data on left and RMA processed data on the right. On left, GSE6802 dataset comparing R vs. C, using RMA. On right, 339RS dataset comparing TA vs. C using RMA. Next, we look at the effect of different GRiS sizes on the detection of statistically significant genes. For each candidate rank-invariant set size, we apply the GRSN method followed by a statistical analysis to identify lists of up and down regulated genes. 
We used the eBayes and topTable functions from the limma [14] package in BioConductor with a FC cutoff of 1.5 and a False Discovery Rate (FDR) cutoff of 0.05 (5%) to select statistically significant genes. The FDR method [15,16] applies to multiple hypothesis testing. It uses calculated P-values to control the rate of false positives expected from a set of statistical tests. To compare two candidate sizes for the GRiS, we compare these lists of genes. For the two up regulated lists, we count the number of genes that are in one or the other list, but not both. We do the same for the down regulated lists and add the results. This gives us the number of genes affected by the change in the rank-invariant set size. In Fig. 3B, we use a bar graph to report the numbers of affected genes when different rank-invariant set sizes are compared. As a reference point, we compare GRSN with a GRiS size of 5,000 to no GRSN normalization (red bar). This serves to quantify the effect of the GRSN method itself. To quantify the \"stability\" at reasonably sized rank-invariant sets, we compare 5 K to 10 K, 10 K to 15 K and 15 K to 20 K (blue bars). The effect of applying GRSN (red bar) is large while the effect of changing the rank-invariant set size above 5 K (blue bars) is small. In summary, the size of the rank-invariant set does not seem to be critical. Any value in the range of 5 K to 20 K should work equally well on the current high-density arrays. However, given that we want to minimize rank variance in our selected GRiS, we use a default size of 5 K (5000) for high density arrays with greater than 20,000 probe sets.\nThe choice of the smoother span supplied to the lowess function (see Methods section) can also affect the calibration curves. We have evaluated a range of values for this parameter (data not shown) and have chosen 0.25 as the default for GRSN. However, in a few cases, this default value may not be optimal. 
For example, some datasets (such as the simulation study presented below) produce a GRiS that is not evenly distributed along the full transcript expression range. In these cases, a larger smoother span may be needed to produce a smooth calibration curve. In the case of the simulation study described below, we chose 0.50 for the smoother span (see Fig. 4A). The tradeoff is that increasing the smoother span can lead to calibration curves that do not properly track the GRiS at the extreme ends of the transcript expression range. We recommend starting with the default value of 0.25, but checking the calibration curves plotted by the GRSN method for continuity with the GRiS (see Fig. 2, column 2).\nFigure 4 Simulation study showing the performance of GRSN. Differential gene expression was simulated starting with a dataset containing 10 biological replicates. The simulation was repeated 100 times and each time the 10 samples were divided into two equal groups, gene expression was simulated in the second group, and a randomly generated, non-linear artifact (skew) was applied to each sample. In each simulation, GRSN was applied to correct the simulated skew. A. First row – sample 8 of 10 from simulation 100 of 100. Left hand panel shows M vs. A plot before simulated skew. Second panel from left shows the introduction of simulated skew. Third panel from left shows GRiS and calibration curve from GRSN process. Right hand panel shows data after GRSN correction of simulated skew. Second row – same as first row, but for simulation 81 of 100. B. GRSN improves gene selection results. Left panel shows true positive gene selection results for simulated up and down genes as indicated before simulated skew is added, after simulated skew is added, and after GRSN is used to correct the simulated skew. Critical portions of the y-axis scale are expanded at the top and bottom of the graph. 
Middle panel shows false negatives with the bottom portion of the y-axis scale expanded at the bottom. Right panel shows false positive results. Data is represented using standard Tukey box plots. C. Average Fold Change (FC) variation. The average FC for each simulated FC value is plotted showing the variation over all simulated genes and 100 simulations. The left third of the graph shows values for simulated data before skew is introduced, the middle third shows values with skew added, and the right third shows values after GRSN correction of skew as indicated. Box plots are shown.\n\nGRSN improves statistical performance in simulated data\nWhen evaluating the performance of GRSN on a given microarray dataset, we are confronted with the typical problem of not knowing a priori which transcripts are truly regulated and by how much. Therefore, we have created simulated datasets where we have artificially introduced differential gene expression so that we do know a priori which genes are regulated and to what degree. We then introduce simulated, systematic, non-linear artifacts (skew) typical of what are seen in real world datasets. This data allows us to evaluate the ability of standard statistical methods to identify the correct up and down regulated genes before the simulated artifacts are introduced, after they are introduced, and after applying GRSN to correct the simulated artifacts. Thus, the performance of GRSN can be evaluated with respect to reducing unwanted variance and improving statistical gene selection performance.\nTo create a relatively realistic simulated dataset, we used a dataset from a cell culture model with 10 biological control replicates (run on Affymetrix® HG-U133_Plus_2 GeneChips® and processed using the RMA method) to obtain typical background variance (the non-linear artifacts for this dataset were relatively small) [17]. 
In the first stage of the simulation we randomly partitioned the samples in this dataset into two equal subsets, A and B. We then randomly selected unique subsets of genes and introduced simulated Fold Changes (FC) in the B samples. 1000 genes were set with a FC of 1.5 up, 500 2 fold up, 300 4 fold up, 200 8 fold up; and then 200 were set down 1.5 fold, 200 down 2 fold, and 100 down 4 fold. This gives a total of 2000 up regulated genes with FC in the range of 1.5 to 8 compared to only 500 down regulated genes with FC in the range of -1.5 to -4 so that both the number and degree of up and down regulation is heavily biased in the up direction. In the second stage of the simulation we added random non-linear skew to each sample. The third and final stage of the simulation was to apply GRSN to correct the skew just added. We have repeated this complete simulation, starting with the random partitioning of the original 10 control samples, 100 times (randomly selecting 100 unique permutations from the 252 possible permutations). Figure 4A shows M vs. A plots demonstrating typical skews introduced in a selected sample in two of the 100 different simulations. This figure shows a selected sample compared to the virtual reference sample both before and after the introduction of a simulated skew, and then shows the effect of applying GRSN to correct the simulated skew (compare these plots to Fig. 2).\nAt each stage of each simulation (after simulated FC is introduced, after simulated skew is added, and after GRSN is applied to correct the simulated skew), the Standard Deviation (SD) within replicates and the average FC between A and B sample subsets is calculated for each gene. A goal of GRSN is to reduce the SD among replicates. As shown in Table 1, the average SD among replicates is highest in the data with simulated skew and is substantially reduced when GRSN correction is applied. 
The SD after GRSN correction is almost identical to the SD for the original data before simulated skew is introduced (see Table 1). In addition to removing unwanted technical variation, it is important to preserve biologically relevant variation. In this simulation, the biologically relevant variation is the simulated FC introduced in sample set B. Here we calculate the average for all simulated FC ranges up or down across all simulations. In our study, the average FC value stays relatively constant (within 2–3%) at each stage of the simulation (see Table 1), demonstrating that GRSN does not adversely affect the relevant variation (also see Fig. 4C).\nTable 1 GRSN reduces standard deviation while preserving introduced fold change in a simulated data study. Average Standard Deviation (SD) among replicates and average Fold Change (FC) are reported for each stage of our simulated data study (see text for description). SD values for each gene are calculated separately for sample set A and sample set B and then averaged across both sample sets and all 100 simulations. FC values are calculated for each gene by taking the average for sample set B and dividing by the average for sample set A then averaging over all 100 simulations. The values reported in each column are (from left to right) 1) the average SD for all genes, 2) the average SD for the up and down regulated genes, 3) the average FC for up regulated genes, and 4) the average FC for down regulated genes. Values reported in the top row are for data with simulated FC only. Values in the middle row are for data with simulated skew added. The bottom row reports values after GRSN correction of the simulated skew. Next we evaluated the effects of the introduced skews and GRSN correction on statistical gene selection performance in our simulated datasets. Statistically significant genes were selected with eBayes using a FC cutoff of 1.2 and a FDR cutoff of 0.05. 
We evaluated the numbers of True Positive (TP) (genes with actual simulated FC), False Positive (FP), and False Negative (FN) genes found at each stage of the data simulation for each of the 100 simulations run. Figure 4B shows the results using box plots showing the range of gene selection results across all 100 simulations. The statistical results from the data with simulated artifacts, but no GRSN correction, vary widely from simulation to simulation, resulting in a substantial reduction in identified true positives (middle data set in left plot), and an abundance of false negatives and false positives (middle data sets in middle and right-hand plots). False negatives are more common than false positives due to the random nature of the introduced skew. However, GRSN corrects these issues and the results both before the simulated artifacts and after the simulated artifacts have been corrected with GRSN are very stable (Fig. 4B, compare left and right-hand data sets in each box plot).\nWe also evaluated the ability of GRSN to preserve the Fold Change (FC) values introduced in the above simulation. We tabulated the average FC for each range over all 100 simulations. This tabulation was done for each stage of the simulation: after simulated FC, after simulated skew, and after GRSN correction. Box plots were used to summarize the results for each FC range and each stage. As seen in Fig. 4C, the variation in FC for each simulated FC range is increased substantially by the simulated skew, but the application of GRSN restores both the mean FC and the variation in FC to values very close to the pre-skew values.\n\nGRSN corrects non-linear distortions in representative microarray datasets\nWe have investigated the application of GRSN on a wide variety of microarray datasets including clinical sample datasets, cell culture datasets with various treatment modalities, and genetic mouse model datasets [18-21] [see Additional file 1]. 
Two examples are shown in Figure 5A with RMA pre-processing. 1) A mouse model (MKM dataset) of carcinogenesis in cultured clonal keratinocytes [22,23]. Samples were run on the Affymetrix® MOE430A GeneChip®. This dataset represents cell culture based experiments with minimal biological variance between replicate samples. 2) A study of limb development in a mouse model (SS dataset) courtesy of Dr. Scott Stadler at OHSU [24]. Samples were run on the Affymetrix® MOE430A GeneChip®. This study compares mutant vs. wild type mice with three female and three male replicates for each condition. In both cases we see a reduction in the systematic intensity-dependent artifacts observed in these samples with application of GRSN (Fig. 5A, right-hand column).\nFigure 5 GRSN corrects non-linear artifacts in representative microarray datasets. A. GRSN applied to two different microarray datasets. First row – late stage sample L3 from the MKM dataset. Second row – mutant Male sample MutM2 from the SS dataset. Columns 1–3 demonstrate the effect of GRSN on the selected samples as described in figure 2 above. The RMA probe set summary method was used in each. B. GRSN can reduce systematic non-linear artifacts which can affect fold change analysis regardless of pre-processing method. M vs. A plots showing fold change as a function of mean value and plotted on log base 2 scale. Both fold change and mean are calculated using multiple replicates, 14 FA samples and 11 Normal samples from the GB dataset (not just comparing two samples). A lowess smoothed curve is displayed to show the trend of the scatter plots. Three different summary methods are shown: Top row – MAS 5.0, Middle row – RMA, and Bottom row – dChip®. The results in the left column are without GRSN applied and the effect of applying GRSN to each of the respective methods is shown in the right column. 
When datasets are analyzed for Fold Change between two experimental conditions where each gene's average FC between conditions is plotted versus its average expression for both conditions on M vs. A plots, we also often see non-linear skewing in the data even after averaging replicate samples and regardless of the pre-processing method. This is again likely resulting from the systematic, intensity-dependent artifacts which have no biological significance and it appears at least in some cases to be exacerbated in datasets containing unbalanced numbers of up or down regulated genes. For example, Fig. 5B shows M vs. A plots of the GB dataset comparing 14 Fanconi Anemia samples to 11 normal bone marrow samples before and after applying GRSN. In this example, non-linear distortions are seen without GRSN correction when MAS 5.0 pre-processing is used (top row, left panel), as well as when RMA pre-processing (middle row, left panel), and dChip pre-processing (bottom row, left panel) are used. However, applying GRSN substantially reduces this skew in all cases (Fig. 5B, right-hand panels). Thus, in some datasets, FC assessments can be affected by non-linear artifacts even when averaging multiple, replicate samples and regardless of the probe set summary method used. For each summary method shown, GRSN effectively reduces this skew. The same results are seen with additional datasets [see Additional file 2].\n\nReduction of systematic variation by GRSN\nThe goal of GRSN is to reduce systematic non-linear variation in microarray datasets. GRSN is very successful at this task as demonstrated with simulated data. However, there is also random variation in any microarray dataset and this random variation tends to be larger than the systematic variation addressed by GRSN. As a result, applying GRSN will not reduce the variation of all genes and the variation of some genes will actually increase due to the random nature of the non-systematic variation. 
Still, in most cases, GRSN will reduce the average variation among replicates as shown in Fig. 6. The main benefit seen from this reduction in average variance is in the genes with relatively small random and biological variations. These genes are at the largest risk of becoming false positives due to systematic non-linear artifacts. An example of this is seen in the SS dataset (Fig. 7C).\nFigure 6 GRSN reduces average variance in datasets. Lowess curves are plotted summarizing the variance (for log base 2 scaled data) of all genes among selected sets of replicate samples and whole datasets. The curves show the trend in the variance as a function of expression values. Dashed blue is RMA processed data and dashed red is RMA processed data with GRSN post processing. A. RS dataset showing variance reduction in Control samples, Myc samples, and control and Myc samples combined. B. SS dataset showing, WT samples, mutant samples, and WT and mutant samples combined.\nFigure 7 GRSN impacts gene discovery. Averaged fold change M vs. A plots as in Figure 5B, but with color coding added to show genes passing fold change and statistical thresholds for significant differential regulation between experimental conditions. Statistical thresholds reported in this figure are for FDR adjusted p-values from standard t-tests. A) GB data comparing 14 Fanconi Anemia samples to 11 Normal samples and plotted using values from RMA method alone (left panel) and using values from RMA with GRSN (right panel). Both plots are color coded to show genes found to be significantly changed (FC of at least 1.5 and FDR of no more than 0.05): genes found only when using RMA alone are in blue, genes found only when using RMA with GRSN are in red, and genes found in both cases are in yellow. The horizontal colored lines show the fold change cutoff applied to the respective summary and normalization methods. 
B-C) Color coding is modified so that blue genes are shown only in the left panel and red genes are shown only in the right panel. B) GSE6475 data comparing 6 AL (acne lesion) replicates to 6 AN (acne normal) replicates. Samples are plotted as in A. C) SS data comparing 6 mutant samples to 6 wild type (3 male and 3 female for each condition). No FC threshold is applied in this example and the FDR threshold is set to 0.10. D) GSE7664 data comparing 8 bt (treated) to 8 med (untreated) samples plotted as in A.\n\nImplications of GRSN for gene discovery\nAn important goal of many microarray-based studies is the identification of genes with statistically significant differential expression between experimental conditions. As seen with the simulated data study and as shown in real data sets in Fig. 5B, the non-linear skew seen in some datasets is likely to significantly impact standard statistical methods for selecting differentially regulated genes. To analyze this we compared statistical results before and after applying GRSN normalization to a number of datasets. We selected significant up and down regulated transcripts that pass a Fold Change threshold of 1.5 and a False Discovery Rate threshold of 0.05 (similar results are obtained using FDR values ranging from 0.01 to 0.20). In Fig. 7A, we use the same M vs. A plots as in Fig. 5B, but add color coding to visualize genes selected as statistically up or down regulated between two sample classes. The \"S\" shaped skew in the data affects both the calculation of statistical significance and the calculation of fold change. In Fig. 7A, genes from the GB dataset summarized with RMA are color coded based on meeting both a statistical and a fold change cutoff. Transcripts found significant only when GRSN is not applied are indicated in blue, transcripts found significant only when GRSN is applied are indicated in red, and transcripts found significant in both cases are indicated in yellow. 
The blue horizontal lines indicate the FC threshold applied to select the blue transcripts (left panel) and the red horizontal lines indicate the FC threshold applied to select the red transcripts (right panel). In this case, it can be seen that the skew is pushing large groups of genes in or out of the selected fold change range. In Fig. 7B–D the color coding is modified to show the blue genes only on the left and red genes only on the right and in Fig. 7B there are more blue genes lost with the application of GRSN than there are red genes gained. This result is misleading, because the median p-value for the yellow genes (genes found in both cases) has decreased (improved) from 0.00023 to 0.00021 with the application of GRSN. The reason less genes are found after applying GRSN is due to the FDR adjustment. The p-value required to meet the 0.05 FDR threshold before application of GRSN was 0.0015 while the required p-value after GRSN application was 0.0010. Therefore, the p-value threshold associated with the given FDR value became more stringent after applying GRSN. This is most likely due to the distribution of p-values for genes that did not make the FC cutoff. The cause of this is well illustrated in Fig. 7C with the SS dataset. Here, statistics alone, with the FDR threshold reduced to 0.10 and with no fold change cutoff, are used to select genes. In this case, it appears that there are large groups of genes that are detected as statistically significant due solely to the effect of the \"S\" shaped skew in the data. In fact, applying GRSN in this case reduces the total number of genes selected from approximately 1,800 to only 171 genes significantly up regulated and 295 genes significantly down regulated (1,344 significant genes were removed and only 20 added). 
These large numbers of false positive results will cause overly optimistic FDR calculations for all genes and removing these false positive results with the use of GRSN results in fewer genes passing the FDR cutoff even when the actual p-values have improved. In Fig. 7D there are a significant number of red genes added when GRSN is applied and no blue genes lost. In this case, the median p-value for the yellow genes improved significantly from 0.000076 to 0.000042 while the p-value required to meet the FDR threshold changed from 0.00020 to 0.00070. In this case, the FDR threshold became less stringent with the application of GRSN. Presumably the benefit from a decrease in variance among replicates out weighed any bias in the FDR calculation introduced by the removal of false positives (no false positive are shown because they are \"masked\" by the FC threshold). In summary, systematic distortions in microarray datasets are likely to adversely impact statistical calculations leading to unreliable gene selection results.\n\nGRSN improves downstream pathway analysis using Gene Set Enrichment Analysis (GSEA)\nIn addition to examining the effects of GRSN on variance and statistical gene selection we have used the GSEA tool [25] to further analyze the effects of GRSN on downstream microarray data analysis. GSEA looks for the enrichment of known pathways (sets of genes) in the \"gene signature\" of a particular experiment. Part of the power of GSEA is that it considers the rank and significance of all genes in the gene signature. Therefore, GSEA will benefit both from an increase in True Positives and in a decrease in False Positive gene selection results with the use of GRSN. We have applied GSEA to both the RS and the SS datasets (using the 'R' implementation, version GSEA.1.0.R). The ability of GSEA to detect pathways shown to be relevant in each of these datasets is evaluated both with and without the use of GRSN. 
As shown in Table 2, both the Normalized Enrichment Score (NES) and the False Discovery Rate (FDR) for these relevant pathways are consistently improved and in some cases, pathways are only detected when GRSN is used. In particular, VEGF is identified as an important player in the SS study [24] but the associated \"vegfPathway\" is only identified by GSEA when the data is normalized with GRSN. Also, the RS study involves expression of the c-Myc oncoprotein, which is known to induce cell cycle, cell proliferation, cell growth, DNA damage, cell death, and HTERT (see references in Table 2). GSEA identifies all of these pathways to be enriched with c-Myc expression compared to control and as shown in Table 2 all of these pathways are detected at a higher NES and much more significant FDR value when GRSN is used.\nTable 2 GRSN aids Gene Set Enrichment Analysis (GSEA). GSEA is applied to the SS and RS datasets. Pathways known to be active in these datasets are shown and referenced. For each selected pathway, the Normalized Enrichment Score (NES) and the False Discovery Rate (FDR), as reported by GSEA, are shown. NES and FDR values are shown both for data processed with RMA alone (Without GRSN) and for data processed with RMA followed by GRSN (With GRSN) as indicated.\n\nD"}
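
The GRiS selection, virtual reference, and M vs. A quantities described in the text field above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `drop_fraction` parameter (the text does not say how many transcripts are discarded per iteration), and the reading of the trimmed mean as removing 25% of values from each end are all hypothetical. The lowess calibration step is not shown.

```python
from statistics import mean, pvariance


def ranks(values):
    """Rank (0 = lowest) of each value within one sample; ties broken by index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def select_gris(expr, target_size=5000, drop_fraction=0.05):
    """Iteratively discard the transcripts with the highest rank variance.

    expr: list of samples, each a list of log2 expression values (one per transcript).
    target_size: stop when this many transcripts remain (5000 is the stated default).
    drop_fraction: ASSUMPTION, fraction of remaining transcripts dropped per iteration.
    Returns the sorted indices of the approximately rank-invariant transcripts.
    """
    keep = list(range(len(expr[0])))
    while len(keep) > target_size:
        # Re-rank only the surviving transcripts in every sample.
        per_sample = [ranks([sample[i] for i in keep]) for sample in expr]
        # Variance of each transcript's rank across samples.
        rank_var = [pvariance([r[j] for r in per_sample]) for j in range(len(keep))]
        n_drop = max(1, int(len(keep) * drop_fraction))
        n_drop = min(n_drop, len(keep) - target_size)
        by_var = sorted(range(len(keep)), key=lambda j: rank_var[j])
        keep = sorted(keep[j] for j in by_var[: len(keep) - n_drop])
    return keep


def trimmed_mean(values, trim=0.25):
    """Mean after removing a fraction `trim` of values from EACH end (assumed reading)."""
    s = sorted(values)
    k = int(len(s) * trim)
    return mean(s[k: len(s) - k])


def m_vs_a(sample, reference):
    """M = log2(Y) - log2(X), A = (log2(Y) + log2(X)) / 2 for log2-scaled inputs.

    Taking Y = sample and X = reference is an assumption of this sketch.
    """
    m = [y - x for x, y in zip(reference, sample)]
    a = [(y + x) / 2 for x, y in zip(reference, sample)]
    return m, a
```

In this sketch the virtual reference for a transcript would be `trimmed_mean` applied to that transcript's values across all samples, and `m_vs_a` would then be evaluated for each sample against that reference before a smoother is fitted through the GRiS points.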
