Visualizing non-linear technical variation in microarray data
In microarray datasets, we expect only a small fraction of the transcripts interrogated by an array to be differentially expressed. Therefore, when a scatter plot is used to compare a pair of samples, we expect most transcripts to lie along the diagonal. When this is not the case, further normalization may be required. We have examined over 30 publicly available datasets and found that many contain samples with systematic non-linear distortions apparent in their scatter plots. In this report, we consider a variety of datasets that show various degrees of non-linear distortion, and the effect of Global Rank-invariant Set Normalization (GRSN) correction on them.
An example of non-linear distortion between microarray samples within a dataset is shown in Fig. 1A. This graph compares two normal samples from a study of the inherited disease Fanconi Anemia (GB dataset), in which patient bone marrow samples were run on the Affymetrix® HG-U133A GeneChip®. When the MAS 5.0 method is used to process the data, the scatter plot shows a distinctive curve (top left panel). This "frown" is even more evident when the data are plotted using a standard M vs. A plot (bottom left panel). The M vs. A plot [12] provides an optimal visualization of the ratio of two samples as a function of expression level. In Fig. 1A, columns 2 and 3, we also see a systematic, although less pronounced, skewing of the data when the RMA or dChip methods are used to process these data (most apparent in the M vs. A plots). Similar distortions can be seen in other samples in this dataset, and additional examples from this and other datasets are shown in subsequent figures. We developed the GRSN method in an effort to reduce this type of non-linear technical variation.
Figure 1  Visualization of typical non-linear artifacts in microarray data and the GRSN method used to reduce them. A. Visualizing non-linear technical artifacts. Top row – standard log base 2 scatter plots comparing normal sample N3 to normal sample N5 from a clinical study of Fanconi Anemia (GB dataset). Bottom row – the same data as in the top row, shown as M vs. A plots, in which M is plotted as a function of A, where M = log2(Y) – log2(X) and A = (log2(Y) + log2(X))/2, with X = expression values for sample 1 and Y = expression values for sample 2. The probe set summary methods used are (from left to right): MAS 5.0, RMA, and dChip®. B. A flow chart showing the basic steps followed by the GRSN algorithm to reduce the type of non-linear artifact shown in A.
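The M and A definitions given in the Fig. 1 caption can be illustrated with a short Python sketch. This is a minimal example using only the standard library; the function name `ma_values` and the expression values are hypothetical and chosen purely for illustration. They are not taken from the GB dataset or from the authors' software.

```python
import math


def ma_values(x, y):
    """Compute M and A for two samples of expression values.

    x: expression values for sample 1 (all must be positive)
    y: expression values for sample 2 (all must be positive)
    Returns two lists: M = log2(y) - log2(x) and A = (log2(y) + log2(x)) / 2.
    """
    if len(x) != len(y):
        raise ValueError("both samples must contain the same number of values")
    m, a = [], []
    for xi, yi in zip(x, y):
        if xi <= 0 or yi <= 0:
            raise ValueError("expression values must be positive to take log2")
        log_x, log_y = math.log2(xi), math.log2(yi)
        m.append(log_y - log_x)        # log ratio of sample 2 to sample 1
        a.append((log_y + log_x) / 2)  # average log expression level
    return m, a


# Made-up expression values for three probe sets (illustration only).
x = [64.0, 256.0, 1024.0]
y = [128.0, 256.0, 512.0]
m, a = ma_values(x, y)
print(m)  # [1.0, 0.0, -1.0]
print(a)  # [6.5, 8.0, 9.5]
```

In this toy example M falls from 1 to -1 as A increases, so the log ratio depends on the expression level. In a dataset free of such intensity-dependent distortion, M would be expected to scatter around 0 across the whole range of A.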

G