Poor Correlation Between the New Statistical and
the Old Empirical Algorithms for DNA Microarray Analysis
DNA microarray is currently the most prominent tool for investigating large-scale gene expression data. Different
algorithms for measuring gene expression levels from scanned images of microarray experiments may significantly impact the following steps of functional genomic analyses. Affymetrix® recently introduced high-density microarrays and new statistical algorithms in Microarray Suit (MAS) version 5.0®. Very high correlations (0.92 - 0.97) between the new algorithms and the old algorithms (MAS 4.0) across several species and conditions were reported. We found that the column-wise array correlations had a tendency to be much higher than the row-wise gene correlations, which may be much more meaningful in the following higher-order data analyses including clustering and pattern analyses. In this paper, not only the detailed comparison of the two sets of algorithms is illustrated, but the impact of the introducing new algorithms on the further clustering analysis of microarray data and of possible pitfalls in mixing the old and the new algorithms were also described.
DNA microarray is currently the most prominent and widely accepted tool for investigating genomic-scale expression data in an extraordinarily efficient fashion. Increasing number of high throughput gene-expression monitoring technologies are now available. Moreover, there have been series of efforts to share microarray experiment data from different experiments (Brazma et al., 2001; Edgar et al., 2002; Gardiner-Garden and Littlejohn, 2001; Strausberg eta/. 2001).
One important issue is that there are already several technologies for large-scale assays of gene expression without well-established way of cross-platform utilization of data from different technologies (Kuo et a/., 2002). Combining expression measurements across technologies was shown to be non-trivial (Aach et al., 2000). The ratio form of measurements for cDNA microarray complicates a direct comparison with results from other technologies because of the inevitable dependence of the ratio values on the reference sample (Aach etal., 2000).
Affymetrix® recently introduced new statistical algorithms with Microarray Suite (MAS) version 5.0® for monitoring GeneChip® expression data. The new algorithms provide statistical significance (p-values) and confidence limits for Jog ratio values (fold changes), tunable parameters that enable users to vary the stringency of the analyses as well as elimination of negative expression values observed with the old empirical algorithms ( http://www.affymetrix.com/support/technical/technotes/stat istical_algorithms_technote.pdf and http://www.affymetrix . com/support/technical/technotes/statistical_reference_guid e.pdf).
The algorithms were changed from the old empirical to the new statistical ones and the terminology was adapted to fit more standard terms. The AvgDiff (Average Differ ence) that had been used for empirical expression analysis was changed to the ‘Signal’. The Signal is calculated using the One-Step Tukey' s Biweight Estimate, which yields a robust weighted mean that is in relatively sensitive to outliers. The Tukey’s biweight method gives an estimate of the amount of variation in the data, exactly as standard deviation measures the amount of variation for an average. MAS 5.0 subtracts a stray signal’ estimate from the PM (Perfect Match) signal that is based on the intensity of the MM (Mis-Match) signal. However, in cases where the MM signal outweighs the PM signal, an adjusted value is used.
These adjustments will eliminate negative values. The AbsCall is now replaced by the Detection value, which is associated with a Detection p-value based on a ranking user-defined value.
Affymetrix® reported very high correlations (0.92-0.97) between the new statistical algorithms (MAS 5.0) and the old empirical algorithms (MAS 4.0) across several different conditions and species in support of the new algorithms. Based on the results of the comparison study, it seems to be safe enough to apply the new algorithms to ongoing studies. Databases that have not been archived cannot be read under MAS 5.0. However one can use the MAS 5.0 to re-read the CHP file and recreate the database.
We have experienced, however, that the array-by-array type comparison study, which was done by the Affymetrix®, tends to show higher correlation than probe-by-probe type comparison, which may be much more meaningful in that it is used for the functional analysis of genes and gene groups and their transcriptional regulations demonstrated by higher-order data analyses including clustering, pattern analyses, and data mining. In this study, we analyzed the expression values produced by the two different sets of algorithms (i.e., MAS 4.0 and 5.0) from the same microarray data sets independently experimented in our laboratory. Thus, this is, to our best knowledge, the first formal evaluation study applied to the two different sets of algorithms, MAS 4.0 and 5.0, applied to the same data sets. We describe possible pitfalls in mixing the old and the new algorithms as well as detailed characterization of the results.
Two mouse models of cardiomyopathies were tested by Affymetrix MGU74A GeneChip® that contains 12,654 probe elements as a part of cardiogenomics PGA (Programs for Genomics Application supported by NHBLI). It was discovered that 25% of the probes on the MGU74 array set were incorrectly oriented and therefore unusable. A mask file that could be interpreted by MAS to leave blank fields for incorrect probe sets was generated and this mask file has been applied to all our text files that have been generated by these arrays.
We have used the most commonly used surgical intervention for pressure-overload induced hypertrophy, coarction of the transverse aorta (Rockman et al., 1991). The system has been well characterized and proven to be highly reproducible with a low mortality rate of 10% or less.
Aortic banding is an excellent model system to evaluate the process of development of left ventricular hypertrophy in response to hemodynamic stress.
FVB wild-type mice were obtained from Charles River laboratories and operated at 12 weeks of age following the experimental procedure (Rockman et al., 1991). Banded animals and sham-operated (i.e., chest opened surgically, but no banding of the aorta) controls were dissected 1 hour, 4 hours, 24 hours, and 48 hours after the procedure to analyze immediate early and early responses. In addition, the hypertrophic response was examined at one week and eight weeks after the intervention during the chronic phase. All time points have triplicate experiments such that we ran 36 MGU74A chips in total (3 (replicates) * 6 (time points) * 2 (conditions)).
We created benchmark sets of expression profiles at different developmental stages to monitor changes in cardiac gene expression over time. Most of our transgenic and experimental models of cardiac hypertrophy and heart failure are generated in an FVB strain genetic background. We have collected the ventricular portion of the heart at the following stages: Embryonic day 12.5, Neonatal day 1, 1 week of age, 4 weeks of age, 3-5 months of age, 1 year of age. For the embryonic stage, three hearts were pooled for each experiment. Two hearts were pooled at the neonatal stage. Adult hearts were analyzed individually. For adult mice at 3-5 months and one year of age, we have collected hearts separately for female and male mice. We did not discriminate between sex pre-puberty. All timepoints have triplicate experiments such that we ran 18 MGU74A chips in total (3 (replicates) * 6 (time points)).
Pearson correlation coefficients were calculated for genes, arrays, and across all 361,548 matched pairs of measurements. Table 1 shows the correlation between MAS 4.0 and MAS 5.0 reads for both banding and FVB experiments. The correlation coefficients (means and standard deviations) for the overall and array-by-array comparisons were very high (i.e., > 0.9), corresponding well to the previous report. The correlation coefficients for the genes, however, showed remarkably lower values (i.e., < 0.6) than those for the arrays or overall experiments. Fig. 1 shows the density distributions of the correlation coefficients for genes (Fig. 1 (a) and (d)) and arrays (Fig. 1 (b), (c), (e), and (f)), demonstrating very different
distributions between MAS 4.0 and 5.0 reads (i.e., Fig. 1 (a) vs. (b) and (d) vs. (e), p<10 6 ). Correlation coefficient between MAS 4.0 and 5.0 reads for both banding and FVB experiments exhibited similar left-skewed density distributions. The x-axis ranges of Fig. 1 (b) and (e) were scaled from (c) and (e), respectively, to match those of (a) and (d), respectively, to help visual analysis.
The remarkable difference between the array-by-array correlation and the probe-by-probe correlation of MAS 4.0 and 5.0 can be very problematic. It is strongly suggested that even when the array-by-array correlation between MAS 4.0 and 5.0 is very high, it is not safe to believe that the correlation between matched probes may also be high. Moreover, the middle-range correlation coefficients of 0.5057 (0.2726) and 0.4867 (0.2926) between MAS 4.0 and 5.0 for our banding and FVB experiments, respec tively, suggest that an expression pattern of a gene across a certain set of conditions can be read quite differently by MAS 4.0 and by 5.0. It may again affect gene-gene pattern correlations that are massively used for the mainstream microarray data analysis such as clustering and microarray data mining. The high level of correlations between arrays may not guarantee the same high level of correlations
between genes or patterns in MAS 4.0 and 5.0
.
Although there is no such study like this, to our best knowledge, comparing different algorithms applied to the same oligonucleotide microarray data, several cross technology comparison studies were recently performed to evaluate the possibility of using data interchangeably across different platforms. Gene features such as GC-
content, sequence length, intra-chip sequence similarity, and average intensity were suggested to be possible candidates affecting the degree of correlations of matched pairs of genes (Kuo et al., 2002).
To determine gene-related features associated with the degree of correlation, we calculated correlation coefficients for different conditions. Fig. 2 shows scatter plots of average signal intensities (x-axis) and the correlation coefficients between MAS 4.0 and 5.0 (y-axis) of matched genes for banding (Fig. 2 (a)) and FVB (Fig. 2(b)) experiments. Average signal intensity is defined as the mean of all the intensity values (i.e., both AvgDiff values for MAS 4.0 and Signal values for MAS 5.0) across conditions. As shown in Fig. 2, the correlation coefficient between MAS 4.0 and 5.0 is suggested to be a function of average signal intensity.
Fig. 2 suggests that there may be more than one linear component in the association structure between the correlation coefficient and average signal intensity. MAS 4.0 and 5.0 provide presence call tags for each probe. “A” call means “Absent” call, meaning that the corresponding
target sequence may not be expressed. “P” and “M” calls mean “Present” and “Marginal” calls, respectively. According to the call tags, we stratified all the probes into predominantly “A”, “M”, or “P” calls. A probe is a predomi nantly “A” probe when the majority of the calls for the probe across all the experiments have “A” tags, and so on. The banding experiments showed no predominantly “M” call and the FVB only one predominantly “M” call. As shown in Table 2, there were 6,355 predominantly “A” calls and 3,688 predominantly “P” calls for the Banding experiment. FVB experiment showed 5,691 predominantly “A” calls and 4,351 predominantly “P” calls.
Table 2 demonstrates the correlation coefficients (means and standard deviations) between MAS 4.0 and 5.0 for the stratified subsets of both banding and FVB experiments. Although “A” call was originally designed to represent noisy signal, it shows reasonably high correlation in comparison to the “P” and overall calls. Moreover the slopes of the regression line of the predominantly “A” calls exceed those of the predominantly “P” calls for both experiments. It is suggested that “A” calls in the oligonucleotide microarray experiments may have more information than just noise. Density plots for the average intensities of predominantly “A” and “P” calls in Banding and FVB experiments are shown in Fig. 3.
Fig. 4 demonstrates the relationship between average signal intensity and the correlation coefficient between
MAS 4.0 and 5.0 for the matched probes in the pre dominantly “A” and “P” subgroups. For the predominantly
“A” and “P” calls of the banding experiments, the correlation coefficient between average signal intensity and the correlation coefficients of MAS 4.0 and 5.0 was 0.3073 and 0.2108, respectively. For the predominantly “A” and “P” calls of the FVB experiments, the correlation coefficient between average signal intensity and the correlation coefficients of MAS 4.0 and 5.0 was 0.0636 and -0.0849, respectively. One importance of this finding is that although the correlation between the old and the new algorithms seems to increase with signal intensity, it may show no or even negative correlations for highly expressed genes.
To evaluate to possible influence of the different algorithms, MAS 4.0 and 5.0, on the clustering solutions, we measured the concordance of clusters obtained from MAS 4.0 and 5.0 by hierarchical clustering, which is one of the most prominent mainstream analyses of microarray experiments, Jaccard and Kappa coefficients were computed as the measure of clustering concordance (Fig. 5). Although the MAS 4.0 and 5.0 results are basically the same data sets from the same set of scanned images, the concordance did not seem to near perfect. It is also demonstrated that although the clustering concordance of array-by-array clustering (Fig. 5(a)) may be moderately good, the concordance of probe-by-probe clustering (Fig. 5(b)) can be poor between the two sets of algorithms. Analysis for the FVB experiments showed similar results (data not shown).
To evaluate the degree of pattern similarity between the two results of MAS 4.0 and 5.0 readings, we tried to empirically measure the probability of the pattern pairs of a single gene to be in the same cluster in cluster analysis. We combined both results to create a double-sized data file and obtained clusters by K-means clustering with increasing number of clusters, one hundred times for each step. Then we counted the number of pattern pairs that are separated in their cluster membership. A pattern pair is a pair of gene expression patterns of a gene across a set of conditions (i.e., row-wise gene expression pattern) or of an
array including whole set of probes (i.e., column-wise whole array expression pattern), one of which is read by MAS 4.0 and the other by 5.0.
Fig. 6 shows the number of separated pattern pairs of row-wise clustering (black) and column-wise clustering (blue) for banding (Fig. 6 (a)) and FVB (Fig. 6 (a)) experiments. The numbers of objects (i.e., genes or arrays) were matched to objectively compare the degree of pattern similarity between the array (i.e., column-wise) and gene patterns (i.e., row-wise). That is, we selected 24 arrays (12 conditions * 2 (MAS 4.0 and 5.0)) for banding and 12 arrays (6 conditions * 2 (MAS 4.0 and 5.0)). Matched
numbers of genes were selected for both experiments. Each simulation was done with 300 times of random selections. Fig. 6 demonstrates that K-means clustering creates more separated pattern pairs in gene clustering than in array clustering. This finding corresponds well to our initial finding of poor probe-by-probe correlation in comparison to array-by-array correlation (Table 1). The high level of correlations between arrays may not guarantee the same high level of correlations between genes or patterns in MAS 4.0 and 5.0.
Array-wise clustering has often been used for molecular classification of diseases or cell lines (DeRisi et al., 1996; Tamayo eta/., 1999; Bittner et a/., 2000). Gene probe-wise clustering can be used for the functional analysis of genes and gene groups and their transcriptional regulations
(Heller eta/., 1997; Iyer eta/., 1999; Tavazoie etal., 1999). There have been series of efforts to share microarray experiment data from different experiments (Brazma et a/., 2001; Edgar et a/., 2002; Gardiner-Garden and Littlejohn, 2001; Strausberg et al., 2001). It is highly likely in the very near future that researchers may share enormous microarray data set for variety of meta-analyses such that having reliable standardized method to interpret the data becomes an important issue. This study suggests that data read by the old and the new algorithms could not be directly combined in the context of related series.
Clearly, our study and observations are limited to the particular experiments and the data set. Although our evaluation is limited to 54 Affymetrix GeneChip® arrays, our study is based on exactly the same data set. The difference, in theory, comes only in the reading algorithms, MAS 4.0 and 5.0. Two well-established mouse models of cardiomyopathies were experimented with experienced researchers. The analysis for the array-wise correlation corresponds very well to the original Affymetrix® report (Kuo eta/., 2002). Our finding is limited to the unexpectedly low correlation in gene probe-wise correlation analysis than in array-wise analysis and the observed gap between the two types of analyses.
The observed result revealed poor correlations, and in some instances, negative correlations. The results illustrates that even if both old and new algorithms show high correlation in the array-wise comparison, it dose not necessarily guarantee the same high level of correlation in gene probe-wise comparison. Because gene clustering with hierarchical and partitional clustering algorithms to reveal genomic-scale expression pattern analysis is one of the most reliably and frequently accepted analysis methodology (Eisen et al., 1998), the probe-wise correlation may be even more important and the reported high correlation in support of the new algorithms may mislead investigators to mix the old and the new analyses results without carefully evaluating the possible source of confusion in the following higher order analyses.
We also tried to quantitatively measure the impact of the discrepancies introduced by the new algorithms on the clustering solutions. Our result demonstrates that the concordance of clustering results for the two different algorithms may not be safely high and that the concordance of clustering may be even lower in the row wise clustering than the column-wise clustering (Fig. 5). Moreover, the similarity distance between pattern pairs of the same gene read by the two algorithms may be greater than some of the nearest patterns such that the pattern pairs, when mixed into a single data set, can be separated into different clusters (Fig. 6). The probability of pattern pairs to be separated in clustering analysis was higher in
gene clustering than array clustering.
Without further source of data, we are not in a position to determine which of these two versions is the better. There may be issues in both algorithms. Our research suggests that more research on the hybridization dynamics and signal intensity measures in variety of conditions is needed. At some point, researchers may have to agree on a standardized way of experimenting, preprocessing, measuring, storing, and analyzing microarray studies.
|