PMC:550659 / 19133-25224 JSONTXT

    2_test

    {"project":"2_test","denotations":[{"id":"15705192-11309499-8397839","span":{"begin":1787,"end":1789},"obj":"11309499"},{"id":"15705192-12184807-8397840","span":{"begin":3745,"end":3747},"obj":"12184807"},{"id":"15705192-12912826-8397841","span":{"begin":4827,"end":4828},"obj":"12912826"},{"id":"15705192-14960458-8397842","span":{"begin":4829,"end":4830},"obj":"14960458"}],"text":"Discussion\n\nImpact of processing method choice\nThe choice of processing method for Affymetrix array data evidently has a major impact on the ability to confidently report the results of differential expression analysis. The effect is greater, for example, than the choice of using a robust or a non-robust analysis, even in the colon data where robust analysis results in substantial improvements. Differences among processing methods are much greater in the more challenging colon data set compared to the ovary data, yet it should be noted that the sample sizes in the colon data are not atypical in real investigations.\nWhile results from two data sets can never conclusively determine the optimal method, it is notable that across both data sets, using both t-statistic and rank-sum analyses, there is a high degree of similarity in the rank ordering of the methods from the best to the worst performer. The trimmed mean (TM) and Dchip methods consistently perform as well or better than any of the other methods. A possible explanation for this is that the weights used by the Dchip may tend to downweight the least and greatest PM-MM differences, just as the TM method excludes these differences.\n\nInterpretation of FDR comparisons\nWhen comparing array processing methods using experimental data in which the identities of differentially expressed genes are unknown, great care must be taken to ensure that apparent differences in sensitivity are not due to other factors. 
One critical point is that the null distribution providing the expected number of false positives at a given test statistic threshold (the numerator of the FDR) must fairly reflect the statistical behavior of null genes. Permutation approaches have been extensively used to produce empirical p-values (e.g. [14]) and were used by Efron et al. [9] to estimate FDR values. Although permutation approaches are known to be slightly biased for estimating the FDR, the size of the bias (e.g. as shown in figure 5 of Efron et al. [9]) can not explain the magnitude of differences found here. In addition, for a comparative analysis, as carried out here, it is more crucial that the biases be relatively constant across the methods. However, since permutation approaches may not be highly accurate when the sample size is small, it is important to check performance on multiple data sets before conclusions about performance are drawn.\nWhile we have focused on FDR as the basis of comparison, the pursuit of small FDR values is not the only desirable operating characteristic of an array processing method, and other reports have also emphasized the accuracy of estimating the precise size of concentration differences. However to the extent that most actual studies seek to find differential expression between groups, the use of small FDR values seems more instrumental as the basis for judging methods.\n\nVariation due to choice of test statistic\nAlthough our primary aim was to investigate variation in sensitivity due to the seven processing methods, all analysis was carried out independently for two test statistics. The t-statistic is widely used in practice, but is well-known to be sensitive to outliers, particularly when the sample size is small. We found that certain processing methods, particularly EB-GCRMA, had a tendency to produce outlier expression values in the colon data set. 
Thus the combination of using the EB-GCRMA method with t-statistics in the colon data led to particularly poor performance.\n\nVariation due to log transform and array normalization\nIn practice, the approach used for array normalization and for forming log-transformed expression values may be equally or more influential than the method used for producing probe set summaries [15]. In this study, we used implementations of the seven processing methods as prepared by their developers, and thus array normalization and the log-transforms were applied in a method-specific fashion. This provides a comparative analysis of the various methods as they are used in practice, which is most directly relevant since few investigators will override the default normalization and log-transform methods provided by the developers of each method.\nNevertheless it remains of interest whether these routine processing steps are the determining factor of performance. In a future study it will be important to investigate this question further by modifying the implementations of the processing methods so that uniform log transforms and array normalizations are applied.\n\nComparison of methods using data from disease profiling data sets\nA key point that we advocate in this work is that false discovery rates in actual disease profiling data constitute a valuable complement to benchmarking results obtained from spike-in, dilution series, and mixture experiments (e.g. [4,5]). The primary obstacle that must be overcome is that proper null sampling distributions are essential to ensure that the methods are compared on a common basis. 
Since numerous data sets covering a wide range of Affymetrix platforms are available, to the extent that multiple data sets are in agreement about relative performances it is unlikely that the randomization procedure used to calculate FDR values is systematically biased against a particular method.\nIn spite of the statistical challenges in using disease profiling data for benchmarking, we argue that these data sets also offer some unique advantages. Calibration data sets are relatively few in number and are not available for all platforms. Newer platforms in particular are under-represented. Therefore overtraining to the available calibration data through manipulation of the many tuning parameters in the more complicated processing methods is an unavoidable concern. In addition, the calibration data sets likely do not represent the same degree of challenge as disease profiling data in that reproducibility of fold changes for affected and unaffected genes is quite high compared to data from, say, human tissues where a large number of uncontrolled sources of variability are present.\n"}