3.1. The Effect of Normalization on the Number of Differentially Expressed Genes
Table 2 shows the numbers of differentially expressed probes and genes for each dataset. Some datasets contain multiple sub-groups, so several comparisons were needed in order to determine the variation in expression levels. It is also important to note that the number of genes identified by EntrezID is larger than the number of probes identified on the array, because there is a one-to-many relationship between probes and EntrezIDs. This is far from ideal, as it means that there are cross-gene effects in the array.
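The way a one-to-many probe annotation inflates the gene count can be illustrated with a minimal sketch. The probe names, EntrezIDs and the helper `count_probes_and_genes` below are hypothetical and are not taken from the datasets or the analysis pipeline used here.

```python
def count_probes_and_genes(de_probes, probe_to_entrez):
    """Return (number of distinct probes, number of distinct EntrezIDs they map to)."""
    genes = set()
    for probe in de_probes:
        # A probe may map to zero, one or several EntrezIDs.
        genes.update(probe_to_entrez.get(probe, ()))
    return len(set(de_probes)), len(genes)


# Hypothetical annotation table: made-up probe names and EntrezIDs.
probe_to_entrez = {
    "probe_a": ["1001"],
    "probe_b": ["1002", "1003"],
    "probe_c": ["1003", "1004", "1005"],
}

n_probes, n_genes = count_probes_and_genes(
    ["probe_a", "probe_b", "probe_c"], probe_to_entrez
)
print(n_probes, n_genes)  # prints: 3 5
```

In this toy case three probes yield five EntrezIDs, which mirrors the situation in which the gene count exceeds the probe count.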
From the table it is clear that the normalization method affects the number of differentially expressed genes, and the difference can be quite large, amounting to over 1000 genes or about 20% of the total number. This could have a significant impact on any subsequent gene set enrichment analysis.
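One possible way to quantify such differences between normalization methods is to compare the resulting gene lists pairwise. The sketch below assumes that the differentially expressed genes for each method are available as sets of EntrezIDs; the function name `compare_methods` and the toy gene sets are hypothetical and do not reproduce the values in Table 2.

```python
from itertools import combinations


def compare_methods(de_genes_by_method):
    """Compare the differentially expressed gene sets of each pair of methods."""
    rows = []
    for a, b in combinations(sorted(de_genes_by_method), 2):
        set_a, set_b = de_genes_by_method[a], de_genes_by_method[b]
        rows.append({
            "pair": (a, b),
            "count_difference": abs(len(set_a) - len(set_b)),
            "shared": len(set_a & set_b),       # genes found by both methods
            "only_first": len(set_a - set_b),   # genes found only by method a
            "only_second": len(set_b - set_a),  # genes found only by method b
        })
    return rows


# Toy EntrezID sets, invented for illustration only.
toy_sets = {
    "rma": {"1", "2", "3", "4"},
    "gcrma": {"2", "3", "4", "5", "6"},
    "farms": {"3", "4"},
}

for row in compare_methods(toy_sets):
    print(row)
```

Reporting the shared and method-specific genes alongside the raw difference in counts shows whether two methods disagree only in list length or also in list content, which matters for downstream enrichment analysis.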
There is also a general trend towards a larger number of differentially expressed genes in the larger datasets. This reflects the higher power (improved sensitivity) of larger datasets; the much smaller E-GEOD-6044 has a particularly small number of variable genes. However, the p-value distribution (Figure 1) also suggests that noise is becoming an issue in the larger datasets, and that the large number of supposedly differentially expressed genes might be an artifact of those datasets. The E-GEOD-18842 and E-GEOD-43458 datasets both perform very well in the subsequent pathway analysis compared to the datasets where it was necessary to introduce a cut-off to reduce the number of genes being considered. Both of those datasets contained only two sub-groups. Dividing the data into further sub-groups, unless these were part of the original experimental design, is highly controversial and likely to result in an increased number of false positives. Only the E-GEOD-50081 dataset suggests that there are a large number of differentially expressed genes between adenocarcinoma and squamous cell carcinoma, but this dataset was specifically created in order to ask this question.
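A cut-off of the kind mentioned above can be sketched as follows. The text does not state how probes were ranked before the cut-off was applied, so ranking by ascending p-value and the significance threshold `alpha=0.05` are assumptions made for illustration; the function name `apply_probe_cutoff` and the toy p-values are likewise hypothetical.

```python
def apply_probe_cutoff(p_values, alpha=0.05, max_probes=2000):
    """Keep significant probes, capped at the max_probes smallest p-values.

    p_values maps probe ID -> p-value. Ranking by p-value and the value of
    alpha are assumptions, not details taken from the paper.
    """
    significant = [(p, probe) for probe, p in p_values.items() if p < alpha]
    significant.sort()  # ascending p-value, ties broken by probe ID
    return [probe for _, probe in significant[:max_probes]]


# Toy p-values, invented for illustration only.
toy_p_values = {"p1": 0.001, "p2": 0.2, "p3": 0.01, "p4": 0.03}

print(apply_probe_cutoff(toy_p_values, max_probes=2))  # prints: ['p1', 'p3']
```

With a cap in place, the size of the probe list no longer grows with the dataset, which limits the influence of the noisier large datasets on the subsequent pathway analysis.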
Table 2. The number of differentially expressed probes and genes (EntrezIDs) between the two specified conditions for each of the datasets normalized using rma, gcrma and farms. In cases where the cut-off of 2000 probes was used, the number of EntrezIDs is given for these cut-off values.

3