PMC:4979051 / 12152-12153
t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data
Abstract
Microarray data analysis typically consists in identifying a list of differentially expressed genes (DEG), i.e., the genes that are differentially expressed between two experimental conditions. Variance shrinkage methods have been considered a better choice than the standard t-test for selecting the DEG because they correct the dependence of the error with the expression level. This dependence is mainly caused by errors in background correction, which more severely affects genes with low expression values. Here, we propose a new method for identifying the DEG that overcomes this issue and does not require background correction or variance shrinkage. Unlike current methods, our methodology is easy to understand and implement. It consists of applying the standard t-test directly on the normalized intensity data, which is possible because the probe intensity is proportional to the gene expression level and because the t-test is scale- and location-invariant. This methodology considerably improves the sensitivity and robustness of the list of DEG when compared with the t-test applied to preprocessed data and to the most widely used shrinkage methods, Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA). Our approach is useful especially when the genes of interest have small differences in expression and therefore get ignored by standard variance shrinkage methods.
1. Introduction
Microarrays are a widely used methodology for measuring the expression of thousands of genes simultaneously. A common application of microarrays is to compare the expression levels of genes in samples drawn from two different experimental conditions in order to determine which genes are differentially expressed. Typically, a microarray dataset has in the order of ten thousand genes, whereas only a small subset of these genes is relevant, and the number of samples range from a few tens up to a few hundreds. Multiple processing steps are required in order to identify the genes of interest and errors in these steps compromise the reliability of the analysis. As a result, the comparison of the lists of differentially expressed genes (DEG) reported by different groups have revealed very small overlap [1,2]. Therefore, understanding the sources of this lack of robustness and the development of more reliable methods for the identification of the DEG remains a crucial issue.
The preprocessing of the raw probe intensity data constitutes the initial step in microarray data analysis and its goal is to infer a variable that represents the gene concentration. Usually, the preprocessing analysis is performed in three steps: normalization, background correction and summarization. First, the normalization is performed to reduce sources of variation of non-biological origin among the arrays in order to make them comparable. Next, in background correction, the background intensity due to non-specific hybridization and optical noise is inferred and subtracted from the normalized intensity. Finally, in the summarization step, the multiple probe intensities for each probe set is combined into a single gene expression value.
The probe intensity measure Y can be modeled as a combination of background B (due to optical noise and non-specific hybridization) and specific hybridization S [3,4]:(1) Yijg=Bjg+Sijg=Bjg+ϕjgθig where θig denotes the expression measure for the gene g in the ith sample, and the indexes i, j and g represent the sample, the probe and the gene, respectively. This model assumes that the intensity value Yijg increases linearly as θig increases, but the rate of increase of the expression θig is different for each probe j and is represented by ϕjg. It is also assumed that after normalization, the background Bjg is independent of the sample [5].
Background correction is the most critical step because errors associated with the inference of this variable strongly affect the genes with low expression values—therefore decreasing the statistical power of further analysis. In the Affymetrix platform it is common to include an extra probe—referred to as the mismatch—created by changing the middle (13th) base with the intention of directly measuring the effect of background noise. The methods dChip [3] as well as the Affymetrix methods MAS5.0 and PLIER [6] use the mismatch intensity as an estimate of the background. Therefore, Irizarry et al. [4] showed that subtracting the mismatch intensity from the perfect match intensity (Y) results in expression estimates with an exaggerated error, mainly for low expression values. Thus, they proposed a background adjustment step that ignores the mismatch intensities, named RMA [4], which uses the posterior mean E[S|Y] as a background adjustment.
Once a gene expression estimation is obtained, it is common to take the logarithmic transformation of this variable because the difference in transformed data, the fold change, is considered easier to manipulate and interpret. Some preprocessing algorithms return the expression concentration in log scale (like RMA and Plier) while others do not (like MAS5.0). It seems that there is no consensus about what scale should be chosen when using t-test or others ranking methods. Nevertheless, this choice is important because the t-test, for example, is not invariant under monotone transformations, i.e., the results of the test is different if x is replaced by log2x.
After preprocessing, several types of statistical tests can be applied in order to find the differentially expressed genes under two conditions. The independent t-test is the most popular statistical approach to select differentially expressed genes presumably due to its simplicity to implement and interpret. In this test, it is assumed that the data follows a normal distribution and, under the null hypothesis H0, the average of the gene expression in both experimental conditions are the same. A discordance of the data from what is specified in H0 can be quantified as the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed. This probability p is referred to as the p-value and a threshold value α can be chosen so that the hypothesis H0 is rejected if p≤α. Then, a statistical significance can be assigned, which is a statistical assessment of whether observations reflect a pattern rather than just chance. The t variable, which follows a Student’s t-distribution when the normality of the data holds, is defined as: (2) tg=〈θ^gi〉i∈A−〈θ^gi〉i∈Bsg where θ^ig is the inferred expression level by a preprocessing method and 〈θ^gi〉i∈A and 〈θ^gi〉i∈B are the expression averages over the samples i under the conditions A and B, respectively.
The empirical standard deviation is defined as (3) sg=a∑i∈Aθ^gi−〈θ^gi〉i∈A2+∑i∈Bθ^gi−〈θ^gi〉i∈B2 where a=(1/nA+1/nB)/(nA+nB−2) and the constants nA and nB represent the number of samples under the experimental conditions A and B, respectively.
Despite being widely used, the t-test has been subject to criticism in the literature since the error in preprocessed data tend to be asymmetric [7,8,9,10]. As a consequence, the variance estimation is dependent on the expression level because the genes with low expression levels are more affected by the errors in background correction. Because of that, modified versions of the standard t-test have been developed as alternative approaches [11,12,13,14,15]. Those approaches modify the t-test by using a procedure called variance shrinkage, which consists in modifying the denominator of the t variable by combining the gene-specific variance and a predictive variance.
The method Significance Analysis of Microarrays (SAM) [11] is the most popular alternative to t-test and it consists in a modification of the standard t-test by the inclusion of an extra variable s0, added to the pooled variance sg. The extra variable s0 is chosen in order to minimize the dependency of t on the expression level θ^g and the significant genes are identified comparing the modified t variable with a similar variable obtained under random permutations among the samples. Modifications of the t-test based on Empirical Bayes [13,14,16] have also been widely used and have been considered a good choice [9], and the method Linear Models for Microarray Data (LIMMA) [14] is the most widely used.
There is no shortage of more sophisticated alternatives to the t-test. However, given the widespread tendency to use the standard t-test, understanding the reason of its poor performance remains a crucial issue. Here, we show that the standard t-test can be used at the probe level, skipping background correction by using the normalized intensity data directly in the t-test. Several methods have approached this problem at the probe level [17,18,19,20], but all these approaches use background-corrected data. Our methodology outperforms the standard t-test using preprocessed data and the most used shrinkage methods SAM and LIMMA, therefore suggesting that background correction is a major source of error in microarray analysis. The methods were compared in terms of sensitivity by using the Affymetrix spike-in dataset, a commonly used benchmark, and in terms of robustness by using leukemia, breast cancer and multiple myeloma datasets.
2. Methodology
The standard t-test is scale- and location-invariant, i.e., the results of the test do not change if x is replaced by ax+b, where a and b are constants. Because of that, according to the model described in Equation (1), applying the t-test to the normalized intensity data is equivalent to applying the test on the concentration variable θ, as illustrated in the following manipulations:tjg=〈Yijg〉i∈A−〈Yijg〉i∈Bs(Yijg)=〈Bjg〉i∈A+ϕjg〈θig〉i∈A−〈Bjg〉i∈B−ϕjg〈θig〉i∈Bϕjgs(θig)=ϕjg〈θig〉i∈A−〈θig〉i∈Bϕjgs(θig)=〈θig〉i∈A−〈θig〉i∈Bs(θig)j (4)
Note that no log transform was performed in the data and that for each gene g a set of t variables (one for each probe) is obtained. In order to select the differentially expressed genes, we propose to take the median t-values, since the median is more robust than the average in the presence of outliers. Then, the genes can be ranked according to its relevance by the median t-values, which often is enough for selecting a subset of genes as biomarkers candidates.
To estimate the statistical significance of the median t-values of each gene, we define F(t) and f(t) to be respectively the cdf and pdf distribution of a t-variable. Then, for the case of a number of probes equal to n, the pdf distribution of the median t-value g(t) is given by [21]:(5) g(t)=Cn[F(t)]n/2[1−F(t)]n/2f(t)ifnisevenandCn=n!(n/2)!(n/2)!Cn[F(t)](n−1)/2[1−F(t)](n−1)/2f(t)ifnisoddandCn=n!(n−1)/2!(n−1)/2!
Now, for a given t-median value tm, a p-value can be estimated by integrating the pdf g(t) function for t
|