> top > docs > @ewha-bio:85

@ewha-bio:85 JSONTXT

Comparative Statistic Module (CSM) for Significant Gene Selection Comparative Statistic Module(CSM) provides more reliable list of significant genes to genomics researchers by offering the commonly selected genes and a method of choice by calculating the rank of each statistical test based on the average ranking of common genes across the five statistical methods, i.e. t-test, Kruskal-Wallis (Wilcoxon signed rank) test, SAM, two sample multiple test, and Empirical Bayesian test. This statistical analysis module is implemented in Perl, and R languages. Microarray technology is a powerful approach for revealing the patterns of coordinately regulated genes. Due to the considerable amount and intrinsic variation of microarray experiment data, statistical approaches have been used as a way to obtain useful biological information (Lobenhofer et al., 2001). Most common task in analysis of microarray data is to infer differentially expressed genes among two or more samples. Various statistical methods for identifying differential expression have been suggested to determine significant genes out of the whole list of genes. According to the various modeling assumptions underlying specific tests, statistical methods are divided into three classes; parametric, nonparametric, and semiparametric (Cui and Churchill, 2003). Different statistical methods using a single set of data may produce slightly or considerably different results due to specific statistical assumptions and their data dependency. A significant gene by one statistical method could be insignificant by another (Pan, 2002). No single statistic is universally optimal and there seldom exists any basis or guidance for a particular statistical method of choice. Therefore, it is important not only to select as many differentially expressed genes as possible but also to identify statistically meaningful genes. We implemented Comparative Statistic Module (CSM) as a module for previously developed cDNA Microarray data Analysis and Management System (cMAMS) (Kim et al., 2004). CSM provides genomics researchers with robust significant gene selection and suggests more reliable one among the five statistical methods by comparing the statistic ranking order of the common genes from different statistic results. The used statistical analysis methods are t-test (Lowry, 2004), Kruskal-Wallis test (also known as Wilcoxon signed rank test in case of two sample analysis) (Lowry, 2004), SAM (Tusher et al., 2002), two sample multiple test (Ge et al., 2003; Dudoit and Ge, 2004), and Empirical Bayesian model (Efron et al., 2001). A statistical test consists of two steps. The first step is to build a summary test statistic. The next step is to determine the significance level with the given test statistic. Usually, the significance level is constructed under the specified or estimated modeling assumption. In statistical analysis of microarray data, a relatively less sufficient sample size than other types of statistical data could mislead model assumption. The simplest method called ‘fold change determination’ for differential expression is to evaluate the log ratio by more than a certain arbitrary cut-off value. It is known to be unreliable for various reasons. Recently many sophisticated statistical methods have been proposed, but it still remains to be controversial how accurate these methods are. We considered the five well-known methods to identify the statistically significant genes and to choose a more reliable method of choice if needed. T-test is a widely used statistical method. However, its major problem is the strong assumption on the null distributions of the test statistics. The Wilcoxon rank-sum test is a nonparametric alternative to the two-sample t-test which is based on the order in which the observations from the two samples fall. When the assumption of the two-sample t-test holds, the Wilcoxon test is somewhat less likely to detect a location shift than the two-sample t-test is (Lowry, 2004). Two sample multiple test uses an adjusted p-value for complementing large multiplicity problems in which thousands of hypotheses are tested simultaneously (Dudoit and Ge, 2004). There are many criteria for calculating the adjusted p-value in two sample multiple test. But it is hard to know which criterion is the best. Efron and colleagues (2001) developed a simple Empirical Bayes test. This approach produces reliable a posteriori probabilities of activity differences for each gene, starting with a minimum of a priori assumptions. Significance Analysis of Microarray (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, i.e. the false discovery rate (Tusher etal., 2002). CSM for differentially expressed gene is implemented using the aforementioned five statistical methods. These methods determine significant genes by two experimental groups. This module presents commonly selected genes first and proposes a method of choice among the five statistical methods by assigning ranks by calculating the average rank of the commonly selected genes (Table 1 and Fig. 1). Each statistical method lists up the significant genes and their ranks at a specified significance level. The specific rank of the same gene could be very different across the five statistical methods due to difference in specific statistical assumptions. Some significant genes by one statistical method could be insignificant by another, but some genes could be commonly listed in every test. We suggest that these common genes listed in every test would be more reliable than those listed in a single statistical test result. Experimental confirmation is desired to be performed first with these robustly listed common genes. A user inputs a data set containing gene expression profiles. The CSM module computes the rank of each gene in each test satisfying a certain significance level of p-value (e.g. p=0.05) which is given by a user. It outputs the commonly selected gene set including all significant genes in each test, average rank of each test, and method of choice rank among the five tests (Fig. 2). We provide the ranks of the five tests to decide a method of choice for given input data. We performed a sample run for the CSM module with ALL/AML data set (Golub et al., 1999). The result is shown in Table 1. The number of common genes is 663 and SAM is suggested as the method of choice with the given data. CSM helps to choose an optimal statistical method for significant gene selection by providing a brief summary of comparative result for the five statistical methods. The current version aims to increase the statistical significance of differentially expressed genes that might become interesting targets for further studies. The implemented CSM is added as a module to cMAMS with R sources ( http://www.r-project.org ). Despite the fact that p-value is not suitable for small sample size data as in microarray experiments, in which the sample size is less enough to ensure normal distribution, the four tests except SAM in this module use p-value as significance level measures. Future version will attempt to convert each p-value into corresponding False Positive Rate (FDR) to improve the statistical robustness for identifying common genes and to decide a method of choice in the five test methods.

projects that include this document

Unselected / annnotation Selected / annnotation
testing (0)