PMC:545083 / 797-804

A novel Mixture Model Method for identification of differentially expressed genes from DNA microarray data

Abstract

Background

The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. The mixture model method (MMM hereafter) is a nonparametric statistical method often used for microarray processing applications, but it is known to over-fit the data if the number of replicates is small. In addition, the results of the MMM may not be repeatable when dealing with a small number of replicates. In this paper, we propose a new version of the MMM that ensures the repeatability of the results across different runs and reduces the sensitivity of the results to the parameters.

Results

The proposed technique is applied to two different data sets: the Leukaemia data set and a data set that examines the effects of a low-phosphate diet on regular and Hyp mice. In each study, the proposed algorithm successfully selects genes closely related to the disease state, as verified by biological information.

Conclusion

The results indicate 100% repeatability in all runs and exhibit very little sensitivity to the choice of parameters. In addition, the evaluation of the applied method on the Leukaemia data set shows a 12% improvement compared to the MMM in detecting the 50 biologically identified expressed genes reported by Thomas et al. The results attest to the successful performance of the proposed algorithm in quantitative pathogenesis of diseases and comparative evaluation of treatment methods.

Background

Recently, microarray technology has provided the means for simultaneous screening and analysis of thousands of genes. Although an enormous volume of data is being produced by microarray technologies, the full potential of such technologies cannot be realized without the ability to sift through the noisy signals to obtain useful information.
The large data sets produced by microarray technology have resulted in the need for reliable, accurate, and robust methods for microarray data analysis. A major challenge is to detect genes with differential expression profiles across two experimental conditions. In many studies, the two sample sets are drawn from two types of tissues, tumours or cell lines, or at two time points during the course of a biological process. The computationally simple methods used for such analysis, including the methods of identifying genes with fold changes (such as the popular Log-ratio graphs) [1], are known to be unreliable because they do not properly address the statistical variability of the data. While various parametric methods and tests such as the two-sample t-test [2] have been applied to microarray data analysis, the strong parametric assumptions made in these methods, as well as their strong dependency on large sample sets, restrict the reliability of such techniques in microarray problems. Nonparametric statistical methods, including the Empirical Bayes (EB) method [3], significance analysis specialized for microarray data (such as SAM [4]) and the mixture model method (MMM) [5], have been specialized and applied to microarray data analysis. It is claimed and argued that the new extensions of the MMM are among the best methods for producing biologically meaningful results [5,6]. In this paper, without ignoring the potential applicability of other nonparametric methods in microarray processing applications, and because of the claims made in [6], we have restricted the comparison of our method to the MMM-based methods only. The major disadvantages of the above-mentioned methods, especially the MMM, include the lack of repeatability of the results under different runs of the algorithm and the sensitivity of the algorithm to parameter initialization.
A reliable microarray analysis method must be reproducible and applicable to different data sets under different experimental conditions. More specifically, an accurate microarray processing method is expected to pinpoint, in all runs and with a relatively high degree of accuracy and robustness, the genes with elevated expression levels that are related to the experimental condition. The main objective of this paper is to design and test an extension of the MMM whose results are reproducible, more biologically meaningful, and significantly less sensitive to the model's parameters.

The paper is organized as follows. In the Algorithms section, a review of the MMM and its recent extensions, Mod2MMM, is given together with a detailed description of the proposed method. In the Results and Discussion section, the K5M algorithm is first applied to the well-studied Leukaemia data set [7], which is often treated as a benchmark problem for comparing different algorithms. Once the desirable performance of the proposed algorithm is verified against the Leukaemia data set, the algorithm is applied to a new data set [8-15] that deals with the pathogenesis of Hypophosphatemia, a common X-linked metabolic bone disorder in human and mouse. Finally, the Conclusion section closes the paper.

Algorithms

MMM & its recent extensions

We start this section with a brief review of the existing MMM-based techniques. Consider Yij as the expression level of gene i in array j (i = 1, ..., n; j = 1, ..., j1, j1 + 1, ..., j1 + j2), where the first j1 and last j2 arrays are obtained under the two conditions. A general statistical model for the resulting data is:

Yij = ai + bixj + εij     (1)

where xj = 1 for 1 ≤ j ≤ j1 and xj = 0 for j1 + 1 ≤ j ≤ j1 + j2. In addition, εij is a random error with mean 0. From the above formulation, it can be seen that ai + bi is the mean expression level of gene i in the first condition, and ai is the mean expression level of gene i in the second condition.
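Model (1) can be illustrated with a minimal simulation sketch. The function names, the Gaussian noise, and the numeric values below are assumptions made for illustration only; the source requires only that εij has mean 0.

```python
import random


def simulate_gene(a_i, b_i, j1, j2, noise_sd=0.5, seed=0):
    """Draw Y_ij = a_i + b_i * x_j + eps_ij for a single gene i.

    Gaussian noise is an assumption of this sketch; model (1) only
    states that the error has mean 0.
    """
    rng = random.Random(seed)
    # x_j = 1 for the first j1 arrays (condition 1), 0 for the last j2.
    x = [1] * j1 + [0] * j2
    return [a_i + b_i * x_j + rng.gauss(0.0, noise_sd) for x_j in x]


def condition_means(y, j1):
    """Sample means of one gene under the first and second condition."""
    first, second = y[:j1], y[j1:]
    return sum(first) / len(first), sum(second) / len(second)


# Noise-free example: the two means recover a_i + b_i and a_i.
y = simulate_gene(a_i=2.0, b_i=1.5, j1=4, j2=4, noise_sd=0.0)
mean_first, mean_second = condition_means(y, j1=4)
```

With the noise switched off, `mean_first` equals ai + bi and `mean_second` equals ai, which is the interpretation given in the text; bi is therefore the differential-expression effect that the method tries to detect.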
The method requires that both j1 and j2, the numbers of arrays under each experimental condition, be even. The t-statistic-type scores (2) and (3) are calculated on the pre-processed data. Here, ai is a random permutation of a column vector that contains j1/2 1's and j1/2 -1's, and bi is the analogous vector containing j2/2 1's and j2/2 -1's. Since the data are not assumed to be normally distributed, the distribution functions f0 and f are estimated as in (4) and (5), respectively. Both the null distribution f0 and the distribution f are estimated directly in a nonparametric model for gene expression data. Here, φ(z; μi, Vi) denotes the normal density function with mean μi and variance Vi, and the mixing proportions πi define the linear combination of the normal basis functions. We use Φg0 to represent all unknown parameters {(πi, μi, Vi): i = 1, ..., g0} in a g0-component mixture model.

For a given number of normal basis functions g0, the parameters can be estimated by the EM algorithm, which maximizes the log-likelihood function (6) to obtain the maximum likelihood estimate of Φg0. Within K iterations, the EM algorithm is expected to reach a local maximum for all unknown parameters. It is recommended to run the EM algorithm several times with various random starting parameters and to choose as the final estimate the one resulting in the largest log-likelihood [6]. As mentioned above, using random starting points makes the results of the MMM unstable and prevents reproducibility. More specifically, in each run the MMM algorithm may report a different number of expressed genes, which is not desirable in biological studies. This issue will be addressed in our proposed method.

After finding the optimized Φg0 for different values of g0, the algorithm selects the sub-optimal g0 corresponding to the first local minimum of BIC or AIC [16], in which vg0 is the number of independent parameters in Φg0. Then, the algorithm uses the resulting g0 as the number of normal functions to fit f0.
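The fitting and model-selection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard forms log L = Σ log f(z) for the log-likelihood and BIC = -2 log L + v log(n) for the criterion, counts v = 3g - 1 free parameters, and replaces the random starting points with a quantile-based deterministic start purely to show how a fixed initialization makes runs repeatable. The function names and the cap `max_g` are hypothetical.

```python
import math


def normal_pdf(z, mu, var):
    """Normal density phi(z; mu, var)."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(z, g, n_iter=200, var_floor=1e-6):
    """EM for a g-component 1-D normal mixture; returns (pi, mu, var, loglik)."""
    n = len(z)
    zs = sorted(z)
    # Deterministic start (assumption of this sketch): means at evenly
    # spaced quantiles, a common variance, equal mixing proportions.
    mu = [zs[int((k + 0.5) * n / g)] for k in range(g)]
    mean = sum(z) / n
    var = [max(sum((v - mean) ** 2 for v in z) / n, var_floor)] * g
    pi = [1.0 / g] * g
    for _ in range(n_iter):
        # E-step: posterior weight of each component for each score.
        resp = []
        for v in z:
            w = [pi[k] * normal_pdf(v, mu[k], var[k]) for k in range(g)]
            total = sum(w) or 1e-300
            resp.append([wk / total for wk in w])
        # M-step: re-estimate proportions, means and variances.
        for k in range(g):
            nk = sum(r[k] for r in resp) or 1e-300
            pi[k] = nk / n
            mu[k] = sum(r[k] * v for r, v in zip(resp, z)) / nk
            spread = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, z)) / nk
            var[k] = max(spread, var_floor)
    loglik = sum(
        math.log(sum(pi[k] * normal_pdf(v, mu[k], var[k]) for k in range(g)) or 1e-300)
        for v in z
    )
    return pi, mu, var, loglik


def bic(loglik, g, n):
    """BIC with v = 3g - 1: g means, g variances, g - 1 free proportions."""
    return -2.0 * loglik + (3 * g - 1) * math.log(n)


def select_components(z, max_g=5):
    """Return g at the first local minimum of BIC, together with its fit."""
    best_g, best_fit = 1, fit_mixture(z, 1)
    best_bic = bic(best_fit[3], 1, len(z))
    for g in range(2, max_g + 1):
        fit = fit_mixture(z, g)
        score = bic(fit[3], g, len(z))
        if score >= best_bic:  # BIC stopped decreasing
            break
        best_g, best_fit, best_bic = g, fit, score
    return best_g, best_fit
```

Calling `select_components(scores)` on a list of scores returns the selected number of components and the fitted `(pi, mu, var, loglik)`. Because nothing in the sketch is random, repeated calls on the same scores give identical output; whether a quantile-based start is a good choice for real microarray scores is not something this sketch establishes.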
The same procedure is performed to select the sub-optimal number of normal functions used to fit f. As mentioned above, with the number of normal functions fixed, the parameters of the functions f and f0 are iteratively updated for a number of iterations. When the iterations are terminated, the likelihood ratio is computed from the final estimates of f0 and f:

LR(Z) = f0(Z) / f(Z)     (9)

A bisection method [17] with a Bonferroni adjustment is used to determine the cut-off points [18] for decision-making. This means that for a threshold value s, if LR(Z)
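The likelihood ratio (9) and a bisection search for the threshold can be sketched as below. Several points are assumptions of this sketch rather than statements from the source: the decision rule is taken to be "flag a gene when LR(Z) < s" (the source sentence is cut off before stating it), the Bonferroni adjustment is taken as a per-gene level of alpha divided by the number of genes, the null probability is approximated by a midpoint rule on a fixed interval, and all function names and numeric values are hypothetical.

```python
import math


def normal_pdf(z, mu, var):
    """Normal density phi(z; mu, var)."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(z, params):
    """Mixture density; params is a list of (pi, mu, var) triples."""
    return sum(p * normal_pdf(z, m, v) for p, m, v in params)


def likelihood_ratio(z, f0_params, f_params):
    """LR(z) = f0(z) / f(z), as in equation (9)."""
    denom = mixture_pdf(z, f_params)
    return mixture_pdf(z, f0_params) / denom if denom > 0.0 else float("inf")


def null_mass_below(s, f0_params, f_params, lo=-10.0, hi=10.0, steps=4000):
    """Approximate P(LR(Z) < s) under f0 with a midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * h
        if likelihood_ratio(z, f0_params, f_params) < s:
            total += mixture_pdf(z, f0_params) * h
    return total


def find_threshold(alpha, n_genes, f0_params, f_params, s_max=1.0, tol=1e-6):
    """Bisection for the largest s whose null mass stays below alpha / n_genes."""
    target = alpha / n_genes  # Bonferroni-adjusted per-gene level (assumed form)
    s_lo, s_hi = 0.0, s_max
    while s_hi - s_lo > tol:
        mid = 0.5 * (s_lo + s_hi)
        # The null mass is non-decreasing in s, so bisection applies.
        if null_mass_below(mid, f0_params, f_params) < target:
            s_lo = mid
        else:
            s_hi = mid
    return s_lo


# Illustrative mixtures: a standard normal null and a heavier-tailed f.
f0_example = [(1.0, 0.0, 1.0)]
f_example = [(0.9, 0.0, 1.0), (0.1, 0.0, 9.0)]
s = find_threshold(0.05, 1000, f0_example, f_example)
```

Under these assumptions, a score far in the tail (where f exceeds f0 by a wide margin) has a likelihood ratio below `s`, while a score near zero does not. The search returns the lower end of the final bracket so that the approximated null mass never exceeds the target level.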
