Gene Expression Pattern Analysis via Latent Variable Models Coupled with Topographic Clustering
We present a latent variable model-based approach
to the analysis of gene expression patterns, coupled with topographic clustering. Aspect model, a latent variable model for dyadic data, is applied to extract latent patterns underlying complex variations of gene expression levels. Then a topographic clustering is performed to find coherent groups of genes, based on the extracted latent patterns as well as individual gene expression behaviors. Applied to cell cycle- regulated genes of the yeast Saccharomyces cerevisiae, the proposed method could discover biologically meaningful patterns related with characteristic expression behavior in particular cell cycle phases. In addition, the display of the variation in the composition of these latent patterns on the cluster map provided more facilitated interpretation of the resulting cluster structure. From this, we argue that latent variable models, coupled with topographic clustering, are a promising tool for explorative analysis of gene expression data.
DNA microarrays from cDNA chips or oligonucleotide chips provide a global, parallel view on the expression patterns of hundreds or thousands of genes in a cell at a specific time, under a specific experimental conditions or processes (Baldi and Hatfield, 2002; Shamir and Sharan, 2002). They are among the most powerful and versatile tools for
functional genomics, and advances in technologies for these high-density arrays are enabling to produce enormous amount of expression data. For the efficient exploration and analysis of these massive microarray data, appropriate analysis tools are needed which are different from conventional methods for the traditional one-gene-in- one-experiment paradigm.
Clustering is a key step to analyzing gene expression data, where genes are systematically grouped together according to their similarity in expression patterns. Among the earliest and extensively used algorithms are hierarchical clustering (Eisen et al., 1998; Spellman et al., 1998; Scherf et al., 2000) and k-means (Herwig et al., 1999; Tavazoie et al., 1999). Eisen et al. (Eisen et al., 1998) analyzed the gene expression data of budding yeast Saccharomyces cerevisiae using the hierarchical clustering algorithm and showed that genes of known similar functions were grouped together with the clustering. Tavazoie et al. (Tavazoie et al., 1999) used the k-means algorithm to identify transcriptional regulatory sub-networks in yeast. Despite their successful applications in many other biological tasks as well as these, however, the algorithms suffer from some shortcomings, such as lack of robustness, incompetence of proper handling of global structure (hierarchical clustering) and non-structure in resulting clusters (k-means clustering) (Tamayo et al., 1999; Fowlkes eta/., 2002).
In this paper, we present an effective approach for gene expression pattern analysis, based on latent variable models and topographic clustering. Latent variable models are a powerful tool for discovering latent structure underlying data objects. They are well suited to extract correlational patterns of variables and provide a compact, useful representation of data. Also, a natural similarity metric between data can be derived from latent variable models (Jaakkola and Haussler, 1999; Tsuda et al., 2002). Topographic clustering algorithms such as SOM (self organizing maps) (Kohonen, 1997) are an attractive method for providing not only good clustering performance but also easy visualization and interpretation of the results (Tamayo et a/., 1999; Toronen et al., 1999).
We use an aspect model that is a latent variable model for co-occurrence data to extract significant patterns underlying gene expression data. Using these patterns and original expression data, two cluster maps are produced from a topographic clustering. In the first map, each cluster is represented by the average expression levels of genes
in the cluster. The second map summarizes the clusters in terms of its characteristics in convex combinations of latent patterns.
We applied our method to the expression data of cell cycle- regulated genes of the yeast Saccharomyces cerevisiae. Our results show that the latent patterns discovered by the aspect model are well reflecting the characteristic expression behaviors across various cell cycle phases. Two independent cluster maps provided by topographic clustering are shown to be useful in their ability to provide an intuitive, integrated understanding of complex variations of hundreds of genes during the progress of cell cycle.
Our approach to gene expression analysis consists of three steps: (1) identification of meaningful patterns inherent in gene expression data using a latent variable model; (2) construction of a similarity matrix containing similarity values of all gene pairs; (3) clustering of genes based on the similarity matrix and a topographic clustering algorithm. Fig. 1 summarizes the overall procedure of our approach to gene expression data analysis.
As the first step to the analysis of gene expression data, we use an aspect model (Hofmann, 2001) which is a latent variable model for co-occurrence data. Latent variable models are a powerful approach to probabilistic modeling where a set of observed variables are supplemented with additional latent variables (Bishop, 1999). Besides their efficiency for density estimation, the latent variable models
are very useful in discovering latent structure underlying set of data.
Let D be an Nx M matrix where M is the number of data and N is the number of attributes and each element is indexed by (a,, x), (1< i<M, 1 <j<N). In the aspect model, the joint probability of P(aj. x.) is decomposed by introducing a latent class variable zCZ={z/,z2, ■■■,»},
This indicates that the conditional probability or sample specific attribute distribution P(aj I x)is approximated by a
convex combination of K aspects P(a } I <x). The goal of the
modeling by the aspect model is to estimate class-specific attribute distribution P(aj I and data specific class distribution P(& I x.) from data set.
Formally, the aspect model is learned by maximizing the log data-likelihood L=log p(D):
where r(aj, x<) is the value of a, in the data x. The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is used to estimate the parameters maximizing L. In
the E-step, the posterior probabilities P(zt I a Jt x.) are
computed and in the M-step, the parameters P(aj I &) and Pfa I x) are estimated. These two steps are iteratively alternated until convergence.
One of the most important things in clustering algorithms is how to define (dis)similarity measure between data, and the choice is quite subjective. In kernel methods like support vector machines (Cristianini and Shawe-Taylor, 2000; Scholkopf and Smola, 2001), this corresponds to selecting an appropriate kernel function.
Given two data vectors x and x in the input space X, a valid kernel function fcfx, xj can be represented as the inner product between two feature vectors in a (high dimensional) feature space F, that is,
where <p is a fixed mapping from X to F, that is, 0 :X-^F. The mapping 0 can be viewed as a preprocessing step to draw out meaningful features from data, and the technique of kernel function provides an efficient way of inner-product calculation in the feature space F (Cristianini and Shawe- Taylor, 2000; Graepel eta/., 1998).
In this paper, we utilize the kernel function presented in (Hofmann, 2000), where an intrinsic kernel, named the Fisher kernel, was derived from the aspect model. The Fisher kernel (Jaakkola and Haussler, 1999) is an approach to constructing kernels from generative probabilistic models and defines the similarity between data based on information-geometric principles using the gradient space of the generative model.
Let Z(x) be the expected log-probability of a sample x under the aspect model parameterized by 0?} where 0i={P(zt) and 02={P(at\zk)}. The kernel is derived by computing the Fisher scores ^l(xi) and the Fisher
information matrix / (0)=E{Vol (x)VFi ( x )}. With the square root transformation of parameters 0={0i,0 2 } and the
approximation of 1(0) as the identity matrix (Hofmann,
2000), the kernel from the aspect model is given by this equation.
Here, p(a, x,) in the Equation (6) is an empirical distribution estimated by ,Eventually, this similarity metric is the linear summation of ki and with equal weights: ki(xi, x) is the similarity in the representation on the
latent space. fo(x, x) calculates the inner product of the original two data x< and x>, where products of the corresponding attributes are weighted by the degree of overlap of respective posterior probabilities (the second line in Equation (6)) (Hofmann, 2000).
The kernel-based soft topographic mapping (STMK) (Graepel et al. 1998) is a topographic clustering algorithm based on kernel-based distance measures and principles from statistical physics. It can provide not only a stable and good clustering algorithm, but also a topographic map of the clustered data. By using a specific kernel function and the kernel trick, the algorithm can also perform clustering efficiently in a feature space related to the original data space in a nonlinear way, compared to the basic SOM. In the SMTK, clustering is defined in terms of an optimization problem. The cost function E to be optimized is given as
where M is the number of data and C is the number of clusters. The binary variable ma indicates whether the zth sample belongs to the Jth cluster, and ea is the error occurred by assigning the /th sample to the jth cluster. Given a mapping <f> from input space X to a feature space F as explained above, the partial assignment cost ea is defined as
where </> (x.) and a are a sample and a cluster center in a feature space F, respectively, fa is a neighborhood function between jth and /th clusters, and determines the coupling between the two clusters as in SOM.
The STMK algorithm provides an efficient procedure to find a good solution to the minimization of the cost function in Equation (7), based on deterministic annealing (Rose et al., 1990) and the EM algorithm. Starting with a random initialization of ea for all data and clusters, the algorithm iteratively updates the parameters using the EM algorithm with some temperature-annealing schedule. In the E-step, expectation values of ma, the probability that the ith sample is assigned to the jth cluster, is estimated for each pair of data and clusters. Then all the parameters related with the calculation of cluster centers in a feature space F are updated in the M-step.
We applied our method to the yeast Saccharomyces
cerevisiae cell cycle data (Spellman et al., 1998). This dataset (available at http://cellcycle-www.stanford.edu ) contains the time series of relative expression measurements of more than 6,000 genes from cell cultures synchronized by three independent methods. In the first set, the a mating factor pheromone was used to arrest cells in G1 and samples were taken in every 7 minutes throughout 140 minutes. In the second set, centrifugal elutriation was used to collect small G1 cells in every 30 minutes during 6.5 hours. In the third, cdc15 temperature sensitive mutant was used to arrest cells in late mitosis and samples were taken in every 10 minutes during 300 minutes. Additionally, samples from arrest of a cdc28 temperature-sensitive mutant (Cho et al., 1998) were included in the dataset. The analysis of these data sets enables researchers to identify cycles or waves of expressions that are meaningful in biological processes. Spellman and his colleagues (Spellman, 1998) have identified 800 genes whose expressions are cell cycle- regulated using their periodicity and correlation analysis methods. Among the 800 yeast genes, we selected 700 genes with at most 4 missing entries over 73 conditions: 18 time points over a arrest, 24 over cdc15 arrest, 17 over cdc28 arrest, and 14 over elutriation. For each gene, the missing values were filled by the average values of the either side of the time points, assuming the smooth variation of gene expression levels during the cell cycle. Finally, the dataset is represented by a 73 x 700 matrix where each gene is represented by a column vector
containing 73 attributes of gene expression levels.
Initially, we applied the aspect model to extract latent patterns inherent in the gene expression variation during the progress of cell cycle. In the aspect model, observed values are assumed to be count-valued. Prior to the analysis, therefore, all the expression values va (1< i< 700, 1 < j< 73) were first transformed to positive integer values by rfa, ai) = [100 x (1 /(1 +exp(-vji)))] 1 .
The yeast cell cycle data was analyzed with varying number of latent factors z={zi, zz, ■■■, z*}, from K=5 to K=10. The result with K=6 is shown in Fig. 2, where the expression patterns are plotted across the 18 time points from the ^-factor experiments. It can be seen that different periodicities of gene expression are captured in the six patterns. Compared with the cell cycle phase described in (Spellman etal., 1998), these patterns relatively well reflect peak expression behaviors in respective cell cycle phases: M/G1-phase (patterns 1 and 2), G1 -phase (pattern 3), G1/S-phase (pattern 4), S/G2-phase (pattern 5), and M- phase (pattern 6).
Additionally, we clustered the 700 genes using the learned aspect model in Fig. 2 and the STMK algorithm.
The kernel function in Equation (4) was used to calculate similarities between two genes. Fig. 3 shows the resulting cluster map with 36 clusters on the 6 x 6 grid. Each cluster is represented by the average expression pattern of genes in the cluster. Different periodicities in gene expression are shown across the clusters on the grid and the adjacent clusters have similar patterns.
Cluster structure is also presented in terms of latent patterns (Fig. 4). A gene x is mapped into the latent space produced by the aspect model and is represented by a vector of probabilities, yrfelUJ’MV-^k)),
Then cluster centers Ck (1 < k <36) in the latent space are expressed as linear combinations of y by
where a* is the contribution of the zth sample to the cluster k. The genes in a cluster are now distinguished from those in other clusters by the combination characteristics of the latent patterns. The cluster 1 of the top-left-most, for example, shows the highest probabilities in the pattern 3.
The third pattern in Fig. 2 characterizes genes of which the first peaks of their expression are observed at the measurement points around 21 minutes from the start of the «-factor experiments. This is a typical expression behavior of the genes in the G1 phase group identified in (Spellman et al., 1998), and all 45 genes {CDC45, CLB5, MSH2, MSH6, etc) in the cluster 1 are found in the G1 phase group. Nearby clusters of cluster 1 in Fig. 4 shows the similar combination patterns, and most of the genes in those clusters are found in the G1 phase group. Similar analysis is possible for clusters 6, 31, 36: patterns 4 and 5 are dominant in cluster 6, patterns 1 and 2 in cluster 31, and pattern 6 in cluster 36. Genes in those clusters are found in the phase groups of which typical expression characteristics are similar to the dominant patterns. 35 genes (HHF1, HTA1, HTA2, HTB2, etc) among 39 in the cluster 6, 21 (ACE2, CLB1, CLB2, M0B1, etc) genes among 24 in cluster 31, and all 35 genes (ASH1, CDC46, MFA2, SIC1, etc) in cluster 36 belong to S phase group, M/G1 phase group, and G2/M phase group, respectively. The constitution of all the clusters in terms of phase groups identified by Spellman etal. are described in Table 1.
A closer examination of Fig. 4 provides another
interesting interpretation. When we go from cluster 1 to cluster 6 along the top margin, for example, we can see that pattern 3 declines steadily and pattern 5 rises instead. This shows the distinction among genes in different clusters in terms of the peak time in their expression,
where the first peak time of the genes with higher probabilities of pattern 3 is earlier than that of the genes with the opposite combination. Referred to the G1 phase group ordered by their peak times (Spellman, 1998), genes in cluster 1 and cluster 2 are found in the earlier part than
those in cluster 3 and 4. Most of the genes in cluster 5 are located in the boundary of G1 phase group and S phase group, and most of those in cluster 6 are found in the S phase group. Similar tendencies are observed in subsequent routes: from cluster 6 to cluster 36 along the right margin, from cluster 36 to cluster 31 along the bottom margin (Table 1). In this way, a proper discovery and exploitation of the inherent structures in gene expression behaviors helps us for further understanding of expression time-series data.
We have presented an effective approach for gene expression pattern analysis, based on latent variable models and topographic clustering. Applied to the expression analysis of the cell cycle-regulated genes of the yeast Saccharomyces cerevisiae, various distinctive patterns underlying the data set were extracted by an efficient latent variable model, that is, the aspect model. The identified patterns well correspond to the typical expression behaviors of genes regulated in specific cell cycle phases. In topographical clustering, based on the STMK algorithm and a similarity measure derived from the aspect model, the discovered gene clusters are well reflecting their characteristic expression patterns during the progress of cell cycle.
In addition, our apprach has provided two informative maps on the cluster structure. The first map depicts each cluster by the expression variation of genes in the cluster, across the measured time points. In the second map, the clusters are depicted by the combination patterns of latent properties inherent in the data. A meta-level analysis and visualization based on these coupled maps enables a more principled interpretation of gene expression patterns. As future works, we are considering to incorporate another type of information for gene expression data analysis. The learning in latent variable models is basically unsupervised. In some cases, however, existing labeled data can be used during training to improve performance. For the yeast data, for example, more than 100 genes have been already identified as being cell cycle-regulated. Hierarchical latent variable models and semi-supervised learning in the model could be applied to utilizing this information. Additionally, we will study on the selection of appropriate number of latent factors of aspect models in the analysis of gene expression data. In probabilistic generative models like aspect models, this problem can be reduced to a model selection problem. Then, standard techniques from statistics, based on various criteria such as BIC and MDL, could be used in our works.
|