1 Background
Measurement of gene expression using DNA microarrays [1,2] have revolutionized biological and medical research. Since gene expression plays an important role in cell differentiation, development, and pathological behavior, computational analysis of DNA microarray data has the potential to assign functions to newly-discovered genes, unravel the structure of biological pathways, and assist in the development of new medicines. Biclustering has emerged as a powerful algorithmic tool for analyzing gene expression data. A bicluster in a gene expression data set is a subset of genes and a subset of conditions with the property that the selected genes are co-expressed in the selected conditions; these genes may not have any coherent patterns of expression in the other conditions in the data set. Biclusters have a number of advantages over clusters computed by more traditional algorithms such as k-means and hierarchical clustering [3]. Since a bicluster includes only a subset of genes and samples, it models condition-specific patterns of co-expression. Traditional clusters may miss such patterns since they operate in the space spanned by all the conditions. Further, many biclustering algorithms allow a gene or a sample to participate in multiple biclusters, reflecting the possibility that a gene product may be a member of multiple pathways.
A number of different methods have emerged for computing biclusters in gene expression data [4-16]; two papers survey these techniques [17,18]. These algorithms use different strategies to compute biclusters such as exhaustive enumeration [16,19,20], iterated improvement [5,6], repeated random sampling [11], and expectation maximization [12]. An issue all these algorithms deal with is trying to avoid outputting two or more biclusters with nearly the same set of samples and/or genes. A common approach is to remove a bicluster from the output if it shares a large fraction of genes and/or samples (based on a user-defined threshold) with an already computed bicluster. Another approach replaces the expression values in a bicluster with random values in order to prevent that bicluster from being computed again. In spite of these measures, biclustering algorithms may compute tens, hundreds, or even thousands of biclusters with varying degrees of overlap.
Organising, manipulating, and querying the potentially large number of biclusters computed by these algorithms is a data mining task in itself – one that has not been systematically addressed. In this paper, we develop a novel algorithm for laying out biclusters in a manner that visually reveals overlaps between them. We lay out the biclusters in a two-dimensional matrix whose rows (respectively, columns) are rows (respectively, columns) of the original dataset. We display each bicluster as a contiguous submatrix in the layout. We allow the layout to have repeated rows and/or columns from the original matrix, but we seek a layout of the smallest size. In addition, we develop a web-based search interface that allows the user to query the results for genes and samples of interest and visualise the layout of the biclusters that match the search criteria.
The layout algorithm is general enough to be applied to biclusters computed in real-valued, binary, or categorical data. For instance, the combination of biclustering algorithms and our layout algorithm can be used to analyze measurements of the concentrations of other types of molecules, including proteins and metabolites. We demonstrate our approach on two types of data. First, we compute layouts for biclusters extracted from leukaemia microarray data by the xMotif biclustering algorithm [11,21]. Second, we analyze protein-DNA binding data in S. cerevisiae and demonstrate how biclustering in combination with the layout algorithm can visually demonstrate differences in the transcriptional regulatory network that is activated in different growth conditions.
Figure 1 displays a layout computed by our algorithm on a toy binary matrix. Figure 1(a) displays a dataset in which rows represent dates and columns represent weather conditions in Blacksburg, VA, USA. A cell has a one (the cell is drawn shaded) if the weather condition corresponding to the cell's column (e.g., "Rainy" or "> 75°F") is true on the date corresponding to the cell's row. In this dataset, we define a bicluster to be a subset of rows and a subset of columns with the property that the submatrix defined by these rows and columns only contains ones. We computed all the closed biclusters in this binary matrix, i.e., biclusters with the property that every row (respectively, every column) not in the bicluster contains a zero in at least one column (respectively, one row) in the bicluster. In other words, it is not possible to add a row or a column to such a bicluster without introducing a zero. Figure 1(b) displays the layout computed by our algorithm of the seven biclusters in this dataset.
Figure 1  An example of a bicluster layout for weather data in Blacksburg, VA. Figure 1(a): a dataset in which rows represent dates and columns represent weather conditions in Blacksburg. Figure 1(b): the layout computed by our algorithm of the seven biclusters in this dataset. The bicluster layout problem, which we formally define in Section 3.1, is very similar to the hypergraph superstring problem studied by Batzoglou and Istrail in the context of physical mapping of genomes. Batzoglou and Istrail prove that the hypergraph superstring problem is MAX-SNP Hard, i.e., it is computationally intractable to obtain a bicluster layout whose size is smaller than a constant times the optimal size. In this work, we present a heuristic that minimizes the size of the layout well in practice. In the special case when there is a solution involving no repeated rows or columns, the algorithm computes the layout of smallest size. Our algorithm runs in O(mn2 + n2 log n) where n is the number of biclusters and m is the number of rows and columns in all the biclusters; the running time of the algorithm is independent of the size of the original dataset. We lay out the rows and columns of the biclusters independently. Our algorithm to lay out the columns is similar to a bottom-up hierarchical clustering of the column sets of the biclusters. At each stage, we merge two biclusters if the submatrix induced by them in the original matrix has the "consecutive ones property" (see Section 3.2). Finally we generate the two-dimensional layout by combining the row and column layouts.