4 Experimental results

We present results for three types of data. We first evaluated our method on synthetic datasets. Next, we considered a binary data set encoding results of ChIP-on-chip experiments in S. cerevisiae. Finally, we used our method on gene expression data to distinguish differences between two types of leukaemia.

4.1 Synthetic data

We created synthetic datasets with different numbers of rows and columns. For each dataset, we generated biclusters by sampling subsets of rows and columns. For this experiment, we randomly generated the number of rows and columns and identifiers for the rows and columns; we did not need to generate values for the cells of the matrices. For each set of biclusters, we recorded the time required to run our layout algorithm and the number of rows and columns in the computed layout. For each layout, we estimated the efficiency of the layout as the ratio of the size of the layout to the size of the dataset. Lower values of efficiency are better than higher values, since they indicate that the algorithm is able to exploit overlaps between biclusters. For each choice of number of rows in the dataset, number of columns in the dataset, and number of biclusters, we averaged the results for 100 runs. Tables 1 and 2 display our results. Efficiency values may be less than one, e.g., when some rows or columns in the dataset do not belong to any bicluster.

Table 1 Execution times (in seconds) for the layout algorithm on synthetic matrices

#biclusters    #rows + #columns in the dataset
               10        30        50        70        90
20             0.168     0.328     0.462     0.52      0.532
40             1.23      2.514     3.046     3.574     4.008
60             4.074     7.992     11.238    11.71     12.81
80             9.484     19.586    25.546    29.652    29.446
100            17.982    37.966    48.418    50.916    56.112

Table 2 Efficiency values for the layout algorithm on synthetic matrices.

#biclusters    #rows + #columns in the dataset
               10        30        50        70        90
20             0.184     0.842     1.316     1.254     1.428
40             0.304     1.16      1.632     2.04      2.074
60             0.398     1.496     2.262     2.26      2.508
80             0.512     1.65      2.358     2.726     2.698
100            0.48      1.808     2.582     2.686     2.996

4.2 Transcriptional regulation in S. cerevisiae

To demonstrate the ability of our visualization algorithm to highlight differences between biclusters in similar datasets, we analyzed datasets of transcriptional regulation in two experimental conditions in S. cerevisiae [30,31]. Each dataset is a binary matrix whose columns represent transcription factors and whose rows represent genes in S. cerevisiae. A matrix entry contains a one if a ChIP-on-chip experiment indicates that the transcription factor binds to the promoter of the gene with a p-value of at most 0.001. An important problem that arises in the analysis of these data is determining whether a set of genes is collectively regulated by a set of transcription factors and whether this combinatorial regulation changes when the cell is exposed to stress. Although ChIP-on-chip data are noisy and significant effort may be needed to clean them up, the analysis we present next demonstrates that a combination of biclustering and our layout algorithm yields biologically useful results.

The two protein-DNA datasets we study correspond to the growth of S. cerevisiae cells in rich medium [31] and to growth under exposure to rapamycin [30], a condition that mimics nutrient starvation. We restricted our attention to transcription factors studied in both papers. We ran our implementation of the Apriori algorithm [32], which computes closed biclusters (as defined in Section 1), on both datasets, applied our layout algorithm to biclusters with at least two genes and at least two transcription factors, and obtained the layout in Figure 4(a). Biclusters obtained from the data under growth in rich medium are shown as blue boxes and rapamycin-induced biclusters are shown as red boxes.
A cell in the figure is dark grey (respectively, light grey) if the transcription factor binds to the gene's promoter in both conditions (respectively, in one condition). The image strikingly demonstrates that under exposure to rapamycin, the transcriptional regulatory network activated in the cell is very different from the network activated under growth in rich medium. The rich medium data contains only four biclusters involving these transcription factors while the rapamycin data contains 38 biclusters. We conclude that very few genes are co-regulated by the same set of transcription factors in both conditions.

Figure 4 Bicluster layouts. Visualizations of the layouts computed by our algorithm. Since the layout may contain repeated rows and columns, a bicluster may appear at multiple locations in the layout. We highlight only one occurrence of each bicluster. The layout on the left displays biclusters representing combinatorial control of transcription in S. cerevisiae. The layout on the right displays biclusters in gene expression data for ALL and AML.

To illustrate the use of our web interface, we used it to search for biclusters that included the transcription factors RTG3 and GLN3. RTG3 is a transcription factor that forms a complex with RTG1 to activate the retrograde (RTG) and target of rapamycin (TOR) pathways [33,34]. GLN3 encodes a transcription factor that is phosphorylated and localised to the cytoplasm when the cell is grown in nitrogen-rich media. Rapamycin treatment can induce the dephosphorylation and subsequent activation of GLN3 [35]. Figure 5 displays the layout of all the biclusters containing these two transcription factors. We note that all but one of these biclusters also include either the transcription factor GAT1 or the transcription factor GCN4. GAT1 is a transcriptional activator of genes involved in nitrogen catabolite repression; the activity and localization of these genes are regulated by nitrogen limitation.
GCN4 is another transcriptional activator that is a master regulator of gene expression during amino acid starvation in S. cerevisiae and is activated in multiple stress responses [36]. Thus, it is not surprising that GAT1 and GCN4 co-regulate genes with GLN3 and RTG3. The set of nine genes targeted by GCN4, GLN3, and RTG3 is enriched in the Gene Ontology biological process "glutamine family amino acid biosynthesis" (p-value of 2 × 10⁻⁸, based on the hypergeometric distribution), indicating that this pathway may be activated by the three transcription factors upon rapamycin treatment.

Figure 5 Genes combinatorially controlled by GLN3 and RTG3. A layout of nine biclusters of genes combinatorially controlled by GLN3 and RTG3 under exposure to rapamycin.

4.3 Classification of leukaemias

Golub et al. [37] studied global expression patterns of 45 patients diagnosed with Acute Lymphoblastic Leukaemia (ALL) and 27 patients diagnosed with Acute Myeloid Leukaemia (AML). We ran the xMotif algorithm [11,21] to compute biclusters in this dataset. We ensured that each computed bicluster contains samples from at most one class. We selected four representative biclusters from the results to visualize. Figure 4(b) displays the layout. Each column corresponds to a sample; the two columns at the top with purple cells indicate the type of leukaemia. We map the expression values of each gene into a range from green to red, with green (respectively, red) corresponding to the smallest (respectively, largest) expression value of that gene. The biclusters outlined in black correspond to AML samples and those outlined in blue to ALL samples. This layout visually highlights similarities and differences between the biclusters found in samples for the same and for different types of leukaemia. We have used such biclusters as the basis for constructing a classifier that distinguishes between different diseases and tissues (Grothaus and Murali, in preparation).
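
The per-gene colour mapping used for Figure 4(b) can be illustrated with a minimal sketch. The text specifies only that each gene's smallest value maps to green and its largest to red; the linear interpolation in RGB space, the function name expression_to_rgb, and the handling of a gene with constant expression are assumptions made here for illustration and are not taken from our implementation.

```python
from typing import List, Tuple


def expression_to_rgb(values: List[float]) -> List[Tuple[int, int, int]]:
    """Map one gene's expression values to (R, G, B) colours.

    The gene's smallest value becomes pure green and its largest pure red.
    Intermediate values are interpolated linearly (an assumption; the text
    does not state the interpolation scheme).
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    colours = []
    for v in values:
        # t is 0 at the gene's minimum and 1 at its maximum.
        # A gene with constant expression is placed mid-scale (assumption).
        t = (v - lo) / span if span > 0 else 0.5
        red = int(255 * t)
        green = int(255 * (1 - t))
        colours.append((red, green, 0))
    return colours


if __name__ == "__main__":
    # Hypothetical expression values for a single gene across five samples.
    gene = [120.0, 450.0, 300.0, 980.0, 75.0]
    for value, colour in zip(gene, expression_to_rgb(gene)):
        print(value, colour)
```

Because the scale is computed separately for each gene, colours are comparable across samples within a row of the layout but not between rows.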