6 Discussion
BLAST is a very common algorithm in bioinformatics used for annotation of an unknown sequence or for sampling sequences from a database sharing homologous regions with the input sequence. One goal of BLAST is to identify regions of conserved DNA or protein sequence. Another goal is to detect groups of sequences that are highly similar with respect to the position and the length of conservation with the query sequence. For a given query sequence, BLAST returns the hits ordered statistical significance of each hit. When viewed, this ordering does not necessarily represent a clustering with respect to length of the hits and position in the query sequence. Thus, extraction of possible clusterings of the sequences is not always straightforward. One goal of this article was therefore to provide a method that allows an automatic clustering of sequences returned from a BLAST run, such that the user can decide whether to maximize the lengths of the common region of the sequences within a cluster or whether to maximize the size of the clusters.
In this article we have shown that finding groups from BLAST reports can be reduced to computing maximal cliques in so-called c-max-tolerance graphs. We have presented two algorithms that allow the enumeration of all maximal cliques in c-max-tolerance graphs in polynomial time. The first algorithm has an easy implementation to compute all maximal cliques in c-max-tolerance graphs, but it has a theoretical runtime of O(n6). The second algorithm is technically very elaborate, and shows that enumeration of all maximal cliques in c-max-tolerance graphs can be computed in a runtime of O(n2 log n + out), where out denotes the size of the output. The algorithm improves a result found for general max-tolerance graphs, for which the computation of all maximal cliques has a runtime of O(n3) [9].
Though our simple algorithm in theory has a higher runtime than the more technical one, when applied to real data, runtimes were nonetheless impressive. Comparison with the popular application Cliquer [8], another exact algorithm to compute all maximal cliques in general graphs, showed that for many instances of our test cases, Cliquer could not report results within hours, while our algorithm finished most computations within milliseconds.
The maximal cliques represent a clustering of the sequences detected with BLAST. Thus with the help of this clique-finding algorithm, clusters can now be quickly and automatically identified even from very large BLAST reports. The result of the clustering can then be further processed for example for phylogenetic reconstructions of the sequences in the clusters. At this point one might discuss the biological usefulness to compute all maximal cliques. As we have seen from the biological data sets used for our experiments, it is very common to have largely interlinked sequences, i.e., where one unique clustering and/or maximal clique does not exist. This is for example a common feature in the case of multidomain proteins, where the query protein contains say k domains, and BLAST reported subject proteins will show a variety of combinations of these k domains. A goal of applying the maximal cliques search algorithm is to cluster individual domains as well as all possible combinations of domains. The examples in figure 14 illustrate the result of these clusterings. Here one also clearly sees that other reasonable partitions are possible.
Another application for this algorithm is the prediction of gene boundaries within a genomic sequence. For this, mapping of cDNA/ESTs to a genome sequence is used. The goal of this strategy is to predict the gene boundaries and/or its exon/intron structure using cDNA/ESTs. Applying our simple maximal clique algorithm to a large test case (with more than 2800 intervals), for which a human genomic locus was compared against the EST database from the NCBI, yielded results within a few seconds with a perfect clustering of EST loci (data not shown).
The use of the c-value is an important parameter for biological applications. Again we think of the situation of reconstructing phylogenetic trees from sequences reported from a BLAST run. Here different c parameters can be used to either optimize the length of the overlaps or the size of the maximal cliques. A model where the factor c is not a tolerance relative to the length but a fixed constant independent of the length might be of some biological relevance. An extension towards such a model or even a mixture seems plausible.
Examples from biological data sets indicated that most of the sequences are contained in several cliques. Again, it is not straightforward to deduce a disjoint clustering of the sequences from the maximal cliques. We have described a heuristic that computes a partitioning of sequences from the set of all maximal cliques in a c-max-tolerance graph. This heuristic when applied to protein sequences containing several domains partitions the set of all maximal cliques into clusters, where the sequences within the clusters reflect the domain structure of the input sequence.
After cliques and clusters have been computed postprocessing such as visualization of the groups or integration into pipelines that need clustered sequence data is now very easy and automizable. And therefore we hope that the algorithms described here have further impact for the bioinformatics community.