PMC:1524773 / 46186-46878 JSONTXT

On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences Abstract Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n3 + out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n2 log n + out). Furthermore, we present another algorithm which is much easier to implement, and though theoretically slower than the first one, is still running in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets. 1 Introduction Viewing the subject sequences aligned to a query sequence that result from a BLAST-based [1] comparison, in many cases one can identify groups of sequences clustering around different subintervals of the query sequence. Often, the decision by eye to which cluster a certain sequence belongs, is strongly depending on the order in which the sequences are presented. Fig. 1a) shows a schematic sketch of aligned sequences in random order. The sequences seem to form two, or maybe three groups. The same sequences in Fig. 1b) are ordered according to how many positions they have in common and colors indicate those sequences that share a large part of their sequence. The algorithm finds three different clusters of sequences. A cluster in this sense can be defined as a subset of aligned sequences that have approximately the same length and that are aligned to approximately the same subinterval of the query sequence. As we have argued above, the human eye may be fooled by the ordering of the presented sequences and humans are limited in the number of sequences they can group into clusters, and thus the automatic and objective computation of a clustering is an important task. Figure 1 The left side shows a set of 11 sequences that are aligned to a common reference sequence, but do not have any special ordering that makes it easy to distinguish clusters of similar sequences. On the right a reasonable ordering is given where colors indicate groups of sequences that have a 60% overlap for each pair of sequences. Note that the given coloring might not be the only reasonable. Figure 2 The semi-squares can be ordered according to the height of their hypotenuse at a given x-position. One kind of clustering can be obtained by using BlastClust from the NCBI-BLAST package [2]. It is a clustering tool designed especially to cluster protein or DNA sequences based on pairwise matches returned by the BLAST algorithm. It uses BLAST scores to assign statistical significance, matches pairs that reach that level of significance, then constructs clusters using a simple, greedy, single-linkage clustering method. However, empirically it is known that BlastClust is very limited with respect to the size of the input set. Furthermore, in many instances, the application of this heuristic is also limited by the fact that one unique clustering does not exist and that in fact many clusterings are possible and reasonable for different applications: for example, when annotating putative domains of a protein sequence, it is important to find common regions that are shared by large groups of sequences, while for the computation of phylogenetic trees it is important to find groups that share the largest possible part of their sequences. In any case, under a given constraint, the groups should be as large as possible. Both constraints can be formulated as follows: Given two sequences, they are said to tolerate each other if both overlap at least an absolute amount t, the tolerance, or a relative tolerance c% of the length of the other sequence. This leads naturally to the family of max-tolerance graphs [3], where vertices represent intervals, and two vertices are connected by an edge if the corresponding sequences tolerate each other. The question for maximal clusters now reduces to finding maximal cliques in the max-tolerance graph. Interestingly, the history of a special max-tolerance graph with t = 1 is deeply intertwined with the study of the DNA: Benzer was the first to raise the question, whether the sub-elements of the DNA are arranged linearly, or not [4]. To answer this question, sequences of DNA are represented as vertices, where two vertices are connected by an edge if the corresponding sequences are overlapping on at least one position. The resulting graph is an interval graph [5,6]. It could be shown that a graph generated in this way can only be derived from fragments of a linear sequence if it has a certain property. Since the overlapping graph derived from DNA fragments showed this linearity property, this was yet another confirmation that the genome of organisms consists of long, linear DNA molecules. Searching for maximal cliques in interval graphs can be done efficiently in time linear to the size of the graph. However, for computing meaningful biological clusters, an absolute tolerance of t = 1 is not appropriate. For this purpose, it is necessary to increase the tolerance to a reasonable value and in addition to use the relative tolerance c of the length of the sequences. Thus, for c = 0.5 two sequences will tolerate each other if both have approximately the same length and share half of their sequence, or if one is not longer than twice the length of the other and overlaps the second completely, and anything in between. We call this kind of max-tolerance graph with a fixed relative tolerance c for all sequences a c-max-tolerance graph. The question of finding maximal groups of sequences that pairwise tolerate each other now reduces to the question of finding all maximal cliques, in the respective c-max-tolerance graph. For general graphs, the computation of all maximal cliques is an NP-hard problem, since it can be reduced to the maximum clique problem which is again a classical NP-complete graph problem [7]. A well-known and popular branch-and-bound based software to compute maximum and maximal cliques in general graphs is Cliquer [8]. Cliquer's runtime is exponential. It will turn out that the maximal clique problem for the graphs considered here is not NP-hard. Our aim here is now to show that, based on results of general max-tolerance graphs, an efficient algorithm can be applied to find all maximal cliques in a c-max-tolerance graph. In addition, we show how this set of all maximally cliques can then be used to find reasonable clusterings for biological sequences aligned to each other. Focusing on the biological application of the idea of c-max-tolerance graphs, our approach is two-fold: First, we will give a practical algorithm to report all maximal cliques in polynomial time. This algorithm is conceptually simple, thus ensuring that it is easy to implement it correctly and numerically stable. We will also show in the experimental section that its implementation is very fast for typical data sizes in biological applications. Second, based on the general principle of the simple algorithm, we will present a more involved, output-sensitive algorithm in O(n2 log n + out), where out denotes the size of the output. This is a considerable improvement compared with the corresponding bound for general max-tolerance graphs obtained in [9]. With this two-fold approach we follow the ideas of the field of algorithm design [10], that stresses the point that an easy algorithm should always be preferred for an implementation unless the data is so big that the more efficient algorithm has to be used. The result of the algorithm can then be used to cluster the data in an automated and objective fashion. We are facing two problems here: First, the number of maximal cliques is strongly depending on the chosen c-value. Second, for many instances most of the sequences are elements of more than one maximal clique. Thus, we will give a heuristic based on the set of all maximal cliques that is able to find a reasonable clustering. We illustrate the algorithm using artificial as well as three different biological examples. The first of the biological data sets is an example of a repeat analysis, where a transposon from the nematode Pristionchus pacificus is compared with reads of BAC clone data covering about half of the genome of that organism. The second and third example are proteins that are compared with the UniProt database [11]. The article is structured as follows: in Section 2 we give a short review on results that are essential for the understanding of the two algorithms together with required definitions. Section 3 presents the first, simpler algorithm, followed by the description of the algorithm with a new, lower runtime complexity for c-max-tolerance graphs in Section 4. Section 5 gives some results on experiments and Section 6 summarizes and discusses our results. 2 Mathematical background In this section we will define the formal basis and repeat some earlier derived results that are essential for the understanding of the rest of the article. Let S denote a set of n intervals Ii = [xi, yi], 1 ≤ i ≤ n, xi, yi ∈ ℕ, xi c · |Ii|. In a general max-tolerance graph each interval Ii is associated with a tolerance 0 ≤ ti ≤ |Ii| and here, two intervals are said to tolerate each other, if |Ii ⋂ Ij| ≥ max{ti, tj}. Given a graph G = (V, E), a clique is a subset C ⊆ V of pairwise-connected vertices. A clique C is called a maximal clique if there is no other clique C' ⊃ C. A clique is a maximum clique if its cardinality is of largest possible size. The following lemma has been derived for general max-tolerance graphs and applies also to c-max-tolerance graphs [9]: Lemma 1 The number of maximal cliques in (c-)max-tolerance graphs is O(n3). This result has been derived by a fundamental equivalence, for which we will need the following definitions: Let S = {[x1, y1], [x2, y2],..., [xn, yn]} be a set of intervals and let 0 It is an interesting observation that the runtime seems to follow the number of cliques in the set, in contrast to the runtime of the Cliquer that is mainly determined by how interwoven the cliques are. The runtime of our algorithm thus indicates that for these random data sets the runtime is bound by out. Biological data sets are very different in structure from random data sets, which is caused by the clusteredness of their sequences. Since genes are often composites of different functional domains, aligned sequences have a high probability to center around those domains and build groups of 'similar' sequences with respect to the position and length of their alignment to the query. For the test of our algorithm on biological sequences three BLAST-derived data sets were chosen as follows: From a bacterial artificial chromosome (BAC) and fosmid library of the genome of Pristionchus pacificus [13] that contains 78690 sequences, we reconstructed a genetic element that we characterized as a transposon from the maT family (S. Steigele, unpublished). Experimental hybridisation assays proved that this transposon is highly repetitive in the Pristionchus pacificus genome (A. Breit, unpublished). A BLASTN search (E-value < 10-6) of this transposon against all sequences of the BAC and fosmid library was conducted, resulting in 126 hits below the E-value threshold. We refer to this data set by PpmaT. Furthermore, two multi-domain proteins, both containing repeated domains, were subject to BLASTP searches (E-value < 10-6) versus the UniProt database [11]. The protein SH3 and multiple ankyrin repeat domains protein 1, short Shank1, contains 7 ANK repeats, 1 PDZ domain, 1 SAM domain and 1 SH3 domain. The BLAST search resulted in 86 hits below the E-value threshold. We refer to this data set by Shank1. The protein Cadherin EGF LAG seven-pass G-type receptor 2, short Celsr2, is a receptor that may have an important role in cell-cell signaling during nervous system formation. It belongs to the G-protein coupled receptor 2 family, LN-TM7 subfamily, and contains 4 cadherin domains, 7 EGF-like domains, 1 GPS domain, 1 laminin EGF-like domain, 2 laminin G-like domains and 1 transmembrane receptor domain. The BLAST search resulted in 249 hits below the E-value threshold. We refer to this data set by Celsr2. Fig. 10 shows the times needed to compute all maximal cliques for the given data sets for all c between 0.05 and 0.95 in steps of 0.05. Generally we observe in all three cases that the runtime within one data set is almost independent of the constraint value c, differing only by a factor of about 2 across each curve. For all cases runtime was well below one second, while Cliquer needed for some of the c-values many hours and even longer. We have therefore again not presented Cliquer's runtime results for these data sets. The computation of the full curves needed less than one second. Figure 10 All maximal cliques in three different biological data sets were computed. The time needed by our algorithm is given in ms in dependence of the chosen constraint value c. For the same data sets, Fig. 11a) shows the number of cliques found at these constraint values and Fig 11b) shows the average number of cliques a sequence is member of. In comparison to the artificially derived data sets, runtime in these cases seems to be more dependent on the input size than on the number of maximal cliques. Furthermore, there is also no obvious dependency between the chosen c-value and the number of maximal cliques. We also observe for quite a large range of constraint values c on average a sequence is contained in up to 30% of all maximal cliques. Even for large c-values (c > 0.6) on average a sequence is contained in up to 10% of the maximal cliques. Figure 11 a) The number of maximal cliques found and b) the average number of maximal cliques a sequence is contained in are highly sensitive on the chosen c-value as can be seen for three biological data sets. Computing a partition of S from maximal cliques Since the membership of a sequence is almost never unambiguous, a heuristic is needed in order to partition the set of sequences into meaningful clusters. The set of all maximal cliques for a given constraint c guarantees us that every two members of a maximal clique will at least share a part of their sequence of length (|Ii| * c). Let Ic(Ci) denote that part of the query sequence that is overlapped by all sequences in clique Ci, i.e., Ic(Ci) = [max {xj|Ij ∈ Ci}, min {yj|Ij ∈ Ci}]       (6) It is an important observation that this common part of the sequence does not have to overlap each sequence in Ci by c* 100%, as is shown in Fig. 12. If now a sequence X is a member of many maximal cliques, it should be assigned to that clique Ci whose shared sequence part Ic(Ci) overlaps X best, i.e. is maximal. Figure 12 With a constraint of 0.6 intervals 1, 2, 3 build a clique. The common overlap of all 3 intervals is [4-10], which is only 3/7 of interval 3. The other problem is that the number and structure of maximal cliques is of course depending on the chosen constraint value c. There are two opposing situations that make it hard to find the best c-value: The first is sketched in Fig 13a) where three different clusters of sequences are visible, but they will only be found if c is larger than 0.6. In Fig. 13b) there is only one cluster, but if c is larger than 0.6 then two maximal cliques will emerge. Figure 13 a) There are three distinguishable clusters: {1, 2, 3}, {4, 5, 6}, {7, 8, 9} but as long as c is below 4/11 there is only one maximal clique embracing all intervals. Not until c is larger than 3/5, the wanted three sets will emerge as maximal cliques. b) Here, the situation is different. Intuitively, only one set embracing all intervals should emerge. But this homogenous set will be split into two maximal cliques when c is greater than 3/5: one contains interval 5 and one interval 6. Thus, we have two contradicting goals: To get a higher specificity of the maximal cliques it is good to use a large c-value. But if it is too large it will split 'good' cliques into too many parts. Our aim was to find a heuristic that is based on a c-value that is small enough as not to split 'good' clusters but can find subsets of the maximal cliques that are more specific than the whole clique. For this, the following heuristic is used: 1. Add all maximal cliques to a list L. 2. For all cliques C in L compute the shared overlap interval IC, i.e., the maximal interval that is overlapped by all of the intervals in the clique. 3. For every interval I that is in more than one clique, compute the overlap between IC and I and multiply this value with the number of intervals in this clique. 4. Assign every interval I to that clique where the product of overlap with IC and number of intervals in that clique is maximal. This basic variant yields good results but it can be improved by the following procedure: For every pair A, B of cliques compute the intersection A ⋂ B. If there are many intervals in this intersection set, e.g., |A ⋂ B| > 3, and only a few intervals that are not, the intersection of the two cliques represents intervals that are very similar to each other. These intervals would be in one clique if the c-value would have been more restrictive, i.e., larger. But due to the low c-value, there are other intervals to the left and right of the intervals in the intersection set that interact with the intervals in A ⋂ B but not with each other. As discussed above, a low c-value has the advantage of keeping good clusters together, and thus c should not be too high in the beginning. However, the effect of a larger c-value can be mimicked by adding those intersection sets to the list L in the first step of the above described mechanism and to allow the intervals to assign them to any set in L be it a clique or an intersection set. The given heuristic guarantees that every two intervals in a set tolerate each other, because it is built on the maximal cliques of tolerating intervals. The decision of membership has been built on a composite of the overlap with the shared sequence and the size of the set to reduce the number of sets globally. The size of a set is of course volatile because not all of its designated members will finally be assigned to it. But in each case the overlap with the shared sequence of any set will never decrease if another member is deleted from the set, so the heuristic is stable in this respect. We have already indicated at the beginning of this article (s. Fig. 1), that many partitions of a set S of sequences seem to be reasonable by regarding only their begin and ends. In order to facilitate a manual revision of the calculated partition of each set of sequences these sets are displayed in the order of the diagonal of their smallest member, i.e., the sum of its x-position and its length. The results of this heuristic applied to the three biological data sets are given in Fig. 14a–c. The colorization helps to identify the members of one clique. When comparing the cliques with the annotation of the input query sequence individual domains as well as combination of several domains were retrieved and corresponded almost perfectly to the annotated domains of the input protein sequence. Figure 14 The partitions found by the heuristic described in the text, for a constraint value of c = 0.65. a) Shankl, b) Celsr2, c) PpmaT. Members of the same clique are colored equally. 6 Discussion BLAST is a very common algorithm in bioinformatics used for annotation of an unknown sequence or for sampling sequences from a database sharing homologous regions with the input sequence. One goal of BLAST is to identify regions of conserved DNA or protein sequence. Another goal is to detect groups of sequences that are highly similar with respect to the position and the length of conservation with the query sequence. For a given query sequence, BLAST returns the hits ordered statistical significance of each hit. When viewed, this ordering does not necessarily represent a clustering with respect to length of the hits and position in the query sequence. Thus, extraction of possible clusterings of the sequences is not always straightforward. One goal of this article was therefore to provide a method that allows an automatic clustering of sequences returned from a BLAST run, such that the user can decide whether to maximize the lengths of the common region of the sequences within a cluster or whether to maximize the size of the clusters. In this article we have shown that finding groups from BLAST reports can be reduced to computing maximal cliques in so-called c-max-tolerance graphs. We have presented two algorithms that allow the enumeration of all maximal cliques in c-max-tolerance graphs in polynomial time. The first algorithm has an easy implementation to compute all maximal cliques in c-max-tolerance graphs, but it has a theoretical runtime of O(n6). The second algorithm is technically very elaborate, and shows that enumeration of all maximal cliques in c-max-tolerance graphs can be computed in a runtime of O(n2 log n + out), where out denotes the size of the output. The algorithm improves a result found for general max-tolerance graphs, for which the computation of all maximal cliques has a runtime of O(n3) [9]. Though our simple algorithm in theory has a higher runtime than the more technical one, when applied to real data, runtimes were nonetheless impressive. Comparison with the popular application Cliquer [8], another exact algorithm to compute all maximal cliques in general graphs, showed that for many instances of our test cases, Cliquer could not report results within hours, while our algorithm finished most computations within milliseconds. The maximal cliques represent a clustering of the sequences detected with BLAST. Thus with the help of this clique-finding algorithm, clusters can now be quickly and automatically identified even from very large BLAST reports. The result of the clustering can then be further processed for example for phylogenetic reconstructions of the sequences in the clusters. At this point one might discuss the biological usefulness to compute all maximal cliques. As we have seen from the biological data sets used for our experiments, it is very common to have largely interlinked sequences, i.e., where one unique clustering and/or maximal clique does not exist. This is for example a common feature in the case of multidomain proteins, where the query protein contains say k domains, and BLAST reported subject proteins will show a variety of combinations of these k domains. A goal of applying the maximal cliques search algorithm is to cluster individual domains as well as all possible combinations of domains. The examples in figure 14 illustrate the result of these clusterings. Here one also clearly sees that other reasonable partitions are possible. Another application for this algorithm is the prediction of gene boundaries within a genomic sequence. For this, mapping of cDNA/ESTs to a genome sequence is used. The goal of this strategy is to predict the gene boundaries and/or its exon/intron structure using cDNA/ESTs. Applying our simple maximal clique algorithm to a large test case (with more than 2800 intervals), for which a human genomic locus was compared against the EST database from the NCBI, yielded results within a few seconds with a perfect clustering of EST loci (data not shown). The use of the c-value is an important parameter for biological applications. Again we think of the situation of reconstructing phylogenetic trees from sequences reported from a BLAST run. Here different c parameters can be used to either optimize the length of the overlaps or the size of the maximal cliques. A model where the factor c is not a tolerance relative to the length but a fixed constant independent of the length might be of some biological relevance. An extension towards such a model or even a mixture seems plausible. Examples from biological data sets indicated that most of the sequences are contained in several cliques. Again, it is not straightforward to deduce a disjoint clustering of the sequences from the maximal cliques. We have described a heuristic that computes a partitioning of sequences from the set of all maximal cliques in a c-max-tolerance graph. This heuristic when applied to protein sequences containing several domains partitions the set of all maximal cliques into clusters, where the sequences within the clusters reflect the domain structure of the input sequence. After cliques and clusters have been computed postprocessing such as visualization of the groups or integration into pipelines that need clustered sequence data is now very easy and automizable. And therefore we hope that the algorithms described here have further impact for the bioinformatics community. Availability The implementation of the first algorithm is available on request from the first author. Authors' contributions KAL implemented the simple version of the maximal cliques algorithm and performed the test runs. MK and KAL developed the efficient algorithm, proved its polynomial complexity and wrote most of the manuscript. SS retrieved the biological data and participated in the discussions. KN developed the project idea and participated in the manuscript preparation. All authors read and approved the final manuscript. Acknowledgements The second author would like to thank Markus Geyer for helpful discussions about the output-sensitive algorithm. This work has been supported by the DFG Grant Ka812-13/l 'Evolutionstheorien für natürliche und technische Netzwerke'. SS and KN were supported by the Deutsche Forschungsgemeinschaft, AZ BIZ 1/1-3.

Document structure show

Annnotations

blinded