PMC:1524773 / 1629-9836
Annnotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1524773","sourcedb":"PMC","sourceid":"1524773","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1524773","text":"1 Introduction\nViewing the subject sequences aligned to a query sequence that result from a BLAST-based [1] comparison, in many cases one can identify groups of sequences clustering around different subintervals of the query sequence. Often, the decision by eye to which cluster a certain sequence belongs, is strongly depending on the order in which the sequences are presented. Fig. 1a) shows a schematic sketch of aligned sequences in random order. The sequences seem to form two, or maybe three groups. The same sequences in Fig. 1b) are ordered according to how many positions they have in common and colors indicate those sequences that share a large part of their sequence. The algorithm finds three different clusters of sequences. A cluster in this sense can be defined as a subset of aligned sequences that have approximately the same length and that are aligned to approximately the same subinterval of the query sequence. As we have argued above, the human eye may be fooled by the ordering of the presented sequences and humans are limited in the number of sequences they can group into clusters, and thus the automatic and objective computation of a clustering is an important task.\nFigure 1 The left side shows a set of 11 sequences that are aligned to a common reference sequence, but do not have any special ordering that makes it easy to distinguish clusters of similar sequences. On the right a reasonable ordering is given where colors indicate groups of sequences that have a 60% overlap for each pair of sequences. Note that the given coloring might not be the only reasonable.\nFigure 2 The semi-squares can be ordered according to the height of their hypotenuse at a given x-position. One kind of clustering can be obtained by using BlastClust from the NCBI-BLAST package [2]. It is a clustering tool designed especially to cluster protein or DNA sequences based on pairwise matches returned by the BLAST algorithm. It uses BLAST scores to assign statistical significance, matches pairs that reach that level of significance, then constructs clusters using a simple, greedy, single-linkage clustering method. However, empirically it is known that BlastClust is very limited with respect to the size of the input set. Furthermore, in many instances, the application of this heuristic is also limited by the fact that one unique clustering does not exist and that in fact many clusterings are possible and reasonable for different applications: for example, when annotating putative domains of a protein sequence, it is important to find common regions that are shared by large groups of sequences, while for the computation of phylogenetic trees it is important to find groups that share the largest possible part of their sequences.\nIn any case, under a given constraint, the groups should be as large as possible. Both constraints can be formulated as follows: Given two sequences, they are said to tolerate each other if both overlap at least an absolute amount t, the tolerance, or a relative tolerance c% of the length of the other sequence. This leads naturally to the family of max-tolerance graphs [3], where vertices represent intervals, and two vertices are connected by an edge if the corresponding sequences tolerate each other. The question for maximal clusters now reduces to finding maximal cliques in the max-tolerance graph.\nInterestingly, the history of a special max-tolerance graph with t = 1 is deeply intertwined with the study of the DNA: Benzer was the first to raise the question, whether the sub-elements of the DNA are arranged linearly, or not [4]. To answer this question, sequences of DNA are represented as vertices, where two vertices are connected by an edge if the corresponding sequences are overlapping on at least one position. The resulting graph is an interval graph [5,6]. It could be shown that a graph generated in this way can only be derived from fragments of a linear sequence if it has a certain property. Since the overlapping graph derived from DNA fragments showed this linearity property, this was yet another confirmation that the genome of organisms consists of long, linear DNA molecules. Searching for maximal cliques in interval graphs can be done efficiently in time linear to the size of the graph. However, for computing meaningful biological clusters, an absolute tolerance of t = 1 is not appropriate. For this purpose, it is necessary to increase the tolerance to a reasonable value and in addition to use the relative tolerance c of the length of the sequences. Thus, for c = 0.5 two sequences will tolerate each other if both have approximately the same length and share half of their sequence, or if one is not longer than twice the length of the other and overlaps the second completely, and anything in between. We call this kind of max-tolerance graph with a fixed relative tolerance c for all sequences a c-max-tolerance graph. The question of finding maximal groups of sequences that pairwise tolerate each other now reduces to the question of finding all maximal cliques, in the respective c-max-tolerance graph.\nFor general graphs, the computation of all maximal cliques is an NP-hard problem, since it can be reduced to the maximum clique problem which is again a classical NP-complete graph problem [7]. A well-known and popular branch-and-bound based software to compute maximum and maximal cliques in general graphs is Cliquer [8]. Cliquer's runtime is exponential. It will turn out that the maximal clique problem for the graphs considered here is not NP-hard.\nOur aim here is now to show that, based on results of general max-tolerance graphs, an efficient algorithm can be applied to find all maximal cliques in a c-max-tolerance graph. In addition, we show how this set of all maximally cliques can then be used to find reasonable clusterings for biological sequences aligned to each other.\nFocusing on the biological application of the idea of c-max-tolerance graphs, our approach is two-fold: First, we will give a practical algorithm to report all maximal cliques in polynomial time. This algorithm is conceptually simple, thus ensuring that it is easy to implement it correctly and numerically stable. We will also show in the experimental section that its implementation is very fast for typical data sizes in biological applications. Second, based on the general principle of the simple algorithm, we will present a more involved, output-sensitive algorithm in O(n2 log n + out), where out denotes the size of the output. This is a considerable improvement compared with the corresponding bound for general max-tolerance graphs obtained in [9].\nWith this two-fold approach we follow the ideas of the field of algorithm design [10], that stresses the point that an easy algorithm should always be preferred for an implementation unless the data is so big that the more efficient algorithm has to be used.\nThe result of the algorithm can then be used to cluster the data in an automated and objective fashion. We are facing two problems here: First, the number of maximal cliques is strongly depending on the chosen c-value. Second, for many instances most of the sequences are elements of more than one maximal clique. Thus, we will give a heuristic based on the set of all maximal cliques that is able to find a reasonable clustering. We illustrate the algorithm using artificial as well as three different biological examples. The first of the biological data sets is an example of a repeat analysis, where a transposon from the nematode Pristionchus pacificus is compared with reads of BAC clone data covering about half of the genome of that organism. The second and third example are proteins that are compared with the UniProt database [11]. The article is structured as follows: in Section 2 we give a short review on results that are essential for the understanding of the two algorithms together with required definitions. Section 3 presents the first, simpler algorithm, followed by the description of the algorithm with a new, lower runtime complexity for c-max-tolerance graphs in Section 4. Section 5 gives some results on experiments and Section 6 summarizes and discusses our results.","divisions":[{"label":"title","span":{"begin":0,"end":14}},{"label":"p","span":{"begin":15,"end":1196}},{"label":"figure","span":{"begin":1197,"end":1600}},{"label":"label","span":{"begin":1197,"end":1205}},{"label":"caption","span":{"begin":1207,"end":1600}},{"label":"p","span":{"begin":1207,"end":1600}},{"label":"figure","span":{"begin":1601,"end":1709}},{"label":"label","span":{"begin":1601,"end":1609}},{"label":"caption","span":{"begin":1611,"end":1709}},{"label":"p","span":{"begin":1611,"end":1709}},{"label":"p","span":{"begin":1710,"end":2757}},{"label":"p","span":{"begin":2758,"end":3365}},{"label":"p","span":{"begin":3366,"end":5106}},{"label":"p","span":{"begin":5107,"end":5560}},{"label":"p","span":{"begin":5561,"end":5893}},{"label":"p","span":{"begin":5894,"end":6653}},{"label":"p","span":{"begin":6654,"end":6912}}],"tracks":[{"project":"2_test","denotations":[{"id":"16737529-2231712-1692869","span":{"begin":105,"end":106},"obj":"2231712"},{"id":"16737529-16590553-1692870","span":{"begin":3597,"end":3598},"obj":"16590553"},{"id":"16737529-15608167-1692871","span":{"begin":7751,"end":7753},"obj":"15608167"}],"attributes":[{"subj":"16737529-2231712-1692869","pred":"source","obj":"2_test"},{"subj":"16737529-16590553-1692870","pred":"source","obj":"2_test"},{"subj":"16737529-15608167-1692871","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#93e0ec","default":true}]}]}}