PMC:1524773 / 43403-55698
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"16737529-12884007-1692872","span":{"begin":3689,"end":3691},"obj":"12884007"},{"id":"16737529-15608167-1692873","span":{"begin":4348,"end":4350},"obj":"15608167"}],"text":"5 Experiments\nThis section shows some results on using our implementation for finding all maximal cliques in artificial and biological data sets.\n\nRuntime on artificial and biological datasets\nIn this section we want to demonstrate how fast the implementation of our simple algorithm is. We compare the runtime to that of Cliquer [8]. For the test we use both artificial as well as biological data sets. The artificial datasets were generated in the following way: For a given interval of some maximal length ℓmax we computed all possible sub-intervals within this interval. From this set of possible sub-intervals, a given percentage number of sub-intervals was chosen uniformly at random for the further evaluations. Note that the number of possible sub-intervals in an interval with length ℓ is roughly ℓ2, and thus the number of maximal cliques is O(ℓ6). We chose two values for ℓmax, namely 20 and 100. Then, the overall runtime to calculate all maximal cliques was determined for all constraints within 0.05 ≤ c ≤ 0.95 in steps of 0.05. For Cliquer we could only use an interval of length ℓmax = 20 and a randomly chosen set of 30% of all possible sub-intervals. For other test sets Cliquer had runtimes longer than many hours or even days. For ℓmax = 20 and c = 0.05 Cliquer took nearly three hours to compute all maximal cliques, for c = 0.1 it took nearly one hour to compute all maximal cliques, for c = 0.15, c = 0.2, c = 0.25 it took approximately 8, 4, and 1 minutes, respectively. For all c ≥ 0.3 it took under a minute to compute all maximal cliques and it took below 0.01 seconds for all c ≥ 0.5. Our implementation needed no more than 50 ms for any of the c-values.\nA second experiment was made on a much larger data set where the interval had a length of ℓmax = 100. We used 10%, 20%, 30%, 40% and 50% of the total set of possible sub-intervals chosen uniformly at random. Again, Cliquer could not compute the maximal cliques for any of the data sets in any reasonable time. The results for the runtimes of our algorithm for these data sets are visualized in Fig. 9. Note that the data sets contained 1000, 2000, 3000, 4000 and 5000 intervals, respectively. We see that in all cases and for all chosen constraint values c, the runtimes were below 30 seconds. Even for the largest data set with 5000 sub-intervals the computation of the full statistics needed less than 4 minutes.\nFigure 9 a) The runtimes shown in this figure are based on artificial data sets. Here, the maximal length is 100 and from all possible intervals a random set is chosen, ranging from 10% to 50% of all possible intervals. The time needed to compute all maximal cliques for all constraints from 0.05 to 0.95 is given in ms. b) gives the number of maximal cliques found in the data sets. It is an interesting observation that the runtime seems to follow the number of cliques in the set, in contrast to the runtime of the Cliquer that is mainly determined by how interwoven the cliques are. The runtime of our algorithm thus indicates that for these random data sets the runtime is bound by out. Biological data sets are very different in structure from random data sets, which is caused by the clusteredness of their sequences. Since genes are often composites of different functional domains, aligned sequences have a high probability to center around those domains and build groups of 'similar' sequences with respect to the position and length of their alignment to the query.\nFor the test of our algorithm on biological sequences three BLAST-derived data sets were chosen as follows: From a bacterial artificial chromosome (BAC) and fosmid library of the genome of Pristionchus pacificus [13] that contains 78690 sequences, we reconstructed a genetic element that we characterized as a transposon from the maT family (S. Steigele, unpublished). Experimental hybridisation assays proved that this transposon is highly repetitive in the Pristionchus pacificus genome (A. Breit, unpublished). A BLASTN search (E-value \u003c 10-6) of this transposon against all sequences of the BAC and fosmid library was conducted, resulting in 126 hits below the E-value threshold. We refer to this data set by PpmaT. Furthermore, two multi-domain proteins, both containing repeated domains, were subject to BLASTP searches (E-value \u003c 10-6) versus the UniProt database [11]. The protein SH3 and multiple ankyrin repeat domains protein 1, short Shank1, contains 7 ANK repeats, 1 PDZ domain, 1 SAM domain and 1 SH3 domain. The BLAST search resulted in 86 hits below the E-value threshold. We refer to this data set by Shank1.\nThe protein Cadherin EGF LAG seven-pass G-type receptor 2, short Celsr2, is a receptor that may have an important role in cell-cell signaling during nervous system formation. It belongs to the G-protein coupled receptor 2 family, LN-TM7 subfamily, and contains 4 cadherin domains, 7 EGF-like domains, 1 GPS domain, 1 laminin EGF-like domain, 2 laminin G-like domains and 1 transmembrane receptor domain. The BLAST search resulted in 249 hits below the E-value threshold. We refer to this data set by Celsr2. Fig. 10 shows the times needed to compute all maximal cliques for the given data sets for all c between 0.05 and 0.95 in steps of 0.05. Generally we observe in all three cases that the runtime within one data set is almost independent of the constraint value c, differing only by a factor of about 2 across each curve. For all cases runtime was well below one second, while Cliquer needed for some of the c-values many hours and even longer. We have therefore again not presented Cliquer's runtime results for these data sets. The computation of the full curves needed less than one second.\nFigure 10 All maximal cliques in three different biological data sets were computed. The time needed by our algorithm is given in ms in dependence of the chosen constraint value c. For the same data sets, Fig. 11a) shows the number of cliques found at these constraint values and Fig 11b) shows the average number of cliques a sequence is member of. In comparison to the artificially derived data sets, runtime in these cases seems to be more dependent on the input size than on the number of maximal cliques. Furthermore, there is also no obvious dependency between the chosen c-value and the number of maximal cliques. We also observe for quite a large range of constraint values c on average a sequence is contained in up to 30% of all maximal cliques. Even for large c-values (c \u003e 0.6) on average a sequence is contained in up to 10% of the maximal cliques.\nFigure 11 a) The number of maximal cliques found and b) the average number of maximal cliques a sequence is contained in are highly sensitive on the chosen c-value as can be seen for three biological data sets.\n\nComputing a partition of S from maximal cliques\nSince the membership of a sequence is almost never unambiguous, a heuristic is needed in order to partition the set of sequences into meaningful clusters. The set of all maximal cliques for a given constraint c guarantees us that every two members of a maximal clique will at least share a part of their sequence of length (|Ii| * c). Let Ic(Ci) denote that part of the query sequence that is overlapped by all sequences in clique Ci, i.e.,\nIc(Ci) = [max {xj|Ij ∈ Ci}, min {yj|Ij ∈ Ci}] (6)\nIt is an important observation that this common part of the sequence does not have to overlap each sequence in Ci by c* 100%, as is shown in Fig. 12. If now a sequence X is a member of many maximal cliques, it should be assigned to that clique Ci whose shared sequence part Ic(Ci) overlaps X best, i.e. is maximal.\nFigure 12 With a constraint of 0.6 intervals 1, 2, 3 build a clique. The common overlap of all 3 intervals is [4-10], which is only 3/7 of interval 3. The other problem is that the number and structure of maximal cliques is of course depending on the chosen constraint value c. There are two opposing situations that make it hard to find the best c-value: The first is sketched in Fig 13a) where three different clusters of sequences are visible, but they will only be found if c is larger than 0.6. In Fig. 13b) there is only one cluster, but if c is larger than 0.6 then two maximal cliques will emerge.\nFigure 13 a) There are three distinguishable clusters: {1, 2, 3}, {4, 5, 6}, {7, 8, 9} but as long as c is below 4/11 there is only one maximal clique embracing all intervals. Not until c is larger than 3/5, the wanted three sets will emerge as maximal cliques. b) Here, the situation is different. Intuitively, only one set embracing all intervals should emerge. But this homogenous set will be split into two maximal cliques when c is greater than 3/5: one contains interval 5 and one interval 6. Thus, we have two contradicting goals: To get a higher specificity of the maximal cliques it is good to use a large c-value. But if it is too large it will split 'good' cliques into too many parts. Our aim was to find a heuristic that is based on a c-value that is small enough as not to split 'good' clusters but can find subsets of the maximal cliques that are more specific than the whole clique.\nFor this, the following heuristic is used:\n1. Add all maximal cliques to a list L.\n2. For all cliques C in L compute the shared overlap interval IC, i.e., the maximal interval that is overlapped by all of the intervals in the clique.\n3. For every interval I that is in more than one clique, compute the overlap between IC and I and multiply this value with the number of intervals in this clique.\n4. Assign every interval I to that clique where the product of overlap with IC and number of intervals in that clique is maximal.\nThis basic variant yields good results but it can be improved by the following procedure: For every pair A, B of cliques compute the intersection A ⋂ B. If there are many intervals in this intersection set, e.g., |A ⋂ B| \u003e 3, and only a few intervals that are not, the intersection of the two cliques represents intervals that are very similar to each other. These intervals would be in one clique if the c-value would have been more restrictive, i.e., larger. But due to the low c-value, there are other intervals to the left and right of the intervals in the intersection set that interact with the intervals in A ⋂ B but not with each other. As discussed above, a low c-value has the advantage of keeping good clusters together, and thus c should not be too high in the beginning. However, the effect of a larger c-value can be mimicked by adding those intersection sets to the list L in the first step of the above described mechanism and to allow the intervals to assign them to any set in L be it a clique or an intersection set.\nThe given heuristic guarantees that every two intervals in a set tolerate each other, because it is built on the maximal cliques of tolerating intervals. The decision of membership has been built on a composite of the overlap with the shared sequence and the size of the set to reduce the number of sets globally. The size of a set is of course volatile because not all of its designated members will finally be assigned to it. But in each case the overlap with the shared sequence of any set will never decrease if another member is deleted from the set, so the heuristic is stable in this respect.\nWe have already indicated at the beginning of this article (s. Fig. 1), that many partitions of a set S of sequences seem to be reasonable by regarding only their begin and ends. In order to facilitate a manual revision of the calculated partition of each set of sequences these sets are displayed in the order of the diagonal of their smallest member, i.e., the sum of its x-position and its length.\nThe results of this heuristic applied to the three biological data sets are given in Fig. 14a–c. The colorization helps to identify the members of one clique. When comparing the cliques with the annotation of the input query sequence individual domains as well as combination of several domains were retrieved and corresponded almost perfectly to the annotated domains of the input protein sequence.\nFigure 14 The partitions found by the heuristic described in the text, for a constraint value of c = 0.65. a) Shankl, b) Celsr2, c) PpmaT. Members of the same clique are colored equally."}