PMC:1524773 / 43550-50178
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"16737529-12884007-1692872","span":{"begin":3542,"end":3544},"obj":"12884007"},{"id":"16737529-15608167-1692873","span":{"begin":4201,"end":4203},"obj":"15608167"}],"text":"Runtime on artificial and biological datasets\nIn this section we want to demonstrate how fast the implementation of our simple algorithm is. We compare the runtime to that of Cliquer [8]. For the test we use both artificial as well as biological data sets. The artificial datasets were generated in the following way: For a given interval of some maximal length ℓmax we computed all possible sub-intervals within this interval. From this set of possible sub-intervals, a given percentage number of sub-intervals was chosen uniformly at random for the further evaluations. Note that the number of possible sub-intervals in an interval with length ℓ is roughly ℓ2, and thus the number of maximal cliques is O(ℓ6). We chose two values for ℓmax, namely 20 and 100. Then, the overall runtime to calculate all maximal cliques was determined for all constraints within 0.05 ≤ c ≤ 0.95 in steps of 0.05. For Cliquer we could only use an interval of length ℓmax = 20 and a randomly chosen set of 30% of all possible sub-intervals. For other test sets Cliquer had runtimes longer than many hours or even days. For ℓmax = 20 and c = 0.05 Cliquer took nearly three hours to compute all maximal cliques, for c = 0.1 it took nearly one hour to compute all maximal cliques, for c = 0.15, c = 0.2, c = 0.25 it took approximately 8, 4, and 1 minutes, respectively. For all c ≥ 0.3 it took under a minute to compute all maximal cliques and it took below 0.01 seconds for all c ≥ 0.5. Our implementation needed no more than 50 ms for any of the c-values.\nA second experiment was made on a much larger data set where the interval had a length of ℓmax = 100. We used 10%, 20%, 30%, 40% and 50% of the total set of possible sub-intervals chosen uniformly at random. Again, Cliquer could not compute the maximal cliques for any of the data sets in any reasonable time. The results for the runtimes of our algorithm for these data sets are visualized in Fig. 9. Note that the data sets contained 1000, 2000, 3000, 4000 and 5000 intervals, respectively. We see that in all cases and for all chosen constraint values c, the runtimes were below 30 seconds. Even for the largest data set with 5000 sub-intervals the computation of the full statistics needed less than 4 minutes.\nFigure 9 a) The runtimes shown in this figure are based on artificial data sets. Here, the maximal length is 100 and from all possible intervals a random set is chosen, ranging from 10% to 50% of all possible intervals. The time needed to compute all maximal cliques for all constraints from 0.05 to 0.95 is given in ms. b) gives the number of maximal cliques found in the data sets. It is an interesting observation that the runtime seems to follow the number of cliques in the set, in contrast to the runtime of the Cliquer that is mainly determined by how interwoven the cliques are. The runtime of our algorithm thus indicates that for these random data sets the runtime is bound by out. Biological data sets are very different in structure from random data sets, which is caused by the clusteredness of their sequences. Since genes are often composites of different functional domains, aligned sequences have a high probability to center around those domains and build groups of 'similar' sequences with respect to the position and length of their alignment to the query.\nFor the test of our algorithm on biological sequences three BLAST-derived data sets were chosen as follows: From a bacterial artificial chromosome (BAC) and fosmid library of the genome of Pristionchus pacificus [13] that contains 78690 sequences, we reconstructed a genetic element that we characterized as a transposon from the maT family (S. Steigele, unpublished). Experimental hybridisation assays proved that this transposon is highly repetitive in the Pristionchus pacificus genome (A. Breit, unpublished). A BLASTN search (E-value \u003c 10-6) of this transposon against all sequences of the BAC and fosmid library was conducted, resulting in 126 hits below the E-value threshold. We refer to this data set by PpmaT. Furthermore, two multi-domain proteins, both containing repeated domains, were subject to BLASTP searches (E-value \u003c 10-6) versus the UniProt database [11]. The protein SH3 and multiple ankyrin repeat domains protein 1, short Shank1, contains 7 ANK repeats, 1 PDZ domain, 1 SAM domain and 1 SH3 domain. The BLAST search resulted in 86 hits below the E-value threshold. We refer to this data set by Shank1.\nThe protein Cadherin EGF LAG seven-pass G-type receptor 2, short Celsr2, is a receptor that may have an important role in cell-cell signaling during nervous system formation. It belongs to the G-protein coupled receptor 2 family, LN-TM7 subfamily, and contains 4 cadherin domains, 7 EGF-like domains, 1 GPS domain, 1 laminin EGF-like domain, 2 laminin G-like domains and 1 transmembrane receptor domain. The BLAST search resulted in 249 hits below the E-value threshold. We refer to this data set by Celsr2. Fig. 10 shows the times needed to compute all maximal cliques for the given data sets for all c between 0.05 and 0.95 in steps of 0.05. Generally we observe in all three cases that the runtime within one data set is almost independent of the constraint value c, differing only by a factor of about 2 across each curve. For all cases runtime was well below one second, while Cliquer needed for some of the c-values many hours and even longer. We have therefore again not presented Cliquer's runtime results for these data sets. The computation of the full curves needed less than one second.\nFigure 10 All maximal cliques in three different biological data sets were computed. The time needed by our algorithm is given in ms in dependence of the chosen constraint value c. For the same data sets, Fig. 11a) shows the number of cliques found at these constraint values and Fig 11b) shows the average number of cliques a sequence is member of. In comparison to the artificially derived data sets, runtime in these cases seems to be more dependent on the input size than on the number of maximal cliques. Furthermore, there is also no obvious dependency between the chosen c-value and the number of maximal cliques. We also observe for quite a large range of constraint values c on average a sequence is contained in up to 30% of all maximal cliques. Even for large c-values (c \u003e 0.6) on average a sequence is contained in up to 10% of the maximal cliques.\nFigure 11 a) The number of maximal cliques found and b) the average number of maximal cliques a sequence is contained in are highly sensitive on the chosen c-value as can be seen for three biological data sets."}