6 Experimental Results Experiments were performed on both synthetic data and real data. Two different methods were used in the global phase: random start and random projection. The main purpose of this paper is not to demonstrate that our algorithm can outperform the existing motif finding algorithms. Rather, the main work here focuses on improving the results that are obtained from other efficient algorithms. We have chosen to demonstrate the performance of our algorithm on the results obtained from the random projection method which is a powerful global method that has outperformed other traditional motif finding approaches like MEME, Gibbs sampling, WINNOWER, SP-STAR, etc. [1]. Since the comparison was already published, we mainly focus on the performance improvements of our algorithm as compared to the random projection algorithm. For the random start experiment, a total of N random numbers between 1 and (t - l + 1) corresponding to initial set of alignments are generated. We then proceeded to evaluate our Exit-point methodology from these alignments. 6.1 Synthetic Datasets The synthetic datasets were generated by implanting some motif instances into t = 20 sequences each of length ti = 600. Let m correspond to one full random projection + EM cycle. We have set m = 1 to demonstrate the efficiency of our approach. We compared the performance coefficient (PC) which gives a measure of the average performance of our implementation compared to that of Random Projection. The PC is given by : P C = | K ∩ P | | K ∪ P |       ( 9 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqWGdbWqcqGH9aqpdaWcaaqaamaaemaabaGaem4saSKaeyykICSaemiuaafacaGLhWUaayjcSdaabaWaaqWaaeaacqWGlbWscqGHQicYcqWGqbauaiaawEa7caGLiWoaaaGaaCzcaiaaxMaadaqadaqaaiabiMda5aGaayjkaiaawMcaaaaa@41D9@ where K is the set of the residue positions of the planted motif instances, and P is the corresponding set of positions predicted by the algorithm. Table 3 gives an overview of the performance of our method compared to the random projection algorithm on the (l, d) motif problem for different l and d values. Table 3 Improvements in the Performance Coefficient. Motif (l, d) PC obtained using Random Projection PC obtained using Exit-point method (11,2) 20 20 (15,4) 14.875 17 (20,6) 12.667 18 The results of performance coefficient with m = 1 on synthetically generated sequences. The IC scores are not normalized and the perfect score is 20 since there are 20 sequences. Our results show that by branching out and discovering multiple local optimal solutions, higher m values are not needed. A higher m value corresponds to more computational time because projecting the l-mers into k-sized buckets is a time consuming task. Using our approach, we can replace the need for randomly projecting l-mers repeatedly in an effort to converge to a global optimum by deterministically and systematically searching the solution space modeled by our dynamical system and improving the quality of the existing solutions. We can see that for higher length motifs, the improvements are more significant. Fig. 4 shows the Tier-1 solutions obtained from a given consensus pattern. Since the exit points are being used instead of saddle points, our method might sometimes find the same local optimal solution obtained before. As seen from the figure, the Tier-1 solutions can differ from the original pattern by more than just one nucleotide position. Also, the function value at the exit points is much higher than the original value. As opposed to stochastic processes like mutations in genetic algorithms, our approach reduces the stochastic nature and obtains the nearby local optimal solutions systematically. Fig. 5 shows the performance of the Exit-point approach on synthetic data for different (l, d) motifs. The average scores of the ten best solutions obtained from random starts and their corresponding improvements in Tier-1 and Tier-2 are reported. One can see that the improvements become more prominent as the length of the motif is increased. Table 4 shows the best and worst of these top ten random starts along with the consensus pattern and the alignment scores. Figure 5 The average scores with the corresponding first tier and second tier improvements on synthetic data using the random starts with Exit-point approach with different (l, d) motifs. Table 4 Improvements in the Information Content Scores. (l, d) Initial Pattern Score First Tier Pattern Score Second Tier Pattern Score (11,2) AACGGTCGCAG 125.1 CCCGGTCGCTG 147.1 CCCGGGAGCTG 153.3 (11,2) ATACCAGTTAC 145.7 ATACCAGTTTC 151.3 ATACCAGGGTC 153.6 (13,3) CTACGGTCGTCTT 142.6 CCACGGTTGTCTC 157.8 CCTCGGGTTTGTC 158.7 (13,3) GACGCTAGGGGGT 158.3 GAGGCTGGGCAGT 161.7 GACCTTGGGTATT 165.8 (15,4) CCGAAAAGAGTCCGA 147.5 CCGCAATGACTGGGT 169.1 CCGAAAGGACTGCGT 176.2 (15,4) TGGGTGATGCCTATG 164.6 TGGGTGATGCCTATG 166.7 TGAGAGATGCCTATG 170.4 (17,5) TTGTAGCAAAGGCTAAA 143.3 CAGTAGCAAAGACTACC 173.3 CAGTAGCAAAGACTTCC 175.8 (17,5) ATCGCGAAAGGTTGTGG 174.1 ATCGCGAAAGGATGTGG 176.7 ATTGCGAAAGAATGTGG 178.3 (20,6) CTGGTGATTGAGATCATCAT 165.9 CAGATGGTTGAGATCACCTT 186.9 CATTTAGCTGAGTTCACCTT 194.9 (20,6) GGTCACTTAGTGGCGCCATG 216.3 GGTCACTTAGTGGCGCCATG 218.8 CGTCACTTAGTCGCGCCATG 219.7 The consensus patterns and their corresponding scores of the original local optimal solution obtained from multiple random starts on the synthetic data. The best first tier and second tier optimal patterns and their corresponding scores are also reported. With a few modifications, more experiments were conducted using the Random Projection method. The Random Projection method will eliminate non-promising regions in the search space and gives a number of promising sets of initial patterns. EM refinement is applied to only the promising initial patterns. Due to the robustness of the results, the Exit-point method is employed only on the top five local optima. The Exit-point method is again repeated on the top scoring first tier solutions to arrive at the second tier solutions. Fig. 6 shows the average alignment scores of the best random projection alignments and their corresponding improvements in tier-1 and tier-2 are reported. In general, the improvement in the first tier solutions are more significant than the improvements in the second tier solutions. Figure 6 The average scores with the corresponding first tier and second tier improvements on synthetic data using the Random Projection with Exit-point approach with different (l, d) motifs. 6.2 Real Datasets Table 5 shows the results of the Exit-point methodology on real biological sequences. We have chosen l = 20 and d = 2. 't' indicates the number of sequences in the real data. For the biological samples taken from [1,12], the value m once again is the average number of random projection + EM cycles required to discover the motif. All other parameter values (like projection size k = 7 and threshold s = 4) are chosen to be the same as those used in the Random projection paper [1]. All of the motifs were recovered with m = 1 using the Exit-point strategy. The Random Projection algorithm alone needed multiple cycles (m = 8 in some cases and m = 15 in others) in order to retrieve the correct motif. This elucidates the fact that global methods can only be used to a certain extent and should be combined with refined local heuristics in order to obtain better efficiency. Since the random projection algorithm has outperformed other prominent motif finding algorithms like SP-STAR, WINNOWER, Gibbs sampling etc., we did not repeat the same experiments that were conducted in [1]. Running one cycle of random projection + EM is much more expensive computationally. The main advantage of our strategy comes from the deterministic nature of our algorithm in refining motifs. Table 5 Results on real datasets. Results of Exit-point method on biological samples. The real motifs were obtained in all the six cases using the Exit-point framework. Sequence Sample Size t Best (20,2) Motif Reference Motif E. coli CRP 1890 18 TGTGAAATAGATCACATTTT TGTGANNNNGNTCACA preproinsulin 7689 4 GGAAATTGCAGCCTCAGCCC CCTCAGCCC DHFR 800 4 CTGCAATTTCGCGCCAAACT ATTTCNNGCCA metallothionein 6823 4 CCCTCTGCGCCCGGACCGGT TGCRCYCGG c-fos 3695 5 CCATATTAGGACATCTGCGT CCATATTAGAGACTCT yeast ECB 5000 5 GTATTTCCCGTTTAGGAAAA TTTCCCNNTNAGGAAA Let the cost of applying EM algorithm for a given bucket be f and let the average number of buckets for a given projection be b. Then the running time of the Exit-point method will be O(cbf) where c is a constant that is linear in l-the length of the motif. If there were m projections, then cost of the random projection algorithm using restarts will be O(mbf). The two main advantages of using Exit-point strategy compared to random projection algorithm are : • It avoids multiple random projections which often provide similar optimal motifs. • It provides multiple optimal solutions in a promising region of a given bucket as opposed to a single solution provided by random projection algorithm.