Comparison on random motifs
Here we used chromosome 20 of Homo Sapiens as the sequence; it has length 61M base pairs. We generate 100 random structured motifs in the ΣIUPAC alphabet, with k ∈ [3,8] simple motifs of length l ∈ [5,10] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [-5, 100]. Note that the negative minimum gap shows that SMOTIF can mine overlapping simple motifs. Here we also compare the structured profile search approach SMOTIF-P, as follows: for each random motif, we form a profile by first expanding the IUPAC symbols into their corresponding DNA symbols and assign them a random probability of occurrence which accounts for 90% of the share, whereas the other DNA symbols randomly share in the remaining 10%. We use SMOTIF-P to search for the profiles with 2 as the number of core positions in each simple motif, λc = 0.1 as the core score threshold and λ = 0.6 as the total score threshold. We use random motifs mainly to demonstrate the effects of various parameters on SMOTIF and SMARTFINDER.
Figure 13(a)–(d) show the results. Here we do not allow missing components. As noted before, we may find overlapping occurrences if a negative gap is present in a motif. Figure 13(a) shows how the running time varies with the sum of the number of occurrences of the simple motifs in each of the 100 random motifs. For clarity, each point reflects the average time for the number of occurrences in the given range on the x-axis. For example, the first point on the x-axis [0, 1) corresponds to the case when there are between 0 and 1 million occurrences found. The general trend is that it takes more time as the number of occurrences increases. Figure 13(b) shows the time with respect to the number of occurrences of the whole structured motif (again, for clarity, only average times are plotted for occurrences in the given ranges in the x-axis). We observe that the time increases slightly with increase in the occurrences. In general, the times are more sensitive to the number of intermediate (simple) occurrences. Figure 13(c) shows the effect of the number of simple components in the structured motif. Each point shows the average time over all motifs having the given number of simple motifs. Here again the time increases with increasing components. Finally Figure 13(d) shows the impact of the number of IUPAC symbols in the structured motifs; the trend being that the more the symbols the more time it takes to search. We also observe that the approaches scale linearly, on average, with respect to the different parameters. Also SMOTIF remains about 5–10 times faster than SMARTFINDER over all these experiments.
Figure 13  SMOTIF and SMARTFINDER Comparison: Random Motifs. The figure compares SMOTIF-1, SMOTIF-2 and SMARTFINDER, when searching for 100 randomly generated structured motifs in chromosome 20 of Homo sapiens. (a) shows how the running time varies with the sum of the number of occurrences of the simple motifs in each of the 100 random motifs. (b) shows the time with respect to the number of occurrences of the whole structured motif. (c) shows the effect of the number of simple motif components in the structured motif. (d) shows the impact of the number of IUPAC symbols in the structured motifs. Table 9 shows the mean and variance of the search times over all the 100 structured motifs. It also shows the time for finding only the start positions or the full positions for SMOTIF. Overall, for these random motifs, we find that on average SMOTIF-1 is the fastest, SMOTIF-2 and SMOTIF-P are comparable, and all three outperform SMARTFINDER by a factor of 4 to 6.
Table 9  Random Motifs: Mean and Variance
Algorithm  Mean(s)  Variance (s)
SMARTFINDER  44.42  24.85
SMOTIF-1 (full)  6.97  1.45
SMOTIF-1 (start)  6.93  1.46
SMOTIF-2 (full)  10.83  5.07
SMOTIF-2 (start)  10.81  5.07
SMOTIF-P (full)  9.67  2.95
SMOTIF-P (start)  9.66  2.97
full gives the time for full position recovery, whereas start gives the time for reporting only the start positions. It is interesting to note that SMOTIF is more stable than SMARTFINDER: SMOTIF-P has around 8 times less variance than SMARTFINDER, SMOTIF-2 has around 5 times less variance than SMARTFINDER, whereas SMOTIF-1 has around 17 times less variance than SMARTFINDER. Note also that the overhead in recovering the full positions from the start positions is negligible.