Approximate matching
In the first experiment, shown in Figure 8(a), we randomly generated 30 structured motif templates, with k ∈ [2,3] simple motifs of length l ∈ [3,6] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [10, 30]. The x-axis is sorted by the number of motifs extracted, and average times are plotted for the extracted number of motifs in the given range. We find that the average running time for RISO is 334.5s, whereas EXMOTIF takes 59.3s when reporting only the support, and 176.7s when also reporting all the occurrences. Thus, for comparable output (support only), EXMOTIF is on average about 5 times faster than RISO.
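The template generation described above can be sketched as follows. This is a minimal illustration, not the script used for the experiments: the function names `random_template` and `format_template` are hypothetical, and drawing the gap sub-interval by sampling two endpoints uniformly and sorting them is an assumption, since the text only states that the sub-interval is random.

```python
import random

def random_template(rng, k_range=(2, 3), l_range=(3, 6), gap_bounds=(10, 30)):
    """Draw one structured motif template as (component lengths, gap ranges).

    k and each l are uniform over their ranges, as stated in the text.
    The gap sub-interval is built from two sorted uniform endpoints
    (an assumption; the sampling scheme is not given in the text).
    """
    k = rng.randint(*k_range)
    lengths = [rng.randint(*l_range) for _ in range(k)]
    gaps = []
    for _ in range(k - 1):
        lo, hi = sorted(rng.randint(*gap_bounds) for _ in range(2))
        gaps.append((lo, hi))
    return lengths, gaps

def format_template(lengths, gaps):
    """Render a template in the NNN[lo,hi]NNN notation used in the text."""
    parts = ["N" * lengths[0]]
    for (lo, hi), length in zip(gaps, lengths[1:]):
        parts.append(f"[{lo},{hi}]")
        parts.append("N" * length)
    return "".join(parts)

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
templates = [format_template(*random_template(rng)) for _ in range(30)]
for t in templates[:3]:
    print(t)
```

Under this sampling scheme a gap range may degenerate to a single value (lo = hi); whether the original experiments allowed that is not stated.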
Figures 8(b)–(e) plot the time for approximate matching as a function of different parameters. We set the default quorum to 12% (q = 127, out of |S| = 1062 sequences), the default gap range to [12,22], the default simple motif length to l = 6 (NNNNNN), and the default number of components to k = 2 (e.g., NNNNNN[12,22]NNNNNN). Figure 8(b) shows how increasing the gap range affects the running time; for gap range [8,26] between the two motif components, EXMOTIF is 2–3 times faster than RISO. In Figure 8(c), we increase the number of arbitrary substitutions allowed for each simple motif; a pair (ε1, ε2) on the x-axis denotes that ε1 substitutions are allowed for motif component M1, and ε2 for M2. We can see that EXMOTIF is always faster than RISO. It is 9 times faster when only frequencies are reported, and it can be up to 5 times faster when full occurrences are reported, though for some cases the difference is slight.
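To make the parameters above concrete, the following sketch checks by brute force whether a two-component motif occurs in a sequence with at most (ε1, ε2) substitutions, and counts the supporting sequences. It is not the EXMOTIF or RISO algorithm, only a naive reference for the matching semantics. The function names are hypothetical, the gap is assumed to be the number of symbols between the end of M1 and the start of M2, and truncating the quorum percentage to an integer count is an assumption chosen because it reproduces q = 127 for 12% of 1062.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(seq, m1, m2, gap, eps=(0, 0)):
    """True if seq contains m1, then gap[0]..gap[1] symbols, then m2,
    with at most eps[0] substitutions in m1 and eps[1] in m2."""
    lo, hi = gap
    for i in range(len(seq) - len(m1) + 1):
        if hamming(seq[i:i + len(m1)], m1) > eps[0]:
            continue
        end1 = i + len(m1)
        for g in range(lo, hi + 1):
            j = end1 + g
            if j + len(m2) > len(seq):
                break  # larger gaps would run past the end of seq
            if hamming(seq[j:j + len(m2)], m2) <= eps[1]:
                return True
    return False

def support(seqs, m1, m2, gap, eps=(0, 0)):
    """Number of sequences containing at least one approximate occurrence."""
    return sum(occurs(s, m1, m2, gap, eps) for s in seqs)

def quorum_count(percent, n_seqs):
    """Minimum support implied by a percentage quorum (truncation assumed)."""
    return percent * n_seqs // 100

seqs = [
    "ACGTAC" + "T" * 12 + "GGATCC",  # exact occurrence, gap of 12
    "ACGTAC" + "T" * 11 + "GGATCC",  # gap of 11, outside [12,22]
    "ACGAAC" + "T" * 12 + "GGATCC",  # one substitution in the first component
]
print(support(seqs, "ACGTAC", "GGATCC", (12, 22)))
print(support(seqs, "ACGTAC", "GGATCC", (12, 22), eps=(1, 0)))
print(quorum_count(12, 1062))
```

A motif would be reported only if its support reaches the quorum count. Widening the gap range or raising ε enlarges the set of candidate occurrences that must be examined, which is consistent with the running-time growth shown in Figures 8(b) and 8(c).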
Figure 8(d) plots the effect of the quorum threshold. Compared to RISO, EXMOTIF performs much better for low quorum; e.g., for q = 4%, EXMOTIF is 4–5 times faster than RISO. Finally, in Figure 8(e), as the simple motif length increases, the time for both EXMOTIF and RISO increases, and we find that EXMOTIF can be 2–3 times faster.
We also studied the combined effect of quorum and allowed substitutions. Table 4 shows the comparative results for EXMOTIF and RISO. Here we used the template T = NNNNNN[12,22]NNNNNN to extract motifs from the 1062 subsequences from B. subtilis. We vary the quorum from low (5%) to high (90%), and vary the number of errors e_i per simple motif (with more errors allowed for higher quorum). For a comparable output (when only the frequency is reported), EXMOTIF outperforms RISO, especially for high quorum and a high number of errors. Interestingly, in this latter case, reporting all occurrences incurs significant overhead. For example, for q = 90% and (e1 = 3, e2 = 3), EXMOTIF is about 20 times faster than RISO, but EXMOTIF(#), which also reports all occurrences, is about 3 times slower than RISO.
Table 4  Comparison of EXMOTIF and RISO for different quorums and allowed substitutions.
Quorum  #Substitutions  RISO  EXMOTIF  EXMOTIF(#)
5%  (0, 0)  1.82s  1.42s  1.52s
30%  (1, 1)  63.01s  58.91s  64.52s
60%  (2, 2)  2763.31s  328.43s  2317.35s
90%  (3, 3)  13682.13s  707.56s  41464.93s
The template used is T = NNNNNN[12,22]NNNNNN. #Substitutions shows the number of errors (e1, e2) allowed for the two simple components.
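The speedup factors quoted in the text follow directly from the running times in Table 4. The short sketch below recomputes them. The timings are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# Running times in seconds from Table 4: (RISO, EXMOTIF, EXMOTIF(#)).
table4 = {
    "5%":  (1.82, 1.42, 1.52),
    "30%": (63.01, 58.91, 64.52),
    "60%": (2763.31, 328.43, 2317.35),
    "90%": (13682.13, 707.56, 41464.93),
}

ratios = {}
for quorum, (riso, exmotif, exmotif_occ) in table4.items():
    ratios[quorum] = {
        # >1 means EXMOTIF (frequency only) is faster than RISO
        "riso_over_exmotif": riso / exmotif,
        # >1 means EXMOTIF(#) (all occurrences) is slower than RISO
        "exmotif_occ_over_riso": exmotif_occ / riso,
    }

for quorum, r in ratios.items():
    print(f"{quorum:>4}  RISO/EXMOTIF = {r['riso_over_exmotif']:.2f}  "
          f"EXMOTIF(#)/RISO = {r['exmotif_occ_over_riso']:.2f}")
```

For the 90% row, the first ratio is a little over 19 and the second a little over 3, matching the "20 times faster" and "3 times slower" figures stated above.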