EXMOTIF and RISO: comparison For comparison, we extract structured motifs from 1,062 non-coding sequences (a total of 196,736 nucleotides) located between two divergent genes in the genome of B. subtilis [15-17]. Figure 7 and 8 compare the running time (in seconds) for EXMOTIF and RISO using exact matching and approximate matching, respectively. Experiments were done for different gap ranges, number of components, and quorum thresholds. Note that EXMOTIF has two options: one (shown as "exMOTIF" in the figures) for reporting only the number of sequences where the structured motifs occur, the other (shown as "exMOTIF(#)") for reporting both the number of sequences where the structured motifs occur and the actual occurrences. Also note that the current implementation of RISO does not report the actual occurrences; it reports only the frequency. Figure 7 EXMOTIF vs. RISO: Exact Matching. Figure 8 EXMOTIF vs. RISO: Approximate Matching. Exact matching In the first experiment, shown in Figure 7(a), we randomly generated 100 structured motif templates, with k ∈ [2,4] simple motifs of length l ∈ [4,7] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [0, 200]. The x-axis is sorted on the number of motifs extracted. For clarity we plot average times for the methods when the number of motifs extracted fall into the given range on the x-axis. For example, the time plotted for the range [102, 103) is the average time for all the random templates that produce between 100 and 1000 motifs. We find that the average running time for RISO across all extracted motifs is 120.7s, whereas for EXMOTIF it takes 88.4s for reporting only the supports, and 91.3s for also reporting all the occurrences. The median times were 26.3s, 8.5s, and 9.2s, respectively, indicating a 3 times speed-up of EXMOTIF over RISO. In the next set of experiments we varied one parameter while keeping the others fixed. We set the default quorum to 12% (q = 127), the default gap ranges to [0,100], the default simple motif length to l = 4 (NNNN), and the default number of components k = 3 (e.g., NNNN[0,100]NNNN[0,100]NNNN). In Figure 7(b), we plot the time as a function of the number of simple motifs k in the template. We find that as the number of components increases the time gap between EXMOTIF and RISO increases; for k = 4 simple motifs, EXMOTIF is around 5 times faster than RISO. Figure 7(c) shows the effect of increasing gap ranges, from [0,0] to [0,200]. We find that as the gap range increases the time for EXMOTIF increases at a slower rate compared to RISO. For [0,200], EXMOTIF is 3–4 times faster than RISO depending whether only frequency or full occurrences are reported. In Figure 7(d), as the quorum threshold increases, the running time goes down for both methods. For quorum 24%, EXMOTIF is 4–5 times faster than RISO. As support decreases, the gap narrows somewhat, but EXMOTIF remains 2–3 times faster. Finally, Figure 7(e) plots the effect of increasing simple motif lengths l ∈ [2,6]. We find that the time first increases and then decreases. This is because there are a large number of motif occurrences for length 3 and length 4, but relatively few occurrences for length 5 and length 6. Depending on the motif lengths, EXMOTIF can be 3–40 times faster than RISO for comparable output, i.e., reporting only the support. EXMOTIF remains up to 5 times faster when also reporting the actual occurrences. To compare the performance for extracting structured motifs with length ranges, we used the template T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = M1[50, 100] M2[1,50]M3[20, 100]M4 with q = 12%, where |M1| ∈ [2,4], |M2| ∈ [3,4], |M3| ∈ [5,6], |M4| ∈ [4,5]. EXMOTIF took 78.4s, whereas RISO took 1640.9s to extract 14,174 motifs. Approximate matching In the first experiment, shown in Figure 8(a), we randomly generated 30 structured motif templates, with k ∈ [2,3] simple motifs of length l ∈ [3,6] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [10, 30]. The x-axis is sorted on the number of motifs extracted, and average times are plotted for the extracted number of motifs in the given range. We find that the average running time for RISO is 334.5s, whereas for EXMOTIF it takes 59.3s seconds for reporting only the support, and 176.7s for also reporting all the occurrences. Thus EXMOTIF is on average 5 times faster than RISO, with comparable output. Figures 8(b)–(e) plot the time for approximate matching as a function of different parameters. We set the default quorum to 12% (q = 127, out of |S MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFse=uaaa@3845@| = 1062 sequences), the default gap ranges to [12,22], the default simple motif length to l = 6 (NNNNNN), and the default number of components k = 2 (e.g., NNNNNN[12,22]NNNNNN). Figure 8(b) shows how increasing gap ranges effect the running time; for gap range [8,26] between the two motif components, EXMOTIF is 2–3 times faster than RISO. In Figure 8(c), we increase the numbers of arbitrary substitutions allowed for each simple motif; a pair (ε1, ε2) on the x-axis denotes that ε1 substitutions are allowed for motif component M1, and ε2 for M2. We can see that EXMOTIF is always faster than RISO. It is 9 times faster when only frequencies are reported, and it can be up to 5 times faster then full occurrences are reported, though for some cases the difference is slight. Figure 8(d) plots the effect of the quorum threshold. Compared to RISO, EXMOTIF performs much better for low quorum, e.g., for q = 4% EXMOTIF is 4–5 times faster than RISO. Finally in Figure 8(e), as the simple motif lengths increase, the time for both EXMOTIF and RISO increases, and we find that EXMOTIF can be 2–3 times faster. We also studied the effect of quorum and allowed substitutions. Table 4 shows the comparative results for EXMOTIF and RISO. Here we used the template T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNNNNN[12, 22]NNNNNN to extract motifs from the 1062 subsequences from B. subtilis. We vary the quorum from low (5%) to high (90%), and vary the number of errors ei per simple motif (with more errors allowed for higher quorum). For a comparable output (when only the frequency is reported), EXMOTIF outperforms RISO, especially for high quorum and high number of errors. It is interesting that for this latter case, reporting all occurrences incurs significant overhead. For example for q = 90% and with (e1 = 3, e2 = 3), EXMOTIF is 20 times faster than RISO, but EXMOTIF(#) is 3 times slower! Table 4 Comparison of EXMOTIF and RISO for different quorums and allowed substitutions. Quorum #Substitutions RISO EXMOTIF EXMOTIF(#) 5% (0, 0) 1.82s 1.42s 1.52s 30% (1, 1) 63.01s 58.91s 64.52s 60% (2, 2) 2763.31s 328.43s 2317.35s 90% (3, 3) 13682.13s 707.56s 41464.93s The template used is T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNNNNN[12,22]NNNNNN. #Substitutions shows the number of errors (e1, e2) allowed for the two simple components.