Experimental results EXMOTIF has been implemented in C++, and compiled with g++ v4.0.0 at optimization level 3 (-O3). We performed experiments on a Macintosh PowerPC G5 with dual 2.7GHz processors and 4GB memory running Mac OS X vl0.4.5. We compare our results with the latest version of RISO [15-17] (called RISOTTO [17]), the best previous algorithm for structured motif extraction problem. EXMOTIF and RISO: comparison For comparison, we extract structured motifs from 1,062 non-coding sequences (a total of 196,736 nucleotides) located between two divergent genes in the genome of B. subtilis [15-17]. Figure 7 and 8 compare the running time (in seconds) for EXMOTIF and RISO using exact matching and approximate matching, respectively. Experiments were done for different gap ranges, number of components, and quorum thresholds. Note that EXMOTIF has two options: one (shown as "exMOTIF" in the figures) for reporting only the number of sequences where the structured motifs occur, the other (shown as "exMOTIF(#)") for reporting both the number of sequences where the structured motifs occur and the actual occurrences. Also note that the current implementation of RISO does not report the actual occurrences; it reports only the frequency. Figure 7 EXMOTIF vs. RISO: Exact Matching. Figure 8 EXMOTIF vs. RISO: Approximate Matching. Exact matching In the first experiment, shown in Figure 7(a), we randomly generated 100 structured motif templates, with k ∈ [2,4] simple motifs of length l ∈ [4,7] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [0, 200]. The x-axis is sorted on the number of motifs extracted. For clarity we plot average times for the methods when the number of motifs extracted fall into the given range on the x-axis. For example, the time plotted for the range [102, 103) is the average time for all the random templates that produce between 100 and 1000 motifs. We find that the average running time for RISO across all extracted motifs is 120.7s, whereas for EXMOTIF it takes 88.4s for reporting only the supports, and 91.3s for also reporting all the occurrences. The median times were 26.3s, 8.5s, and 9.2s, respectively, indicating a 3 times speed-up of EXMOTIF over RISO. In the next set of experiments we varied one parameter while keeping the others fixed. We set the default quorum to 12% (q = 127), the default gap ranges to [0,100], the default simple motif length to l = 4 (NNNN), and the default number of components k = 3 (e.g., NNNN[0,100]NNNN[0,100]NNNN). In Figure 7(b), we plot the time as a function of the number of simple motifs k in the template. We find that as the number of components increases the time gap between EXMOTIF and RISO increases; for k = 4 simple motifs, EXMOTIF is around 5 times faster than RISO. Figure 7(c) shows the effect of increasing gap ranges, from [0,0] to [0,200]. We find that as the gap range increases the time for EXMOTIF increases at a slower rate compared to RISO. For [0,200], EXMOTIF is 3–4 times faster than RISO depending whether only frequency or full occurrences are reported. In Figure 7(d), as the quorum threshold increases, the running time goes down for both methods. For quorum 24%, EXMOTIF is 4–5 times faster than RISO. As support decreases, the gap narrows somewhat, but EXMOTIF remains 2–3 times faster. Finally, Figure 7(e) plots the effect of increasing simple motif lengths l ∈ [2,6]. We find that the time first increases and then decreases. This is because there are a large number of motif occurrences for length 3 and length 4, but relatively few occurrences for length 5 and length 6. Depending on the motif lengths, EXMOTIF can be 3–40 times faster than RISO for comparable output, i.e., reporting only the support. EXMOTIF remains up to 5 times faster when also reporting the actual occurrences. To compare the performance for extracting structured motifs with length ranges, we used the template T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = M1[50, 100] M2[1,50]M3[20, 100]M4 with q = 12%, where |M1| ∈ [2,4], |M2| ∈ [3,4], |M3| ∈ [5,6], |M4| ∈ [4,5]. EXMOTIF took 78.4s, whereas RISO took 1640.9s to extract 14,174 motifs. Approximate matching In the first experiment, shown in Figure 8(a), we randomly generated 30 structured motif templates, with k ∈ [2,3] simple motifs of length l ∈ [3,6] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [10, 30]. The x-axis is sorted on the number of motifs extracted, and average times are plotted for the extracted number of motifs in the given range. We find that the average running time for RISO is 334.5s, whereas for EXMOTIF it takes 59.3s seconds for reporting only the support, and 176.7s for also reporting all the occurrences. Thus EXMOTIF is on average 5 times faster than RISO, with comparable output. Figures 8(b)–(e) plot the time for approximate matching as a function of different parameters. We set the default quorum to 12% (q = 127, out of |S MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFse=uaaa@3845@| = 1062 sequences), the default gap ranges to [12,22], the default simple motif length to l = 6 (NNNNNN), and the default number of components k = 2 (e.g., NNNNNN[12,22]NNNNNN). Figure 8(b) shows how increasing gap ranges effect the running time; for gap range [8,26] between the two motif components, EXMOTIF is 2–3 times faster than RISO. In Figure 8(c), we increase the numbers of arbitrary substitutions allowed for each simple motif; a pair (ε1, ε2) on the x-axis denotes that ε1 substitutions are allowed for motif component M1, and ε2 for M2. We can see that EXMOTIF is always faster than RISO. It is 9 times faster when only frequencies are reported, and it can be up to 5 times faster then full occurrences are reported, though for some cases the difference is slight. Figure 8(d) plots the effect of the quorum threshold. Compared to RISO, EXMOTIF performs much better for low quorum, e.g., for q = 4% EXMOTIF is 4–5 times faster than RISO. Finally in Figure 8(e), as the simple motif lengths increase, the time for both EXMOTIF and RISO increases, and we find that EXMOTIF can be 2–3 times faster. We also studied the effect of quorum and allowed substitutions. Table 4 shows the comparative results for EXMOTIF and RISO. Here we used the template T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNNNNN[12, 22]NNNNNN to extract motifs from the 1062 subsequences from B. subtilis. We vary the quorum from low (5%) to high (90%), and vary the number of errors ei per simple motif (with more errors allowed for higher quorum). For a comparable output (when only the frequency is reported), EXMOTIF outperforms RISO, especially for high quorum and high number of errors. It is interesting that for this latter case, reporting all occurrences incurs significant overhead. For example for q = 90% and with (e1 = 3, e2 = 3), EXMOTIF is 20 times faster than RISO, but EXMOTIF(#) is 3 times slower! Table 4 Comparison of EXMOTIF and RISO for different quorums and allowed substitutions. Quorum #Substitutions RISO EXMOTIF EXMOTIF(#) 5% (0, 0) 1.82s 1.42s 1.52s 30% (1, 1) 63.01s 58.91s 64.52s 60% (2, 2) 2763.31s 328.43s 2317.35s 90% (3, 3) 13682.13s 707.56s 41464.93s The template used is T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNNNNN[12,22]NNNNNN. #Substitutions shows the number of errors (e1, e2) allowed for the two simple components. Real applications Discovery of single transcription factor binding sites We evaluate our algorithm by extracting the conserved features of known transcription factor binding sites in yeast. In particular we used the binding sites for the Zinc (Zn) factors [11]. There are 11 binding sites listed for the Zn cluster, 3 of which are simple motifs. The remaining 8 are structured, as shown in Table 5. For the evaluation, we first form several structured motif templates according to the conserved features in the binding sites. Then we extract the frequent structured motifs satisfying these templates from the upstream regions of 68 genes regulated by zinc factors [11]. We used the -1000 to -1 upstream regions, truncating the region if and where it overlaps with an upstream open-reading frame (ORF). After extraction, since binding sites cannot have many occurrences in the ORF regions, we drop some motifs if they also occur frequently in the ORF regions (i.e., within the genes). Finally, we calculate the Z-scores for the remaining frequent motifs, and rank them by descending Z-scores. In our experiments, we set the minimum quorum threshold to 7% within the upstream regions and the maximum support threshold to 30% in the ORF regions. We use the shuffling program from SMILE [14] to compute the Z-scores. The shufffing program randomly shuffles the original input sequences to obtain a new shuffled set of sequences. Then it computes, for each extracted frequent motif, its support (π) and weighted support (πw) in the shuffled set. For a given frequent motif ℳ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFZestaaa@3790@, let μ and σ be the mean and standard deviation of its support across different sets (about 30) of shuffled sequences. Then the Z-score for each motif is calculated as: Z=π(ℳ)−μσ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaatuuDJXwAK1uy0HwmaeXbfv3ySLgzG0uy0Hgip5wzaGqbaiab=Lr8Ajabg2da9maalaaabaacciGae4hWdaNaeiikaGccdaGae03mH0KaeiykaKIaeyOeI0Iae4hVd0gabaGae43Wdmhaaaaa@4BE7@. Likewise we can also calculate the Z-score for each frequent motif by using the weighted support (which is also applicable for the repeated structured motif identification problem). As shown in Table 5, we can successfully predict GAL4, GAL4 chips, LEU3, PPR1 and PUT3 with the highest rank. CAT8 and LYS also have high ranks. We were thus able to extract all eight transcription factors for the Zinc factors with high confidence. As a comparison, with the same dataset RISO can only predict GAL4, LEU3 and PPR1. Table 5 Regulons of Zn cluster proteins. TF Name Known Motif Predicted Motifs Num-Motifs Ranking GAL4GAL4 chips CGGRnnRCYnYnCnCCG CGG[11,11]CCG 1634(3346) 1/1 CAT8 CGGnnnnnnGGA CGG[6,6]GGA 1621(3356) 147/13 HAP1 CGGnnnTAnCGGCGGnnnTAnCGGnnnTA CGG[6,6]CGG 1621(3356) 111/146 LEU3 RCCGGnnCCGGY CCG[4,4]CGG 1588(3366) 2/1 LYS WWWTCCRnYGGAWWW TCC[3,3]GGA 1605(3360) 33/21 PPR1 WYCGGnnWWYKCCGAW CGG[6,6]CCG 1621(3356) 1/2 PUT3 YCGGnAnGCGnAnnnCCGACGGnAnGCnAnnnCCGA CGG[10,11]CCG 727(4035) 1/1 TF Name stands for transcription factor name; Known Motif stands for the known binding sites corresponding to the transcription factors in TF Name column; Predicted Motifs stands for the motifs predicted by EXMOTIF; Num-Motifs gives the final (original) number of motifs extracted (final is after pruning those motifs that are also frequent in the ORF regions); Ranking stands for the Z-score ranking based on support/weighted support. Discovery of composite regulatory patterns The complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of EXMOTIF is to extract such composite regulatory binding sites from DNA sequences. We took two such transcription factors, URS1H and UASH, which are involved in early meiotic expression during sporulation, and that are known to cooperatively regulate 11 yeast genes [24]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes (all except HOP1), and after their multiple alignment, we obtained their consensus: taTTTtGGAGTaata[4,179]ttGGCGGCTAA (the lower case letters are less conserved, whereas uppercase letters are the most conserved). Table 6 shows the binding sites for UASH and URS1H for the 10 genes, their start positions, their alignment, and the consensus pattern. The gap between the sites are obtained after subtracting the length of UASH, 15, from the position difference (since the start position of UASH is given). The smallest gap is l = 119 - 110 - 15 = 4 and the largest is u = 288 - 94 - 15 = 179. Based on the on most conserved parts of the consensus, we formed the composite motif template: T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNN[1,1]NNNNN[10,185]NNNNNNNNN (note the 6 additional gaps added to [4,179] to account for the non-conserved positions). We then extracted the structured motifs in the upstream regions of the 10 genes. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream ORF. The numbers of substitutions for NNN, NNNNN and NNNNNNNNN were set to ε1 = 1, ε2 = 2 and ε3 = 1, respectively. The quorum thresholds was set to q = 0.7 with the upstreams, and the maximum support within genes was set to 0.1% The rank of the true motif TTT[1,1]GGAGT[10,185]GGCGGCTAA was 290 (out of 5284 final motifs) with a Z-score of 22.61. Table 6 UASH and URS1H binding sites. Genes UASH URS1H Gap Site Pos Site Pos ZIP1 GATTCGGAAGTAAAA -42 ==TCGGCGGCTAAAT -22 5 MEI4 TCTTTCGGAGTCATA -121 ==TGGGCGGCTAAAT -98 8 DMC1 TTGTGTGGAGAGATA -175 AAATAGCCGCCCA== -143 17 SPO13 TAATTAGGAGTATAT -119 AAATAGCCGCCGA== -100 4 MER1 GGTTTTGTAGTTCTA -152 TTTTAGCCGCCGA== -115 22 SPO16 CATTGTGATGTATTT -201 ==TGGGCGGCTAAAA -90 96 REC104 CAATTTGGAGTAGGC -182 ==TTGGCGGCTATTT -93 74 RED1 ATTTCTGGAGATATC -355 ==TCAGCGGCTAAAT -167 173 REC114 GATTTTGTAGGAATA -288 ==TGGGCGGCTAACT -94 179 MEK1 TCATTTGTAGTTTAT -233 ==ATGGCGGCTAAAT -150 68 Consensus taTTTtGGAGTaata ==ttGGCGGCTAA== [4,179]