Real applications

Discovery of single transcription factor binding sites
We evaluate our algorithm by extracting the conserved features of known transcription factor binding sites in yeast. In particular we used the binding sites for the Zinc (Zn) factors [11]. There are 11 binding sites listed for the Zn cluster, 3 of which are simple motifs. The remaining 8 are structured, as shown in Table 5. For the evaluation, we first form several structured motif templates according to the conserved features in the binding sites. Then we extract the frequent structured motifs satisfying these templates from the upstream regions of 68 genes regulated by zinc factors [11]. We used the -1000 to -1 upstream regions, truncating the region if and where it overlaps with an upstream open-reading frame (ORF). After extraction, since binding sites cannot have many occurrences in the ORF regions, we drop some motifs if they also occur frequently in the ORF regions (i.e., within the genes). Finally, we calculate the Z-scores for the remaining frequent motifs, and rank them by descending Z-scores. In our experiments, we set the minimum quorum threshold to 7% within the upstream regions and the maximum support threshold to 30% in the ORF regions. We use the shuffling program from SMILE [14] to compute the Z-scores. The shufffing program randomly shuffles the original input sequences to obtain a new shuffled set of sequences.
Then it computes, for each extracted frequent motif, its support (π) and weighted support (πw) in the shuffled set. For a given frequent motif ℳ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFZestaaa@3790@, let μ and σ be the mean and standard deviation of its support across different sets (about 30) of shuffled sequences. Then the Z-score for each motif is calculated as: Z=π(ℳ)−μσ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaatuuDJXwAK1uy0HwmaeXbfv3ySLgzG0uy0Hgip5wzaGqbaiab=Lr8Ajabg2da9maalaaabaacciGae4hWdaNaeiikaGccdaGae03mH0KaeiykaKIaeyOeI0Iae4hVd0gabaGae43Wdmhaaaaa@4BE7@. Likewise we can also calculate the Z-score for each frequent motif by using the weighted support (which is also applicable for the repeated structured motif identification problem). As shown in Table 5, we can successfully predict GAL4, GAL4 chips, LEU3, PPR1 and PUT3 with the highest rank. CAT8 and LYS also have high ranks. We were thus able to extract all eight transcription factors for the Zinc factors with high confidence. As a comparison, with the same dataset RISO can only predict GAL4, LEU3 and PPR1.
Table 5  Regulons of Zn cluster proteins.
TF Name  Known Motif  Predicted Motifs  Num-Motifs  Ranking
GAL4GAL4 chips  CGGRnnRCYnYnCnCCG  CGG[11,11]CCG  1634(3346)  1/1
CAT8  CGGnnnnnnGGA  CGG[6,6]GGA  1621(3356)  147/13
HAP1  CGGnnnTAnCGGCGGnnnTAnCGGnnnTA  CGG[6,6]CGG  1621(3356)  111/146
LEU3  RCCGGnnCCGGY  CCG[4,4]CGG  1588(3366)  2/1
LYS  WWWTCCRnYGGAWWW  TCC[3,3]GGA  1605(3360)  33/21
PPR1  WYCGGnnWWYKCCGAW  CGG[6,6]CCG  1621(3356)  1/2
PUT3  YCGGnAnGCGnAnnnCCGACGGnAnGCnAnnnCCGA  CGG[10,11]CCG  727(4035)  1/1
TF Name stands for transcription factor name; Known Motif stands for the known binding sites corresponding to the transcription factors in TF Name column; Predicted Motifs stands for the motifs predicted by EXMOTIF; Num-Motifs gives the final (original) number of motifs extracted (final is after pruning those motifs that are also frequent in the ORF regions); Ranking stands for the Z-score ranking based on support/weighted support.

Discovery of composite regulatory patterns
The complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of EXMOTIF is to extract such composite regulatory binding sites from DNA sequences. We took two such transcription factors, URS1H and UASH, which are involved in early meiotic expression during sporulation, and that are known to cooperatively regulate 11 yeast genes [24]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes (all except HOP1), and after their multiple alignment, we obtained their consensus: taTTTtGGAGTaata[4,179]ttGGCGGCTAA (the lower case letters are less conserved, whereas uppercase letters are the most conserved). Table 6 shows the binding sites for UASH and URS1H for the 10 genes, their start positions, their alignment, and the consensus pattern. The gap between the sites are obtained after subtracting the length of UASH, 15, from the position difference (since the start position of UASH is given). The smallest gap is l = 119 - 110 - 15 = 4 and the largest is u = 288 - 94 - 15 = 179. Based on the on most conserved parts of the consensus, we formed the composite motif template: T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNN[1,1]NNNNN[10,185]NNNNNNNNN (note the 6 additional gaps added to [4,179] to account for the non-conserved positions). We then extracted the structured motifs in the upstream regions of the 10 genes. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream ORF. The numbers of substitutions for NNN, NNNNN and NNNNNNNNN were set to ε1 = 1, ε2 = 2 and ε3 = 1, respectively. The quorum thresholds was set to q = 0.7 with the upstreams, and the maximum support within genes was set to 0.1% The rank of the true motif TTT[1,1]GGAGT[10,185]GGCGGCTAA was 290 (out of 5284 final motifs) with a Z-score of 22.61.
Table 6  UASH and URS1H binding sites.
Genes   UASH   URS1H   Gap
Site  Pos  Site  Pos
ZIP1  GATTCGGAAGTAAAA  -42  ==TCGGCGGCTAAAT  -22  5
MEI4  TCTTTCGGAGTCATA  -121  ==TGGGCGGCTAAAT  -98  8
DMC1  TTGTGTGGAGAGATA  -175  AAATAGCCGCCCA==  -143  17
SPO13  TAATTAGGAGTATAT  -119  AAATAGCCGCCGA==  -100  4
MER1  GGTTTTGTAGTTCTA  -152  TTTTAGCCGCCGA==  -115  22
SPO16  CATTGTGATGTATTT  -201  ==TGGGCGGCTAAAA  -90  96
REC104  CAATTTGGAGTAGGC  -182  ==TTGGCGGCTATTT  -93  74
RED1  ATTTCTGGAGATATC  -355  ==TCAGCGGCTAAAT  -167  173
REC114  GATTTTGTAGGAATA  -288  ==TGGGCGGCTAACT  -94  179
MEK1  TCATTTGTAGTTTAT  -233  ==ATGGCGGCTAAAT  -150  68
Consensus  taTTTtGGAGTaata   ==ttGGCGGCTAA==   [4,179]