Real applications Discovery of single transcription factor binding sites We evaluate our algorithm by extracting the conserved features of known transcription factor binding sites in yeast. In particular we used the binding sites for the Zinc (Zn) factors [11]. There are 11 binding sites listed for the Zn cluster, 3 of which are simple motifs. The remaining 8 are structured, as shown in Table 5. For the evaluation, we first form several structured motif templates according to the conserved features in the binding sites. Then we extract the frequent structured motifs satisfying these templates from the upstream regions of 68 genes regulated by zinc factors [11]. We used the -1000 to -1 upstream regions, truncating the region if and where it overlaps with an upstream open-reading frame (ORF). After extraction, since binding sites cannot have many occurrences in the ORF regions, we drop some motifs if they also occur frequently in the ORF regions (i.e., within the genes). Finally, we calculate the Z-scores for the remaining frequent motifs, and rank them by descending Z-scores. In our experiments, we set the minimum quorum threshold to 7% within the upstream regions and the maximum support threshold to 30% in the ORF regions. We use the shuffling program from SMILE [14] to compute the Z-scores. The shufffing program randomly shuffles the original input sequences to obtain a new shuffled set of sequences. Then it computes, for each extracted frequent motif, its support (π) and weighted support (πw) in the shuffled set. For a given frequent motif ℳ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFZestaaa@3790@, let μ and σ be the mean and standard deviation of its support across different sets (about 30) of shuffled sequences. Then the Z-score for each motif is calculated as: Z=π(ℳ)−μσ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaatuuDJXwAK1uy0HwmaeXbfv3ySLgzG0uy0Hgip5wzaGqbaiab=Lr8Ajabg2da9maalaaabaacciGae4hWdaNaeiikaGccdaGae03mH0KaeiykaKIaeyOeI0Iae4hVd0gabaGae43Wdmhaaaaa@4BE7@. Likewise we can also calculate the Z-score for each frequent motif by using the weighted support (which is also applicable for the repeated structured motif identification problem). As shown in Table 5, we can successfully predict GAL4, GAL4 chips, LEU3, PPR1 and PUT3 with the highest rank. CAT8 and LYS also have high ranks. We were thus able to extract all eight transcription factors for the Zinc factors with high confidence. As a comparison, with the same dataset RISO can only predict GAL4, LEU3 and PPR1. Table 5 Regulons of Zn cluster proteins. TF Name Known Motif Predicted Motifs Num-Motifs Ranking GAL4GAL4 chips CGGRnnRCYnYnCnCCG CGG[11,11]CCG 1634(3346) 1/1 CAT8 CGGnnnnnnGGA CGG[6,6]GGA 1621(3356) 147/13 HAP1 CGGnnnTAnCGGCGGnnnTAnCGGnnnTA CGG[6,6]CGG 1621(3356) 111/146 LEU3 RCCGGnnCCGGY CCG[4,4]CGG 1588(3366) 2/1 LYS WWWTCCRnYGGAWWW TCC[3,3]GGA 1605(3360) 33/21 PPR1 WYCGGnnWWYKCCGAW CGG[6,6]CCG 1621(3356) 1/2 PUT3 YCGGnAnGCGnAnnnCCGACGGnAnGCnAnnnCCGA CGG[10,11]CCG 727(4035) 1/1 TF Name stands for transcription factor name; Known Motif stands for the known binding sites corresponding to the transcription factors in TF Name column; Predicted Motifs stands for the motifs predicted by EXMOTIF; Num-Motifs gives the final (original) number of motifs extracted (final is after pruning those motifs that are also frequent in the ORF regions); Ranking stands for the Z-score ranking based on support/weighted support. Discovery of composite regulatory patterns The complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of EXMOTIF is to extract such composite regulatory binding sites from DNA sequences. We took two such transcription factors, URS1H and UASH, which are involved in early meiotic expression during sporulation, and that are known to cooperatively regulate 11 yeast genes [24]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes (all except HOP1), and after their multiple alignment, we obtained their consensus: taTTTtGGAGTaata[4,179]ttGGCGGCTAA (the lower case letters are less conserved, whereas uppercase letters are the most conserved). Table 6 shows the binding sites for UASH and URS1H for the 10 genes, their start positions, their alignment, and the consensus pattern. The gap between the sites are obtained after subtracting the length of UASH, 15, from the position difference (since the start position of UASH is given). The smallest gap is l = 119 - 110 - 15 = 4 and the largest is u = 288 - 94 - 15 = 179. Based on the on most conserved parts of the consensus, we formed the composite motif template: T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFtepvaaa@3847@ = NNN[1,1]NNNNN[10,185]NNNNNNNNN (note the 6 additional gaps added to [4,179] to account for the non-conserved positions). We then extracted the structured motifs in the upstream regions of the 10 genes. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream ORF. The numbers of substitutions for NNN, NNNNN and NNNNNNNNN were set to ε1 = 1, ε2 = 2 and ε3 = 1, respectively. The quorum thresholds was set to q = 0.7 with the upstreams, and the maximum support within genes was set to 0.1% The rank of the true motif TTT[1,1]GGAGT[10,185]GGCGGCTAA was 290 (out of 5284 final motifs) with a Z-score of 22.61. Table 6 UASH and URS1H binding sites. Genes UASH URS1H Gap Site Pos Site Pos ZIP1 GATTCGGAAGTAAAA -42 ==TCGGCGGCTAAAT -22 5 MEI4 TCTTTCGGAGTCATA -121 ==TGGGCGGCTAAAT -98 8 DMC1 TTGTGTGGAGAGATA -175 AAATAGCCGCCCA== -143 17 SPO13 TAATTAGGAGTATAT -119 AAATAGCCGCCGA== -100 4 MER1 GGTTTTGTAGTTCTA -152 TTTTAGCCGCCGA== -115 22 SPO16 CATTGTGATGTATTT -201 ==TGGGCGGCTAAAA -90 96 REC104 CAATTTGGAGTAGGC -182 ==TTGGCGGCTATTT -93 74 RED1 ATTTCTGGAGATATC -355 ==TCAGCGGCTAAAT -167 173 REC114 GATTTTGTAGGAATA -288 ==TGGGCGGCTAACT -94 179 MEK1 TCATTTGTAGTTTAT -233 ==ATGGCGGCTAAAT -150 68 Consensus taTTTtGGAGTaata ==ttGGCGGCTAA== [4,179]