Application: composite regulatory patterns

The complex transcriptional regulatory networks of eukaryotic organisms usually require interactions among multiple transcription factors. A potential application of SMOTIF is to search for such composite regulatory binding sites in DNA sequences. We took two such transcription factors, URS1H and UASH, that are known to cooperatively regulate 11 yeast genes [28]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the order of the binding sites is reversed. We took the binding sites for the 10 genes and, after multiple alignment, obtained the composite motif NNDTBNGDWGDNNDH[5,179]WBRGCSGCYVW, where each column of the alignment is represented by the IUPAC symbol corresponding to the set of bases that appear at that position. We also extracted the profile for these 10 binding sites. Table 10 shows the binding sites for the 10 genes, their alignment, the start positions, and the distances between the sites (the difference of start positions). The smallest distance is 20 and the largest is 194. Since these are differences of start positions, the variable gap range is obtained by subtracting the length of UASH (15), giving l = 20 - 15 = 5 and u = 194 - 15 = 179. Notice also how the alignment preserves the highly conserved GCSGC region in URS1H.
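The motif construction described above can be sketched in a few lines. This is only an illustration and not SMOTIF code: the function names `consensus` and `gap_range` are hypothetical, the site strings are copied from Table 10, and the treatment of the "==" padding (any column containing padding is left as padding) is an assumption made to match the way the motif row of Table 10 is printed.

```python
# Illustrative sketch only; names are hypothetical and this is not SMOTIF code.

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(aligned_sites, pad="="):
    """Collapse equal-length aligned sites into one IUPAC string.

    Assumption: a column that contains the padding symbol in any row is
    reported as padding, mirroring the motif row of Table 10.
    """
    symbols = []
    for column in zip(*aligned_sites):
        if pad in column:
            symbols.append(pad)
        else:
            symbols.append(IUPAC[frozenset(column)])
    return "".join(symbols)


def gap_range(min_start_distance, max_start_distance, first_site_length):
    """Turn distances between start positions into a gap range [l, u]."""
    return (min_start_distance - first_site_length,
            max_start_distance - first_site_length)


# Sites as listed in Table 10 (ZIP1, MEI4, DMC1, SPO13, MER1, SPO16,
# REC104, RED1, REC114, MEK1).
UASH_SITES = [
    "GATTCGGAAGTAAAA", "TCTTTCGGAGTCATA", "TTGTGTGGAGAGATA",
    "TAATTAGGAGTATAT", "GGTTTTGTAGTTCTA", "CATTGTGATGTATTT",
    "CAATTTGGAGTAGGC", "ATTTCTGGAGATATC", "GATTTTGTAGGAATA",
    "TCATTTGTAGTTTAT",
]
URS1H_ALIGNED = [
    "==TCGGCGGCTAAAT", "==TGGGCGGCTAAAT", "AAATAGCCGCCCA==",
    "AAATAGCCGCCGA==", "TTTTAGCCGCCGA==", "==TGGGCGGCTAAAA",
    "==TTGGCGGCTATTT", "==TCAGCGGCTAAAT", "==TGGGCGGCTAACT",
    "==ATGGCGGCTAAAT",
]

if __name__ == "__main__":
    print(consensus(UASH_SITES))
    print(consensus(URS1H_ALIGNED))
    print(gap_range(20, 194, 15))
```

Run on the sites of Table 10, the sketch yields NNDTBNGDWGDNNDH for UASH and ==WBRGCSGCYVW== for URS1H, and the stated distances 20 and 194 give the gap range (5, 179).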
Table 10. UASH and URS1H binding sites

Gene | UASH site | UASH pos | URS1H site | URS1H pos | Distance
ZIP1 | GATTCGGAAGTAAAA | -42 | ==TCGGCGGCTAAAT | -22 | 20
MEI4 | TCTTTCGGAGTCATA | -121 | ==TGGGCGGCTAAAT | -98 | 23
DMC1 | TTGTGTGGAGAGATA | -175 | AAATAGCCGCCCA== | -143 | 32
SPO13 | TAATTAGGAGTATAT | -119 | AAATAGCCGCCGA== | -100 | 19
MER1 | GGTTTTGTAGTTCTA | -152 | TTTTAGCCGCCGA== | -115 | 37
SPO16 | CATTGTGATGTATTT | -201 | ==TGGGCGGCTAAAA | -90 | 111
REC104 | CAATTTGGAGTAGGC | -182 | ==TTGGCGGCTATTT | -93 | 89
RED1 | ATTTCTGGAGATATC | -355 | ==TCAGCGGCTAAAT | -167 | 188
REC114 | GATTTTGTAGGAATA | -288 | ==TGGGCGGCTAACT | -94 | 194
MEK1 | TCATTTGTAGTTTAT | -233 | ==ATGGCGGCTAAAT | -150 | 83
Motif | NNDTBNGDWGDNNDH | | ==WBRGCSGCYVW== | | [5,179]

We then searched for the structured motif in the upstream regions of all 5873 genes in the yeast genome. We used the -800 to -1 upstream regions, and truncated a segment if it overlapped with an upstream open reading frame (ORF). This left 5794 sequences with an average length of 497 bases. Searching for the IUPAC pattern, we found 65 occurrences, including the 10 originally known sites, within 1 second. Searching for the profile with 5 core positions in each simple motif, λc = 0.6 as the core score threshold, and λ = 0.8 as the total score threshold, we found 56 occurrences in 0.18 seconds. For each occurrence, we then extracted its actual sequence segment from the matching upstream region. Since the structured motif represented by IUPAC symbols may be too general, we calculated the Hamming distance from each matching segment to each of the 10 known binding sites. We then selected an occurrence as a possible new binding site if its minimum Hamming distance to any of the 10 known sites was within a given maximum threshold. Table 11 lists the 12 newly found occurrences in the entire yeast genome (upstream regions) using a Hamming distance threshold of 5. The sites discovered using both pattern and profile search are listed.
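The Hamming-distance filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `hamming`, `closest_known`, and `is_candidate` are hypothetical, and the sketch assumes that the distance is taken over the 15-base UASH segment concatenated with the 11-base URS1H segment matching WBRGCSGCYVW. That assumption is not stated explicitly in the text, although it agrees with the "4:DMC1" entry reported for YJL045W in Table 11.

```python
# Illustrative sketch only; names are hypothetical and this is not SMOTIF code.

# Known sites from Table 10: gene -> (UASH 15-mer, URS1H 11-base aligned core).
KNOWN_SITES = {
    "ZIP1": ("GATTCGGAAGTAAAA", "TCGGCGGCTAA"),
    "MEI4": ("TCTTTCGGAGTCATA", "TGGGCGGCTAA"),
    "DMC1": ("TTGTGTGGAGAGATA", "ATAGCCGCCCA"),
    "SPO13": ("TAATTAGGAGTATAT", "ATAGCCGCCGA"),
    "MER1": ("GGTTTTGTAGTTCTA", "TTAGCCGCCGA"),
    "SPO16": ("CATTGTGATGTATTT", "TGGGCGGCTAA"),
    "REC104": ("CAATTTGGAGTAGGC", "TTGGCGGCTAT"),
    "RED1": ("ATTTCTGGAGATATC", "TCAGCGGCTAA"),
    "REC114": ("GATTTTGTAGGAATA", "TGGGCGGCTAA"),
    "MEK1": ("TCATTTGTAGTTTAT", "ATGGCGGCTAA"),
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def closest_known(uash, urs1h, known=KNOWN_SITES):
    """Return (distance, gene) for the known site nearest to the candidate.

    Assumption: the distance is computed over the concatenation of the
    UASH and URS1H segments. Ties are broken by gene name.
    """
    candidate = uash + urs1h
    return min(
        (hamming(candidate, k_uash + k_urs1h), gene)
        for gene, (k_uash, k_urs1h) in known.items()
    )


def is_candidate(uash, urs1h, threshold=5, known=KNOWN_SITES):
    """Keep an occurrence if its nearest known site is within the threshold."""
    distance, _ = closest_known(uash, urs1h, known)
    return distance <= threshold


if __name__ == "__main__":
    # YJL045W occurrence from Table 11.
    print(closest_known("TTTTGTGAAGAGATA", "TTAGCCGCTCA"))
```

For the YJL045W occurrence the sketch reports a distance of 4 to DMC1, so the occurrence passes the threshold of 5 used for Table 11.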
These new occurrences could be putative binding sites for the two transcription factors UASH and URS1H.

Table 11. Potential binding sites

Gene | UASH site | UASH pos | URS1H site | URS1H pos | Hamming distance
MES1 | GATTTTGAAGTAGGA | -438 | TTAGCCGCCGA | -246 | 5:MER1
YJL045W | TTTTGTGAAGAGATA | -407 | TTAGCCGCTCA | -273 | 4:DMC1
HSP60 | GTTTTTGTAGGTATA | -329 | ATAGCCGCCCA | -252 | 5:MER1
SPO1 | ATTTTTGAAGTTAAC | -192 | TCAGCGGCTAT | -90 | 5:RED1
MEK1 | TCATTTGTAGTTTAT | -233 | TCGGCGGCTAT | -136 | 3:MEK1
YIG1 | ATTTCCGGAGTTTTC | -183 | TCGGCGGCTAT | -140 | 5:RED1
†AGP1 | CCTTTTGATGACTTT | -786 | TCGGCGGCTAA | -699 | 5:SPO16
†AGP1 | CCTTTTGATGACTTT | -786 | TCGGCGGCTAA | -668 | 5:SPO16
†REC114 | CATTTTGGTGGGTTC | -158 | TGGGCGGCTAA | -94 | 5:SPO16
†GNT1 | TCATTTGGAGAATAT | -340 | ATAGCCGCCAT | -299 | 5:SPO13
‡MEK1 | TTATATGCAGTATAT | -276 | ATGGCGGCTAA | -150 | 4:MEK1
‡MMS1 | AACTCTGTAGTTATA | -643 | TGGGCGGCTAA | -497 | 5:REC114

For each occurrence we give the name of the gene corresponding to the upstream region, the sites and positions for UASH and URS1H, and the Hamming distance together with the closest known gene with the cooperative binding sites. For example, 5:SPO16 in the AGP1 rows means that the Hamming distance between AGP1 and SPO16 was 5. †: found only by IUPAC pattern search; ‡: found only by profile search.

Upon further analysis, we found that the new occurrence in MEK1 (at positions -233, -136) is in fact also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this occurrence is not reported in SCPD. To further analyze the remaining new occurrences, we consulted the SGD (Saccharomyces Genome Database) Gene Ontology Term Finder [29] to find the inter-relationships between the genes.
The first three rows of Table 12 show the significant GO terms (biological process or molecular function) that are common to the gene corresponding to a new occurrence and its closest (known) gene. The rest of the table shows the significant terms among the 18 genes. These results indicate that at least some of the new occurrences (such as those for SPO1, HSP60, MES1, and GNT1) have the potential to be binding sites, since their genes share some significant processes with the genes of the known sites. Of these, SPO1 is the most promising candidate for a new binding site, since it is known that UASH and URS1H are involved in early meiotic expression during sporulation [28]. Table 12 shows that SPO1 shares meiosis and M phase of meiotic cell cycle with the rest of the genes. After searching for SPO1 in the SGD database, we found that SPO1 is a transcriptional regulator involved in sporulation, and required for middle and late meiotic expression. This increases our confidence that the occurrence upstream of SPO1 is a previously unknown binding site.

Table 12. Genes and significant Gene Ontology (GO) terms

Genes | Significant GO terms | p-value
MES1, MER1 | RNA metabolism | 5.7e-3
HSP60, MER1 | nucleic acid binding | 7.9e-3
SPO1, RED1 | M phase of meiotic cell cycle, meiotic cell cycle, meiosis, M phase | 1.1e-3
‡MMS1, REC114 | DNA recombination, DNA metabolism | 6.0e-3
MES1, REC114, †GNT1, MEK1, MEI4, DMC1, MER1, REC104 | biopolymer metabolism, macromolecule metabolism | 2.3e-4
REC114, SPO1, MEK1, ZIP1, MEI4, DMC1, SPO13, MER1, REC104, RED1, HOP1 | meiosis, M phase of meiotic cell cycle, meiotic cell cycle, M phase, cell cycle | 1.6e-14
HSP60, DMC1, RED1, HOP1 | DNA binding | 6.7e-7
HSP60, DMC1, RED1 | structure-specific DNA binding | 3.1e-8
HSP60, DMC1 | single-stranded DNA binding | 3.7e-6
MEI4, DMC1, REC104, REC114, ‡MMS1 | DNA recombination, DNA metabolism | 2.8e-6
MEI4, DMC1, MER1, REC104, REC114, MEK1, MES1, ‡MMS1 | biopolymer metabolism, macromolecule metabolism | 2.3e-4

†: found only by IUPAC pattern search; ‡: found only by profile search.

Finally, since we knew that in gene HOP1 the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We searched for the original and the reversed motifs using a Hamming threshold of 6. We found 34 new binding sites where UASH can appear either upstream or downstream from URS1H. Among these we found two potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201, -534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites.
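Searching for the two components in either order can be illustrated with ordinary regular expressions. The sketch below is not the SMOTIF algorithm: it translates each IUPAC symbol into a character class and uses Python's `re` module, and the names `to_regex`, `find_sites`, and `search_both_orders` are hypothetical. The gap range is left as a parameter, because the text does not state which range was used for the reversed motif. Unlike a full structured motif search, this sketch reports at most one occurrence (the one with the shortest gap) per start position of the first component.

```python
# Illustrative sketch only; a regex stand-in, not the SMOTIF algorithm.
import re

# Character class for each IUPAC nucleotide code.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

UASH = "NNDTBNGDWGDNNDH"
URS1H = "WBRGCSGCYVW"


def to_regex(iupac_motif):
    """Translate an IUPAC motif into a regular expression fragment."""
    return "".join(IUPAC_CLASSES[symbol] for symbol in iupac_motif)


def find_sites(sequence, first, second, min_gap, max_gap):
    """Return (start of first, start of second) for each match.

    A lookahead lets matches overlap; the lazy gap means only the
    shortest admissible gap is reported for each start position.
    """
    pattern = re.compile(
        f"(?=({to_regex(first)})[ACGT]{{{min_gap},{max_gap}}}?"
        f"({to_regex(second)}))"
    )
    return [(m.start(1), m.start(2)) for m in pattern.finditer(sequence)]


def search_both_orders(sequence, min_gap, max_gap):
    """Search with UASH first (original) and with URS1H first (reversed)."""
    return {
        "UASH_first": find_sites(sequence, UASH, URS1H, min_gap, max_gap),
        "URS1H_first": find_sites(sequence, URS1H, UASH, min_gap, max_gap),
    }


if __name__ == "__main__":
    # Synthetic sequence: the ZIP1 sites from Table 10 with a 5-base spacer.
    toy = "CC" + "GATTCGGAAGTAAAA" + "ACGTA" + "TCGGCGGCTAA" + "GG"
    print(search_both_orders(toy, 5, 179))
```

On the synthetic sequence only the original order matches; swapping the two sites in the sequence makes only the reversed order match.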