Application: composite regulatory patterns
Transcriptional regulation in eukaryotic organisms usually requires the interaction of multiple transcription factors. A potential application of SMOTIF is to search for such composite regulatory binding sites in DNA sequences. We took two such transcription factors, URS1H and UASH, that are known to cooperatively regulate 11 yeast genes [28]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the order of the binding sites is reversed. We took the binding sites for the 10 genes and, after multiple alignment, obtained the composite motif NNDTBNGDWGDNNDH[5,179]WBRGCSGCYVW, where each column of the alignment is represented by the IUPAC symbol for the set of bases that appear at that position. We also extracted the profile for these 10 binding sites. Table 10 shows the binding sites for the 10 genes, their alignment, the start positions, and the distances between the sites (the difference of the start positions). The smallest distance is 20 and the largest is 194. Since these are differences of start positions, the variable gap range is obtained by subtracting the length of UASH (15), giving l = 20 - 15 = 5 and u = 194 - 15 = 179. Notice also how the alignment preserves the highly conserved GCSGC region in URS1H.
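The column-wise construction of the IUPAC motif and the gap arithmetic can be illustrated with a minimal Python sketch. It is not part of SMOTIF; the function names `iupac_consensus` and `gap_range` are hypothetical, and the input is simply the ten UASH sites as listed in Table 10.

```python
# Illustrative sketch only (not SMOTIF code): derive an IUPAC consensus
# from aligned sites and compute the variable gap range [l, u].

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Inverse lookup: set of observed bases -> IUPAC symbol.
SYMBOL_FOR = {frozenset(bases): sym for sym, bases in IUPAC_BASES.items()}


def iupac_consensus(aligned_sites):
    """Return one IUPAC symbol per alignment column (hypothetical helper)."""
    return "".join(SYMBOL_FOR[frozenset(col)] for col in zip(*aligned_sites))


def gap_range(min_dist, max_dist, first_len):
    """Turn start-to-start distances into a gap range (hypothetical helper)."""
    return min_dist - first_len, max_dist - first_len


# The ten UASH sites from Table 10 (ZIP1 ... MEK1).
UASH_SITES = [
    "GATTCGGAAGTAAAA", "TCTTTCGGAGTCATA", "TTGTGTGGAGAGATA",
    "TAATTAGGAGTATAT", "GGTTTTGTAGTTCTA", "CATTGTGATGTATTT",
    "CAATTTGGAGTAGGC", "ATTTCTGGAGATATC", "GATTTTGTAGGAATA",
    "TCATTTGTAGTTTAT",
]

consensus = iupac_consensus(UASH_SITES)
l, u = gap_range(20, 194, len(consensus))
print(consensus, l, u)  # NNDTBNGDWGDNNDH 5 179
```

The same helper applied to the eleven aligned URS1H columns would be expected to give the second component of the composite motif.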
Table 10  UASH and URS1H Binding Sites
Gene  UASH site  UASH pos  URS1H site  URS1H pos  Distance
ZIP1  GATTCGGAAGTAAAA  -42  ==TCGGCGGCTAAAT  -22  20
MEI4  TCTTTCGGAGTCATA  -121  ==TGGGCGGCTAAAT  -98  23
DMC1  TTGTGTGGAGAGATA  -175  AAATAGCCGCCCA==  -143  32
SPO13  TAATTAGGAGTATAT  -119  AAATAGCCGCCGA==  -100  19
MER1  GGTTTTGTAGTTCTA  -152  TTTTAGCCGCCGA==  -115  37
SPO16  CATTGTGATGTATTT  -201  ==TGGGCGGCTAAAA  -90  111
REC104  CAATTTGGAGTAGGC  -182  ==TTGGCGGCTATTT  -93  89
RED1  ATTTCTGGAGATATC  -355  ==TCAGCGGCTAAAT  -167  188
REC114  GATTTTGTAGGAATA  -288  ==TGGGCGGCTAACT  -94  194
MEK1  TCATTTGTAGTTTAT  -233  ==ATGGCGGCTAAAT  -150  83
Motif  NNDTBNGDWGDNNDH   ==WBRGCSGCYVW==   [5,179]
We then searched for the structured motif in the upstream regions of all 5873 genes in the yeast genome. We used the -800 to -1 upstream regions, and truncated a segment if it overlapped an upstream open reading frame (ORF). This left 5794 sequences with an average length of 497 bases. Searching for the IUPAC pattern, we found 65 occurrences, including the 10 originally known sites, within 1 second. Searching for the profile, with 5 core positions in each simple motif, λc = 0.6 as the core score threshold, and λ = 0.8 as the total score threshold, we found 56 occurrences in 0.18 seconds. For each occurrence, we then extracted the actual sequence segment from the matching upstream region. Since the structured motif represented by IUPAC symbols may be too general, we calculated the Hamming distance between each matching segment and each of the 10 known binding sites. We then selected an occurrence as a possible new binding site if its minimum Hamming distance to any of the 10 known sites was within a given maximum threshold. Table 11 lists the 12 newly found occurrences in the upstream regions of the entire yeast genome using a Hamming distance threshold of 5. The sites discovered by both the pattern search and the profile search are listed. These new occurrences could be putative binding sites for the two transcription factors UASH and URS1H.
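To make the IUPAC pattern search concrete, the following Python sketch finds all occurrences of a two-component structured motif with a variable gap. It is a naive illustration built on regular expressions, not the SMOTIF algorithm, and the names `iupac_to_regex`, `starts`, and `find_structured` are hypothetical. The demonstration sequence is synthetic: the ZIP1 UASH site, five filler bases, and the 11-base URS1H core of ZIP1.

```python
import re

# Naive illustration (not the SMOTIF algorithm): locate every start of each
# simple motif, then keep the pairs whose gap lies in [lo, hi].

IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC string into a regex of character classes."""
    return "".join("[" + IUPAC_BASES[sym] + "]" for sym in motif)


def starts(motif, seq):
    """All start offsets of `motif` in `seq`, overlapping matches included."""
    # A zero-width lookahead lets finditer report overlapping matches.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq)]


def find_structured(seq, first, second, lo, hi):
    """Return (start_first, start_second) pairs with lo <= gap <= hi."""
    second_starts = starts(second, seq)
    hits = []
    for s1 in starts(first, seq):
        end1 = s1 + len(first)  # the gap begins right after the first motif
        for s2 in second_starts:
            if lo <= s2 - end1 <= hi:
                hits.append((s1, s2))
    return hits


# Synthetic example: UASH site + 5 filler bases + URS1H core.
demo = "GATTCGGAAGTAAAA" + "CCCCC" + "TCGGCGGCTAA"
print(find_structured(demo, "NNDTBNGDWGDNNDH", "WBRGCSGCYVW", 5, 179))
# [(0, 20)]
```

Pairing the two lists of start positions, rather than writing the gap as a single regex quantifier, reports every valid combination instead of one match per start position.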
Table 11  Potential Binding Sites
Gene  UASH site  UASH pos  URS1H site  URS1H pos  Hamming distance
MES1  GATTTTGAAGTAGGA  -438  TTAGCCGCCGA  -246  5:MER1
YJL045W  TTTTGTGAAGAGATA  -407  TTAGCCGCTCA  -273  4:DMC1
HSP60  GTTTTTGTAGGTATA  -329  ATAGCCGCCCA  -252  5:MER1
SPO1  ATTTTTGAAGTTAAC  -192  TCAGCGGCTAT  -90  5:RED1
MEK1  TCATTTGTAGTTTAT  -233  TCGGCGGCTAT  -136  3:MEK1
YIG1  ATTTCCGGAGTTTTC  -183  TCGGCGGCTAT  -140  5:RED1
†AGP1  CCTTTTGATGACTTT  -786  TCGGCGGCTAA  -699  5:SPO16
†AGP1  CCTTTTGATGACTTT  -786  TCGGCGGCTAA  -668  5:SPO16
†REC114  CATTTTGGTGGGTTC  -158  TGGGCGGCTAA  -94  5:SPO16
†GNT1  TCATTTGGAGAATAT  -340  ATAGCCGCCAT  -299  5:SPO13
‡MEK1  TTATATGCAGTATAT  -276  ATGGCGGCTAA  -150  4:MEK1
‡MMS1  AACTCTGTAGTTATA  -643  TGGGCGGCTAA  -497  5:REC114
For each occurrence we give the gene name corresponding to the upstream region, the sites and positions for UASH and URS1H, and the Hamming distance together with the closest known gene with the cooperative binding sites. For example, 5:SPO16 in the AGP1 rows means that the Hamming distance between the AGP1 occurrence and the SPO16 site is 5. †: found only by IUPAC pattern search, ‡: found only by profile search.
Upon further analysis, we found that the new occurrence in MEK1 (at positions -233, -136) is in fact also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this occurrence is not reported in SCPD.
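The distance:gene labels of Table 11 can be illustrated with a short Python sketch. It assumes that the Hamming distance is taken over the concatenation of the 15-base UASH site and the 11-base URS1H core (the columns of Table 10 that are not padded with "=="); this is an assumption on our part, although it does reproduce the 3:MEK1 and 4:DMC1 entries. The names `hamming` and `closest_known` are hypothetical.

```python
# Illustrative sketch: nearest known site by Hamming distance.
# Assumption: the distance is computed over UASH (15 bases) concatenated
# with the 11-base URS1H core, i.e. columns 3..13 of the padded alignment.

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Known sites from Table 10: gene -> (UASH site, padded URS1H alignment).
KNOWN = {
    "ZIP1": ("GATTCGGAAGTAAAA", "==TCGGCGGCTAAAT"),
    "MEI4": ("TCTTTCGGAGTCATA", "==TGGGCGGCTAAAT"),
    "DMC1": ("TTGTGTGGAGAGATA", "AAATAGCCGCCCA=="),
    "SPO13": ("TAATTAGGAGTATAT", "AAATAGCCGCCGA=="),
    "MER1": ("GGTTTTGTAGTTCTA", "TTTTAGCCGCCGA=="),
    "SPO16": ("CATTGTGATGTATTT", "==TGGGCGGCTAAAA"),
    "REC104": ("CAATTTGGAGTAGGC", "==TTGGCGGCTATTT"),
    "RED1": ("ATTTCTGGAGATATC", "==TCAGCGGCTAAAT"),
    "REC114": ("GATTTTGTAGGAATA", "==TGGGCGGCTAACT"),
    "MEK1": ("TCATTTGTAGTTTAT", "==ATGGCGGCTAAAT"),
}


def closest_known(uash, urs1h_core):
    """Return (distance, gene) of the nearest known composite site."""
    candidate = uash + urs1h_core
    return min(
        (hamming(candidate, k_uash + k_urs[2:13]), gene)
        for gene, (k_uash, k_urs) in KNOWN.items()
    )


print(closest_known("TCATTTGTAGTTTAT", "TCGGCGGCTAT"))  # (3, 'MEK1')
print(closest_known("TTTTGTGAAGAGATA", "TTAGCCGCTCA"))  # (4, 'DMC1')
```

An occurrence would then be kept as a candidate when the returned distance does not exceed the chosen threshold (5 in Table 11). Ties between genes are broken alphabetically here, which is an arbitrary choice of this sketch.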
To further analyze the remaining new occurrences, we consulted the SGD (Saccharomyces Genome Database) Gene Ontology Term Finder [29] to find the inter-relationships between the genes. The first three rows of Table 12 show the significant GO terms (biological process or molecular function) that are common to the gene corresponding to a new occurrence and its closest (known) gene. The rest of the table shows the significant terms among the 18 genes. These results indicate that at least some of the new occurrences (such as those in SPO1, HSP60, MES1, and GNT1) have the potential to be binding sites, since their genes share some significant processes with the genes of the known sites. Of these, SPO1 has the highest potential to contain a new binding site, since UASH and URS1H are known to be involved in early meiotic expression during sporulation [28]. Table 12 shows that SPO1 shares meiosis and M phase of meiotic cell cycle with the rest of the genes. After searching for SPO1 in the SGD database, we found that SPO1 is a transcriptional regulator involved in sporulation, and is required for middle and late meiotic expression. This increases our confidence that the SPO1 occurrence is a previously unknown binding site.
Table 12  Genes and Significant Gene Ontology (GO) Terms
Genes   Significant GO Terms   p-value
MES1, MER1  RNA metabolism  5.7e-3
HSP60, MER1  nucleic acid binding  7.9e-3
SPO1, RED1  M phase-meiotic cell cycle, meiotic cell cycle, meiosis, M phase  1.1e-3
‡ MMS1, REC114  DNA recombination, DNA metabolism  6.0e-3
MES1, REC114, † GNT1, MEK1, MEI4, DMC1, MER1, REC104  biopolymer metabolism, macromolecule metabolism  2.3e-4
REC114, SPO1, MEK1, ZIP1, MEI4, DMC1, SPO13, MER1, REC104, RED1, HOP1  meiosis, M phase of meiotic cell cycle, meiotic cell cycle, M phase, cell cycle  1.6e-14
HSP60, DMC1, RED1, HOP1  DNA binding  6.7e-7
HSP60, DMC1, RED1  structure-specific DNA binding  3.1e-8
HSP60, DMC1  single-stranded DNA binding  3.7e-6
MEI4, DMC1, REC104, REC114, ‡ MMS1  DNA recombination, DNA metabolism  2.8e-6
MEI4, DMC1, MER1, REC104, REC114, MEK1, MES1, ‡ MMS1  biopolymer metabolism, macromolecule metabolism  2.3e-4
†: found only by IUPAC pattern search, ‡: found only by profile search.
Finally, since we knew that in gene HOP1 the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We searched for the original and the reversed motifs using a Hamming distance threshold of 6. We found 34 new binding sites where UASH can appear either upstream or downstream from URS1H. Among these we found two potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201, -534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites.