PMC:1570465 / 61411-65335
Annotations
2_test
{"project":"2_test","denotations":[{"id":"16916460-11997340-1687113","span":{"begin":284,"end":285},"obj":"11997340"},{"id":"16916460-11997340-1687114","span":{"begin":665,"end":666},"obj":"11997340"},{"id":"16916460-8594589-1687115","span":{"begin":1165,"end":1167},"obj":"8594589"},{"id":"16916460-11997340-1687116","span":{"begin":1243,"end":1244},"obj":"11997340"},{"id":"16916460-11997340-1687117","span":{"begin":1280,"end":1281},"obj":"11997340"},{"id":"16916460-15961477-1687118","span":{"begin":1846,"end":1848},"obj":"15961477"},{"id":"16916460-11997340-1687119","span":{"begin":3504,"end":3505},"obj":"11997340"}],"text":"Phylogenetic footprinting\nWe also apply our approach to identify motifs among sets of upstream regions of orthologous genes in a number of genomes. Here, the relationships between genomes are incorporated via weighting of the components of the SP-score. The eight datasets come from [9]. All datasets contain vertebrate sequences; some (Interleukin-3 and Insulin datasets) consist of only mammalian genomes, while others contain members from more diverse animal phyla. The number of sequences in the datasets ranges between 4 and 16, and most sequences are shorter than 1000 residues in length.\nWe use the phylogenetic trees (topology and branch lengths) given in [9] to derive the pairwise weights, and use the motif lengths provided. For each of the eight datasets, our approach identifies the optimal motif using the SP scoring measure (Table 4). The consensus sequences for the discovered motifs are listed in Table 4 along with the description of their DNA regions. (The motif reported for the c-fos promoter dataset was discovered second, after having discarded the poly-A repeat region.) All the motifs we find have been documented in the TRANSFAC database [45], and the majority of them correspond to those that have been reported by [9]. 
Two motifs differ from those of [9]: the first, a c-fos motif, shares its consensus sequence with a known c-fos regulatory element, the binding site of the serum response factor (SRF) protein (accession number R02246). The second, a c-myc motif, also corresponds to a known c-myc binding site in the P1 promoter (accession number R04621). The e-values of the found motifs range from 10⁻¹⁸ to 10⁻⁵. We note that though the notion of significance according to our method merely rejects the hypothesis that all the motif instances are unrelated, and a scheme that takes phylogeny into account such as [46] may be better suited for this problem in general, our significance evaluation attests to the presence of a highly conserved motif instance in every input sequence.\nTable 4 Motifs identified with use of phylogenetic information. Listing of motifs and details of their host sequences for phylogenetic motif finding. All datasets tested are from [9]. DNA region details the DNA regions considered (PR signifies promoter region). # Seq. gives the number of input sequences. Motif (id) identifies the consensus sequence of the discovered motif and its correspondence with the motifs of [9] where applicable. All listed motifs have been documented as regulatory elements in TRANSFAC [45]. For datasets other than the insulin dataset, only the best motif is reported and for the insulin dataset multiple motifs are reported in order of discovery.\nDNA region # Seq. Motif (id)\nGrowth-horm. 
5' UTR + PR (380 bp) 16 TATAAAAA (7)\nHistone H1 5' UTR + PR (650 bp) 4 AAACAAAAGT (2)\nC-fos 5' UTR + PR (800 bp) 6 CCATATTAGG\nC-fos first intron (376 to 758 bp) 7 AGGGATATTT (3)\nInterleukin-3 5' UTR + PR (490 bp) 6 TGGAGGTTCC (3)\nC-myc second intron (971 to 1376 bp) 6 TTTGCAGCTA (5)\nC-myc 5' PR (1000 bp) 7 GCCCCTCCCG\nInsulin family 5' PR (500 bp) 8 GCCATCTGCC (2)TAAGACTCTA (1)CTATAAAGCC (3)CAGGGAAATG (4) This dataset is also an excellent testing ground for finding distinct multiple motifs using our method. We iteratively identify motifs and remove their corresponding vertices from the constructed graphs. As proof of principle, we find multiple motifs for the insulin dataset. In this case, we successfully identify all four motifs reported by [9]. Since our objective function differs from theirs and we require motif occurrences in every input sequence, we recover the motifs in a different order. Of course, we identify numerous shifts of these motifs in successive iterations. In practice, therefore, it may be more beneficial to remove a number of vertices corresponding to subsequences overlapping the optimal solution before attempting to find the next motif."}
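
The passage above describes incorporating genome relationships by weighting the components of the SP-score. A minimal sketch of such a weighted sum-of-pairs score is given below. It assumes one candidate motif instance per sequence, a simple position-match count as the pairwise similarity, and a dictionary of pairwise weights. The function names, the example instances, and the weight values are all hypothetical, and the text does not specify how the weights are derived from the trees of [9], so this should not be read as the authors' implementation.

```python
from itertools import combinations


def matches(a, b):
    """Count positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("motif instances must have equal length")
    return sum(x == y for x, y in zip(a, b))


def weighted_sp_score(instances, weights):
    """Weighted sum-of-pairs score.

    instances: one motif instance (string) per input sequence.
    weights:   dict mapping a pair (i, j) with i < j to a pairwise weight
               (hypothetical values here; in the text the weights come
               from a phylogenetic tree).
    """
    return sum(
        weights[(i, j)] * matches(instances[i], instances[j])
        for i, j in combinations(range(len(instances)), 2)
    )


# Illustrative data only: three made-up instances and made-up weights.
# The closely related pair (0, 1) is down-weighted relative to the others.
instances = ["TATAAAAA", "TATAAAAA", "TATATAAA"]
weights = {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 1.0}
score = weighted_sp_score(instances, weights)
```

In this toy input, the down-weighted pair contributes less to the total even though its two instances are identical, which is the intended effect of weighting by phylogenetic relatedness.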