PMC:1570465 / 41734-45094

    2_test

    {"project":"2_test","denotations":[{"id":"16916460-12520024-1687088","span":{"begin":190,"end":192},"obj":"12520024"},{"id":"16916460-8211139-1687089","span":{"begin":222,"end":224},"obj":"8211139"},{"id":"16916460-10743561-1687090","span":{"begin":255,"end":257},"obj":"10743561"},{"id":"16916460-8520488-1687091","span":{"begin":282,"end":284},"obj":"8520488"},{"id":"16916460-8211139-1687092","span":{"begin":1628,"end":1630},"obj":"8211139"},{"id":"16916460-10743561-1687093","span":{"begin":1631,"end":1633},"obj":"10743561"},{"id":"16916460-8520488-1687094","span":{"begin":1634,"end":1636},"obj":"8520488"},{"id":"16916460-8211139-1687095","span":{"begin":1747,"end":1749},"obj":"8211139"},{"id":"16916460-8211139-1687096","span":{"begin":1889,"end":1891},"obj":"8211139"},{"id":"16916460-10743561-1687097","span":{"begin":2730,"end":2732},"obj":"10743561"},{"id":"16916460-10743561-1687098","span":{"begin":3072,"end":3074},"obj":"10743561"},{"id":"16916460-10743561-1687099","span":{"begin":3355,"end":3357},"obj":"10743561"}],"text":"Protein motif finding\nWe study the performance of LP/DEE on a number of protein datasets with different characteristics (summarized in Table 1). The datasets are constructed from SwissProt [29], using the descriptions of [15] for the first two datasets, [36] for the next two, and [43] for the last one. These datasets are highly variable in the number and length of their protein sequences, as well as in the degree of motif conservation. The motif length parameters are set based on the lengths described by the above authors, and the BLOSUM62 substitution matrix is used for all reported results.\nTable 1 Descriptions of protein datasets. # Seq. gives the number of input protein sequences. Length gives the length of the protein motif searched for. |V| gives the number of vertices in the original graph constructed from the dataset. 
DEE gives the methods used to prune the graph, and are denoted by (1) clique-bounds DEE, (2) tighter constrained bounds and (3) graph decomposition. |VDEE| is the number of vertices in the graph after pruning. E-value lists the e-value of the motif found by the LP/DEE algorithm.\nDataset # Seq. Length |V| DEE |VDEE| E-value\nLipocalin 5 16 844 (1) 5 3.80 × 10-16\nHelix-Turn-Helix 30 20 6870 (1,2,3) 260 3.88 × 10-67\nTumor Necrosis Factor 10 17 2329 (1) 10 1.50 × 10-40\nZinc Metallopeptidase 10 12 7761 (1,2) 10 5.82 × 10-23\nImmunoglobulin Fold 18 10 7498 (1,2,3) 187 3.04 × 10-24 For each of the test protein datasets, our approach uncovers the optimal solution according to the SP-measure. These discovered motifs correspond to those reported by [15,36,43], and their SP-scores are highly significant, with e-values less than 10-15 for all of them. As described by [15], the HTH dataset is very diverse, and the detection of the motif is a difficult task. Nonetheless, our HTH motif is identical to that of [15], and agrees with the known annotations in every sequence. We likewise find the lipocalin motif; it is a weak motif with few generally conserved residues that is in perfect correspondence with the known lipocalin signature. We also precisely recover the immunoglobulin fold, TNF and zinc metallopeptidase motifs. The protein datasets demonstrate the strength of our graph pruning techniques. The five datasets are of varying difficulty to solve, with some employing the basic clique-bounds DEE technique to prune the graphs, while others requiring more elaborate pruning that is constrained by three-way alignments (see Table 1). In each case, the size of the reduced graph is at least an order of magnitude smaller. For three of the five datasets, the pruning procedures alone are able to identify the underlying motifs.\nIn contrast to [36], who limit sequence lengths to 500, we retain the original protein sequences, making the problem more difficult computationally. 
For example, the average sequence length in the zinc metallopeptidase dataset is approximately 800, and some sequences are as long as 1300 residues. The motif we recover is identical to the motif reported by [36] in nine of ten sequences (see Additional Table 1); yet, with the difference in the last sequence, the motif discovered by our method is superior both in terms of sequence conservation and statistical significance (with an e-value of 5.7729 × 10-23 for us vs 1.12155 × 10-21 for [36])."}
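
The denotations in the record above point into the "text" field by character offset. The following minimal sketch shows how such a span could be resolved back to the string it covers. It assumes that offsets count characters from zero and that "end" is exclusive; this matches the first two denotations, which land exactly on the citation markers 29 and 15. The excerpt of "text" is truncated after the second marker for brevity, the helper name resolve_spans is hypothetical, and reading "obj" as an identifier of the cited reference is an assumption rather than something the record states.

```python
# Excerpt of the "text" field above, truncated after the second citation marker.
text = (
    "Protein motif finding\n"
    "We study the performance of LP/DEE on a number of protein datasets "
    "with different characteristics (summarized in Table 1). "
    "The datasets are constructed from SwissProt [29], "
    "using the descriptions of [15]"
)

# First two denotations, copied verbatim from the record above.
denotations = [
    {"id": "16916460-12520024-1687088", "span": {"begin": 190, "end": 192}, "obj": "12520024"},
    {"id": "16916460-8211139-1687089", "span": {"begin": 222, "end": 224}, "obj": "8211139"},
]


def resolve_spans(text, denotations):
    """Return (covered text, obj) pairs, treating each span as half-open [begin, end)."""
    return [
        (text[d["span"]["begin"]:d["span"]["end"]], d["obj"])
        for d in denotations
    ]


resolved = resolve_spans(text, denotations)
for covered, obj in resolved:
    print(f"citation marker {covered!r} -> obj {obj}")
```

Because the spans are plain offsets, any change to the length of "text" ahead of a span would require shifting that span's "begin" and "end" by the same amount.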