PMC:1570465 / 41367-65335
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"16916460-12520024-1687088","span":{"begin":557,"end":559},"obj":"12520024"},{"id":"16916460-8211139-1687089","span":{"begin":589,"end":591},"obj":"8211139"},{"id":"16916460-10743561-1687090","span":{"begin":622,"end":624},"obj":"10743561"},{"id":"16916460-8520488-1687091","span":{"begin":649,"end":651},"obj":"8520488"},{"id":"16916460-8211139-1687092","span":{"begin":1995,"end":1997},"obj":"8211139"},{"id":"16916460-10743561-1687093","span":{"begin":1998,"end":2000},"obj":"10743561"},{"id":"16916460-8520488-1687094","span":{"begin":2001,"end":2003},"obj":"8520488"},{"id":"16916460-8211139-1687095","span":{"begin":2114,"end":2116},"obj":"8211139"},{"id":"16916460-8211139-1687096","span":{"begin":2256,"end":2258},"obj":"8211139"},{"id":"16916460-10743561-1687097","span":{"begin":3097,"end":3099},"obj":"10743561"},{"id":"16916460-10743561-1687098","span":{"begin":3439,"end":3441},"obj":"10743561"},{"id":"16916460-10743561-1687099","span":{"begin":3722,"end":3724},"obj":"10743561"},{"id":"16916460-10854408-1687100","span":{"begin":3892,"end":3893},"obj":"10854408"},{"id":"16916460-9813115-1687101","span":{"begin":3894,"end":3896},"obj":"9813115"},{"id":"16916460-15297295-1687102","span":{"begin":3916,"end":3918},"obj":"15297295"},{"id":"16916460-9813115-1687103","span":{"begin":4284,"end":4286},"obj":"9813115"},{"id":"16916460-15637633-1687104","span":{"begin":7532,"end":7533},"obj":"15637633"},{"id":"16916460-10977088-1687105","span":{"begin":7534,"end":7536},"obj":"10977088"},{"id":"16916460-15637633-1687106","span":{"begin":8165,"end":8166},"obj":"15637633"},{"id":"16916460-12824370-1687107","span":{"begin":8623,"end":8625},"obj":"12824370"},{"id":"16916460-12015879-1687108","span":{"begin":8644,"end":8646},"obj":"12015879"},{"id":"16916460-15637633-1687109","span":{"begin":8880,"end":8881},"obj":"15637633"},{"id":"16916460-15215380-1687110","span":{"begin":8936,"end":8938},"obj":"15215380"},{"id":"16916460-15637633-1687111","span":{"begin":10563,"end":10564},"obj":"15637633"},{"id":"16916460-3525846-1687112","span":{"begin":11102,"end":11104},"obj":"3525846"},{"id":"16916460-11997340-1687113","span":{"begin":20327,"end":20328},"obj":"11997340"},{"id":"16916460-11997340-1687114","span":{"begin":20708,"end":20709},"obj":"11997340"},{"id":"16916460-8594589-1687115","span":{"begin":21208,"end":21210},"obj":"8594589"},{"id":"16916460-11997340-1687116","span":{"begin":21286,"end":21287},"obj":"11997340"},{"id":"16916460-11997340-1687117","span":{"begin":21323,"end":21324},"obj":"11997340"},{"id":"16916460-15961477-1687118","span":{"begin":21889,"end":21891},"obj":"15961477"},{"id":"16916460-11997340-1687119","span":{"begin":23547,"end":23548},"obj":"11997340"}],"text":"Experimental results\nWe apply our LP/DEE approach to several motif finding problems. We attempt to discover motifs in instances arising from both DNA and protein sequence data, and compare them with known motifs and those found by other motif finding methods. We then consider the phylogenetic footprinting problem, and demonstrate the discovery of multiple motifs.\n\nProtein motif finding\nWe study the performance of LP/DEE on a number of protein datasets with different characteristics (summarized in Table 1). The datasets are constructed from SwissProt [29], using the descriptions of [15] for the first two datasets, [36] for the next two, and [43] for the last one. These datasets are highly variable in the number and length of their protein sequences, as well as in the degree of motif conservation. The motif length parameters are set based on the lengths described by the above authors, and the BLOSUM62 substitution matrix is used for all reported results.\nTable 1 Descriptions of protein datasets. # Seq. gives the number of input protein sequences. Length gives the length of the protein motif searched for. |V| gives the number of vertices in the original graph constructed from the dataset. DEE gives the methods used to prune the graph, and are denoted by (1) clique-bounds DEE, (2) tighter constrained bounds and (3) graph decomposition. |VDEE| is the number of vertices in the graph after pruning. E-value lists the e-value of the motif found by the LP/DEE algorithm.\nDataset # Seq. Length |V| DEE |VDEE| E-value\nLipocalin 5 16 844 (1) 5 3.80 × 10-16\nHelix-Turn-Helix 30 20 6870 (1,2,3) 260 3.88 × 10-67\nTumor Necrosis Factor 10 17 2329 (1) 10 1.50 × 10-40\nZinc Metallopeptidase 10 12 7761 (1,2) 10 5.82 × 10-23\nImmunoglobulin Fold 18 10 7498 (1,2,3) 187 3.04 × 10-24 For each of the test protein datasets, our approach uncovers the optimal solution according to the SP-measure. These discovered motifs correspond to those reported by [15,36,43], and their SP-scores are highly significant, with e-values less than 10-15 for all of them. As described by [15], the HTH dataset is very diverse, and the detection of the motif is a difficult task. Nonetheless, our HTH motif is identical to that of [15], and agrees with the known annotations in every sequence. We likewise find the lipocalin motif; it is a weak motif with few generally conserved residues that is in perfect correspondence with the known lipocalin signature. We also precisely recover the immunoglobulin fold, TNF and zinc metallopeptidase motifs. The protein datasets demonstrate the strength of our graph pruning techniques. The five datasets are of varying difficulty to solve, with some employing the basic clique-bounds DEE technique to prune the graphs, while others requiring more elaborate pruning that is constrained by three-way alignments (see Table 1). In each case, the size of the reduced graph is at least an order of magnitude smaller. For three of the five datasets, the pruning procedures alone are able to identify the underlying motifs.\nIn contrast to [36], who limit sequence lengths to 500, we retain the original protein sequences, making the problem more difficult computationally. For example, the average sequence length in the zinc metallopeptidase dataset is approximately 800, and some sequences are as long as 1300 residues. The motif we recover is identical to the motif reported by [36] in nine of ten sequences (see Additional Table 1); yet, with the difference in the last sequence, the motif discovered by our method is superior both in terms of sequence conservation and statistical significance (with an e-value of 5.7729 × 10-23 for us vs 1.12155 × 10-21 for [36]).\n\nDetecting bacterial regulatory elements\nWe apply our method to identify the binding sites of 36 E.coli regulatory proteins. We construct our dataset from that of [6,28], as described in [32]. For each binding site, we locate it within the genome and extract up to 600 bp of DNA sequence upstream from the gene it regulates. We remove binding sites for sigma factors, binding sites for transcription factors with fewer than three known sites, and those that could not be unambiguously located in the genome. Motif length parameters are set as reported by [28], except for crp, where a length of 18 instead of 22 is used. Background nucleotide frequencies are computed using the upstream regions for each dataset individually. The final dataset consists of 36 transcription factors, each regulating between 3 and 33 genes, with binding site length ranging between 11 and 48 (see Table 2).\nTable 2 Listing of the transcription factors' datasets (columns 1, 2, and 3) and the results of motif finding by LP/DEE. TF is the transcription factor dataset. Seq is the number of input sequences. Len is the length of the motif searched for. The rest of the listed measures refer to the motifs discovered by the LP/DEE algorithm: IC is the average per-column information content [44]; RE is the average per-column relative entropy; E-value is the e-value, computed according to our statistical significance assessment; nPC is the nucleotide level performance coefficient; and sSn is the site level sensitivity. The four starred entries indicate potentially non-optimal solutions; entries marked with † indicated usage of the ILP solver.\nTF Seq Len IC RE E-value nPC sSn\nada 3 31 1.3000 1.0846 9.16 × 10-1 0.1341 0.33\naraC 4 48 1.1437 0.9940 1.15 × 10-3 0.3474 0.50\narcA 11 15 1.2505 1.1992 4.31 × 10-6 0.4224 0.73\nargR 8 18 1.2990 1.2149 1.30 × 10-7 0.2857 0.50\ncpxR 7 15 1.3290 1.2337 1.09 × 10-5 0.5556 0.71\ncrp*† 33 18 0.7196 0.7045 3.08 × 10-9 0.5570 0.76\ncytR 5 18 1.2317 1.1069 2.48 × 10-1 0.0588 0.20\ndnaA 6 15 1.4535 1.3300 6.12 × 10-6 1.0000 1.00\nfadR 5 17 1.3466 1.2074 1.33 × 10-2 0.5455 0.80\nfis* 8 35 0.8927 0.8376 1.37 × 10-6 0.1966 0.38\nflhCD 3 31 1.3942 1.1656 4.79 × 10-3 0.0000 0.00\nfnr 10 22 1.1025 1.0476 1.85 × 10-9 0.6176 0.80\nfruR 10 16 1.2094 1.1491 5.52 × 10-8 0.8182 0.90\nfur 7 18 1.3285 1.2332 1.28 × 10-8 0.4237 0.71\ngalR 7 16 1.5445 1.4347 1.52 × 10-16 0.5034 0.71\nglpR 4 20 1.4227 1.2441 2.63 × 10-2 0.5534 0.75\nhns 5 11 1.5175 1.3660 2.25 0.0000 0.00\nihf* 19 48 0.3932 0.3859 2.26 × 10+8 0.0381 0.16\nlexA 17 20 1.1481 1.1192 1.01 × 10-40 0.7215 0.88\nlrp 4 25 1.2879 1.1237 6.44 × 10-2 0.0989 0.25\nmalT 6 10 1.5071 1.3815 1.73 × 10-1 0.0000 0.00\nmetJ 5 16 1.6842 1.5195 3.37 × 10-12 0.6495 1.00\nmetR 6 15 1.3097 1.1970 6.57 × 10-2 0.0000 0.00\nmodE 3 24 1.5618 1.3145 3.95 × 10-4 1.0000 1.00\nnagC 5 23 1.2795 1.1462 1.03 × 10-3 0.0360 0.20\nnarL 10 16 1.1391 1.0828 8.06 × 10-4 0.8182 0.90\nnarP 4 16 1.4534 1.2737 7.48 × 10-4 0.0000 0.00\nntrC 4 17 1.6621 1.4605 1.28 × 10-8 0.6386 1.00\nompR 4 20 1.3566 1.1860 4.27 × 10-6 0.0000 0.00\noxyR 4 39 1.0965 0.9521 2.64 0.0796 0.25\nphoB 8 22 1.1567 1.0835 4.14 × 10-9 0.8051 1.00\npurR 20 26 0.8305 0.8147 1.53 × 10-37 0.7247 0.95\nsoxS*† 11 35 0.7771 0.7453 1.26 × 10-9 0.0815 0.27\ntrpR 4 24 1.4069 1.2291 3.74 × 10-6 0.8462 1.00\ntus 5 23 1.5839 1.4276 1.05 × 10-17 0.8400 1.00\ntyrR 10 22 1.0693 1.0159 3.63 × 10-9 0.5017 0.70 We evaluate the overlap between motif predictions made by our approach and the known motifs using the nucleotide level performance coefficient (nPC) [1,17]. Let nTP, nFP, nTN, nFN refer to nucleotide level true positives, false positives, true negatives and false negatives respectively. For example, nTP is the number of nucleotides in common between the known and predicted motifs. The nPC is defined as nTP/(nTP + nFN + nFP); it is a stringent statistic, penalizing a method for both failing to identify any nucleotide belonging to the motif as well as falsely predicting any nucleotide not belonging to the motif. Though nPC takes both false positives and false negatives at the nucleotide level into account, we also find it useful to consider site level statistics. Following [1], we consider two sites to be overlapping if they overlap by at least one-quarter the length of the site. Defining site level statistics similarly to the nucleotide level statistics above (e.g., site level true positives, sTP, is the number of known sites overlapped by predicted sites), site level sensitivity sSn is sTP/(sTP + sFN).\n\nMotif finding in genomic data\nWe compare the performance of our method to three others, MEME [18], Gibbs Motif Sampler [39], and Projection [22]. We choose these particular methods as they are widely-used, readily accessibly for download via the internet, and can handle the lengths of motifs (11–48 bps) in our dataset. While a recent performance evaluation of motif finders [1] indicates that combinatorial methods such as Weeder [24] have somewhat better performance than other methods, most enumerative combinatorial methods, Weeder included, are not able to handle the lengths of the motifs in our bacterial dataset.\nTwo of the methods we compare against, MEME and Gibbs Motif Sampler, are stochastic-search based algorithms. We run them requiring one motif instance per sequence and limiting the search to the primary sequence strand only, while leaving other parameters at their defaults. Gibbs Motif Sampler is run with 100 random restarts to allow for sufficient sampling of the search space, and MEME is allowed to execute its own algorithm to search the dataset for good starting points for EM. Note that Gibbs Motif Sampler failed to execute on three largest datasets, crp, ihf and purr when run on our local linux machines; these datasets were submitted through the web server.\nProjection method has a combinatorial component, combining the idea of locality-sensitive hashing with postprocessing by MEME, and unlike most enumerative combinatorial methods, Projection is not limited by motif length. Since many of the motifs in our dataset are not well conserved, we set the d parameter, assessing the average number of mismatches per motif instance, to it maximum suggested value of 0.25 of the motif length. Length parameters of the known motif are used for each dataset.\nWe detail the performance of our algorithm on the full set of binding sites for 36 E. coli transcription factors in Table 2. Considering a motif to have been correctly identified if at least half of its sites were found with at least 25% overlap with the known site (as in [1], and essentially corresponding to datasets with sSn of at least 0.5), our method accurately discovers motifs for 22 transcription factors. Setting an e-value threshold at 1.0 (a lower threshold causes other methods to identify too few datasets as having significant motifs), we find statistically significant motifs for 33 datasets. Of the three transcription factor datasets with no significant solutions, one, hns, is a short motif, and the other two, oxyr and ihf, are very poorly conserved motifs with low information content (IC) [44], such that the average per-column IC for the known oxyr and ihf motifs are 0.89 and 0.36, respectively, whereas these values are 0.95 and 0.39 for their discovered motifs. In general, as compared to the motifs corresponding to the actual known transcription factor binding sites, the motifs found by our method exhibit equal or higher IC, measuring motif conservation in isolation from background sequence, as well as higher relative entropy, measuring the difference with the background distribution (Table 2).\nMEME reports motifs for all 36 transcription factor datasets, with e-values less than 1.0 for 20 of them. Gibbs Motif Sampler discovers motifs for 34 of the transcription factors, with 15 of them considered significant via their positive logMAP scores (no motifs are found for araC and flhCD). Projection reports motifs with no significance assessment for all 36 transcription factor datasets. The LP/DEE approach described in this paper has the best overall performance. Taking significance assessment into account, and considering all datasets with no significant motifs to have zero sSn and nPC values, our method produces 0.554 averaged sSn and 0.411 nPC values. Indeed, only two datasets, oxyr and ihf, have motifs that are deemed insignificant using our scheme yet have non-zero overlap with the actual motifs. Performance statistics for MEME and Gibbs Motif Sampler are considerably lower with the averaged sSn of 0.338 and nPC of 0.257 for Gibbs, and corresponding sSn and nPC values of 0.382 and 0.285 for MEME. Since Projection does not report significance values, we also note averages of raw coefficients for overlap with the known motifs while ignoring significance assessments. Our method still outperforms the others, though not as significantly, with sSn and nPC values of 0.565 and 0.414 for LP/DEE; 0.550 and 0.402 for Gibbs; 0.501 and 0.358 for MEME; 0.560 and 0.407 for Projection.\nWe also show sSn and nPC values while ignoring significance for each of the three other methods compared to LP/DEE in Figure 1, only displaying transcription factor datasets for which a difference in performance is observed. Each bar in the chart measures the difference in sSn (Figures 1(a)–1(c)) or nPC (Figures 1(d)–1(f)) between our method and one of the other methods. Using both the sSn and the nPC statistics, LP/DEE performs better than any of the three other approaches in identifying known binding sites. For example, for LP/DEE versus MEME, very large differences are observed for three transcription factors, with our method identifying narL, glpR, and ntrC motifs almost completely, and MEME entirely misidentifying them. Moreover, the LP/DEE method exhibits better performance on more transcription factor datasets than the other methods. For example, considering nPC, LP/DEE performs better than MEME on eleven datasets, and worse than it on six datasets (Figure 1(e)). Differences in performance with Gibbs Motif Sampler and Projection are smaller; for instance, the LP/DEE method exhibits better performance than Projection using the sSn statistic on six datasets versus worse than it on two datasets (Figure 1(c)). We note that if significance assessments are included and motifs with e-value greater than 1.0 are discarded (see Additional Figures 1(a) and 1(b)), then LP/DEE has better nPC than MEME on 16 datasets, and worse nPC on three datasets, suggesting that MEME's significance computation is unnecessarily conservative for our dataset; the same applies to Gibbs Motif Sampler as well.\nFigure 1 Performance comparison for the LP/DEE method with Gibbs Motif Sampler, MEME and Projection. Performance comparison of the LP/DEE method with Gibbs Motif Sampler, MEME, and Projection when identifying E. coli regulatory sites. Performance is given in terms of the site level sensitivity 1(a)-1(c) and nucleotide performance coefficient 1(d)-1(f). Significance assessment is disregarded. For every transcription factor dataset, the height of the bar indicates the difference in the metric, with bars above zero specifying better performance for LP/DEE and bars below zero otherwise. Plotted are only those datasets for which there is a difference in performance between the pair of methods being compared. Our approach finds provably optimal solutions for 32 of the 36 datasets. Notwithstanding, our method also exhibits excellent runtimes for most problems. Of the 36 transcription factors we considered, 25 were solved in seconds with the application of clique-bounds DEE, some using tighter bounds constrained by three-way alignments. Seven required the application of graph decomposition with tighter clique-bounds DEE, and took a few minutes to three hours to solve. For the remaining four datasets, we used more computationally intensive speculative pruning. Interestingly, we found highly significant solutions for three of these datasets, albeit without the guarantee of optimality, and no significant solution for one. For each of them we provide a bound on the significance value of a potential optimal solution according to the method detailed in the section above. The e-values of the obtained motif and the lower bound in parentheses are: crp, 3.08 × 10-9 (6.04 × 10-33); fis, 1.37 × 10-6 (2.29 × 10-7); soxS, 1.26 × 10-9 (5.25 × 10-14); and ihf, 2.26 × 10+8 (3.98 × 10-31). Finally, in the entire data collection, all but two of the problems resulted in integral solutions to their LPs. The remaining instances with fractional solutions were easily solved by the ILP solver.\n\nSimulated data\nWe also evaluate the effectiveness of our scoring scheme in finding binding sites for five regulatory proteins when they are embedded in simulated data. Our goals are twofold. First, since our underlying scoring measure is based on counting matches between nucleotides, it is important to see how well it performs in compositionally biased backgrounds. In the E. coli dataset, even a simple scoring scheme that assigns a score of 1 to matches and 0 to mismatches performs competitively (data not shown). However, since other genomes can have considerably more biased nucleotide compositions, our scoring scheme rewards matches between more rare nucleotides, and we test here how it performs in different scenarios. Second, while it is essential to test the performance of motif finding algorithms on genomic data (as above), it is possible that there are other conserved motifs in the data, besides those with which we are evaluating performance, and these conserved motifs lead to lower nPC and sSn measurements. Simulated data is not expected to have other conserved motifs, and thus provides a cleaner, though perhaps optimistic, means for testing motif finding approaches.\nIn our testing on simulated data, we use a selection of five transcription factor datasets with motifs of varying levels of conservation, as measured by their IC (Table 3). We generate background sequences with uniform nucleotide distributions, as well as those with increasingly biased probability distributions. A background sequence for a particular binding site is generated of length equal to that of its upstream region (up to 600 bps). In particular, for each position, a base is selected at random according to a probability distribution in which base G is chosen with some probability pr(G) and the other bases with probability (1 - pr(G))/3 each.\nTable 3 Scoring method evaluation in terms of performance coefficient in biased-composition simulated data. Performance of LP/DEE in biased-composition simulated data. The first column identifies the TF dataset. The second column measures the degree of conservation of the known motif, as measured by average per-column information content [44]. The rest of the columns list the nucleotide performance coefficient of the LP/DEE method with the probability of base G indicated in the column heading and the frequencies of all other bases split equally.\nTF IC Bias\n0.25 0.5 0.75 0.9\naraC 1.00041 0.8113 0.9592 0.9592 0.9592\ncpxR 1.17034 1.0000 0.8261 0.9811 0.9811\ndnaA 1.45351 1.0000 0.7647 0.7647 1.0000\ngalR 1.34756 0.8824 0.8824 1.0000 1.0000\nnarP 1.40273 1.0000 1.0000 1.0000 1.0000 Our nPC performance is summarized in Table 3 for various background distributions. We find motifs of very high nPC values in varying biased nucleotide composition, attesting to the fact that our scoring scheme is successfully able to correct for bias in sequence composition. Moreover, as expected, performance on simulated data is better than that for actual genomic sequence. In the narP dataset, for example, the motif is found perfectly in simulated data and not at all in real genomic data. Additionally, an alternate highly conserved site is found by all four methods in genomic data (Table 2), suggesting that while the narP site is well-conserved, the corresponding genomic sequences contain another shared motif of higher conservation.\n\nPhylogenetic footprinting\nWe also apply our approach to identify motifs among sets of upstream regions of orthologous genes in a number of genomes. Here, the relationships between genomes is incorporated via weighting of the components of the SP-score. The eight datasets come from [9]. All datasets contain vertebrate sequences; some (Interleukin-3 and Insulin datasets) consist of only mammalian genomes, while others contain members from more diverse animal phyla. The number of sequences in the datasets ranges between 4 and 16, and most sequences are shorter than 1000 residues in length.\nWe use the phylogenetic trees (topology and branch lengths) given in [9] to derive the pairwise weights, and use the motif lengths provided. For each of the eight datasets, our approach identifies the optimal motif using the SP scoring measure (Table 4). The consensus sequences for the discovered motifs are listed in Table 4 along with the description of their DNA regions. (The motif reported for the c-fos promoter dataset was discovered second, after having discarded the poly-A repeat region.) All the motifs we find have been documented in the TRANSFAC database [45], and the majority of them correspond to those that have been reported by [9]. Two motifs differ from those of [9]: the first, a c-fos motif, shares its consensus sequence with a known c-fos regulatory element, the binding site of the serum response factor (SRF) protein (accession number R02246). The second, a c-myc motif, also corresponds to a known c-myc binding site in the P1 promoter (accession number R04621). The e-values of the found motifs range from 10-18 to 10-5. We note that though the notion of significance according to our method merely rejects the hypothesis that all the motif instances are unrelated, and a scheme that takes phylogeny into account such as [46] may be better suited for this problem in general, our significance evaluation attests to the presence of a highly conserved motif instance in every input sequence.\nTable 4 Motifs identified with use of phylogenetic information. Listing of motifs and details of their host sequences for phylogenetic motif finding. All datasets tested are from [9]. DNA region details the DNA regions considered (PR signifies promoter region). # Seq. gives the number of input sequences. Motif (id) identifies the consensus sequence of the discovered motif and its correspondence with the motifs of [9] where applicable. All listed motifs have been documented as regulatory elements in TRANSFAC [45]. For datasets other than the insulin dataset, only the best motif is reported and for the insulin dataset multiple motifs are reported in order of discovery.\nDNA region # Seq. Motif (id)\nGrowth-horm. 5' UTR + PR (380 bp) 16 TATAAAAA (7)\nHistone H1 5' UTR + PR (650 bp) 4 AAACAAAAGT (2)\nC-fos 5' UTR + PR (800 bp) 6 CCATATTAGG\nC-fos first intron (376 to 758 bp) 7 AGGGATATTT (3)\nInterleukin-3 5' UTR + PR (490 bp) 6 TGGAGGTTCC (3)\nC-myc second intron (971 to 1376 bp) 6 TTTGCAGCTA (5)\nC-myc 5' PR (1000 bp) 7 GCCCCTCCCG\nInsulin family 5' PR (500 bp) 8 GCCATCTGCC (2)TAAGACTCTA (1)CTATAAAGCC (3)CAGGGAAATG (4) This dataset is also an excellent testing ground for finding distinct multiple motifs using our method. We iteratively identify motifs and remove their corresponding vertices from the constructed graphs. As proof of principle, we find multiple motifs for the insulin dataset. In this case, we successfully identify all four motifs reported by [9]. Since our objective function differs from theirs and we require motif occurrences in every input sequence, we recover the motifs in a different order. Of course, we identify numerous shifts of these motifs in successive iterations. In practice, therefore, it may be more beneficial to remove a number of vertices corresponding to subsequences overlapping the optimal solution before attempting to find the next motif."}