Sequence segmentation The SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searches each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except for the last one). Handling each smaller segment Si (i ∈ [l, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition Si, with i ∈ [l, p - 1], include the first L - 1 symbols from partition Si+1. Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since it would be reported when we process segment Si+1. For example, let S be the sequence in Table 6, and let the structured motif be ℳ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFZestaaa@3790@ = GC[1,2]T with maximum span L = 5. If p = 3, then we would have three segments of length 6 each. After adding the overlap region of L - 1 = 4 positions at the end of each segment, we obtain the final three segments shown in Table 6. Two start positions of ℳ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFZestaaa@3790@ would be found in S1 (namely 1 and 5), and one in S2 (namely 11). Note that start positions 5 and 11 would have been missed if we had no overlap. So far we have assumed that we are searching for the structured motif in a single sequence. SMOTIF can easily handle a collection S MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFse=uaaa@3845@ of sequences. We simply search each sequence separately using segmentation when necessary. Table 6 Segmentation into p = 3 S G C A T G C G T T A G C A T C A T C S 1 G C A T G C G T T A S 2 G T T A G C A T C A S 3 A T C A T C