PMC:1679804 / 67531-70334 JSONTXT

{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1679804","sourcedb":"PMC","sourceid":"1679804","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1679804","text":"Sequence segmentation\nThe SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searching each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except possibly the last one). Handling each smaller segment Si (i ∈ [1, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition Si, with i ∈ [1, p - 1], include the first L - 1 symbols from partition Si+1. Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since they would be reported when we process segment Si+1. For example, let S be the sequence in Table 6, and let the structured motif be ℳ = GC[1,2]T with maximum span L = 5. If p = 3, then we would have three segments of length 6 each. After adding the overlap region of L - 1 = 4 positions at the end of each segment except the last, we obtain the final three segments shown in Table 6. 
Two start positions of ℳ would be found in S1 (namely 1 and 5), and one in S2 (namely 11). Note that start positions 5 and 11 would have been missed if we had no overlap.\nSo far we have assumed that we are searching for the structured motif in a single sequence. SMOTIF can easily handle a collection 𝒮 of sequences. We simply search each sequence separately using segmentation when necessary.\nTable 6 Segmentation into p = 3\nS G C A T G C G T T A G C A T C A T C\nS 1 G C A T G C G T T A\nS 2 G T T A G C A T C A\nS 3 A T C A T 
C","divisions":[{"label":"title","span":{"begin":0,"end":21}},{"label":"p","span":{"begin":22,"end":2077}},{"label":"p","span":{"begin":2078,"end":2602}},{"label":"label","span":{"begin":2603,"end":2610}},{"label":"caption","span":{"begin":2612,"end":2635}},{"label":"p","span":{"begin":2612,"end":2635}},{"label":"tr","span":{"begin":2636,"end":2692}},{"label":"td","span":{"begin":2636,"end":2638}},{"label":"td","span":{"begin":2640,"end":2641}},{"label":"td","span":{"begin":2643,"end":2644}},{"label":"td","span":{"begin":2646,"end":2647}},{"label":"td","span":{"begin":2649,"end":2650}},{"label":"td","span":{"begin":2652,"end":2653}},{"label":"td","span":{"begin":2655,"end":2656}},{"label":"td","span":{"begin":2658,"end":2659}},{"label":"td","span":{"begin":2661,"end":2662}},{"label":"td","span":{"begin":2664,"end":2665}},{"label":"td","span":{"begin":2667,"end":2668}},{"label":"td","span":{"begin":2670,"end":2671}},{"label":"td","span":{"begin":2673,"end":2674}},{"label":"td","span":{"begin":2676,"end":2677}},{"label":"td","span":{"begin":2679,"end":2680}},{"label":"td","span":{"begin":2682,"end":2683}},{"label":"td","span":{"begin":2685,"end":2686}},{"label":"td","span":{"begin":2688,"end":2689}},{"label":"td","span":{"begin":2691,"end":2692}},{"label":"tr","span":{"begin":2693,"end":2727}},{"label":"td","span":{"begin":2693,"end":2697}},{"label":"td","span":{"begin":2699,"end":2700}},{"label":"td","span":{"begin":2702,"end":2703}},{"label":"td","span":{"begin":2705,"end":2706}},{"label":"td","span":{"begin":2708,"end":2709}},{"label":"td","span":{"begin":2711,"end":2712}},{"label":"td","span":{"begin":2714,"end":2715}},{"label":"td","span":{"begin":2717,"end":2718}},{"label":"td","span":{"begin":2720,"end":2721}},{"label":"td","span":{"begin":2723,"end":2724}},{"label":"td","span":{"begin":2726,"end":2727}},{"label":"tr","span":{"begin":2728,"end":2768}},{"label":"td","span":{"begin":2728,"end":2732}},{"label":"td","span":{"begin":2740,"end":2741}},{"label":"td","
span":{"begin":2743,"end":2744}},{"label":"td","span":{"begin":2746,"end":2747}},{"label":"td","span":{"begin":2749,"end":2750}},{"label":"td","span":{"begin":2752,"end":2753}},{"label":"td","span":{"begin":2755,"end":2756}},{"label":"td","span":{"begin":2758,"end":2759}},{"label":"td","span":{"begin":2761,"end":2762}},{"label":"td","span":{"begin":2764,"end":2765}},{"label":"td","span":{"begin":2767,"end":2768}},{"label":"td","span":{"begin":2769,"end":2773}},{"label":"td","span":{"begin":2787,"end":2788}},{"label":"td","span":{"begin":2790,"end":2791}},{"label":"td","span":{"begin":2793,"end":2794}},{"label":"td","span":{"begin":2796,"end":2797}},{"label":"td","span":{"begin":2799,"end":2800}}],"tracks":[]}
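
The segmentation scheme described in the "text" field above can be illustrated with a small Python sketch. It is not the SMOTIF implementation: the function names `segment` and `find_starts` are hypothetical, and a regular-expression lookahead stands in for SMOTIF's pos-list search. The motif GC[1,2]T is assumed to translate to the pattern `GC.{1,2}T`, and reported start positions are 1-based, as in the Table 6 example.

```python
import re


def segment(seq, p, L):
    """Split seq into p partitions and extend each one by the first
    L - 1 symbols of the next partition (hypothetical helper).

    Returns (offset, core_len, text) tuples, where offset is the 0-based
    start of the partition in seq and core_len is its length before the
    overlap was appended.
    """
    size = -(-len(seq) // p)  # ceiling division: equal parts except the last
    segments = []
    for i in range(p):
        start = i * size
        if start >= len(seq):
            break
        core_end = min(start + size, len(seq))
        end = min(core_end + L - 1, len(seq))  # append the overlap region
        segments.append((start, core_end - start, seq[start:end]))
    return segments


def find_starts(seq, p, L, pattern):
    """Search each segment separately and return 1-based start positions.

    Matches that start inside the overlap region are discarded, because
    the next segment reports them.
    """
    rx = re.compile(f"(?={pattern})")  # lookahead keeps overlapping matches
    starts = []
    for offset, core_len, text in segment(seq, p, L):
        for m in rx.finditer(text):
            if m.start() < core_len:
                starts.append(offset + m.start() + 1)
    return starts


S = "GCATGCGTTAGCATCATC"       # sequence S from Table 6
MOTIF = "GC.{1,2}T"            # assumed regex form of GC[1,2]T

print([text for _, _, text in segment(S, 3, 5)])
print(find_starts(S, 3, 5, MOTIF))
```

With p = 3 and L = 5 the sketch reproduces the three segments of Table 6 and the start positions 1, 5 and 11. Calling `find_starts(S, 3, 1, MOTIF)`, which appends no overlap, returns only position 1, matching the remark that positions 5 and 11 would otherwise be missed.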