PMC:1459173 / 13802-17719 JSONTXT

Annnotations TAB JSON ListView MergeView

{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1459173","sourcedb":"PMC","sourceid":"1459173","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1459173","text":"Mining extensible motifs\nThe procedure of motif extraction that is described in Table 1 essentially constructs the inexact suffix tree of [1] implicitly, in a different order. The input is a string s of size n and two positive integers, K and D.\nThe extensibility parameter D is interpreted in the sense that up to D (or 1 to D) dot characters between two consecutive solid characters are allowed. The output is all maximal extensible (with D spacers) patterns that occur at least K times in s. Incidentally, the algorithm can be adapted to extract rigid motifs as a special case. For this, it suffices to interpret D as the maximum number of dot characters between two consecutive solid characters.\nThe algorithm works by converting the input into a sequence of possibly overlapping cells: A cell is the smallest substring in any pattern on s, that has exactly two solid characters: one at the start and the other at the end position of this substring. A maximal extensible pattern is a sequence of cells.\n\nInitialization phase\nThe cell is the smallest extensible component of a maximal pattern and the string can be viewed as a sequence of overlapping cells. If no don't care characters are allowed in the motifs then the cells are non-overlapping. The initialization phase has the following steps.\nStep 1: Construct patterns that have exactly two solid characters in them and separated by no more than D spaces or \".\" characters. This is done by scanning the string s from left to right. Further, for each location we store start and end position of the pattern. For example, if s = abzdabyxd and K = 2, D = 2, then all the patterns generated at this step are: ab, a.z, a..d, bz, b.d, b..a, zd, z.a, z..b, da, d.b, d..y, a.y, a..x, by, b.x, b..d, yx, y.d, xd, each with its occurrence list. Thus ab = {(1, 2), (5, 6)}, a.z = {(1, 3)} and so on.\nStep 2: The extensible cells are constructed by combining all the cells with at least one dot character and the same start and end solid characters. The location list is updated to reflect the start and end position of each occurrence. Continuing the previous example, b-d is generated at this step with b-d = {(2, 4), (6, 9)}. All cells m with |m| \u003cK are discarded. In the example, the only surviving cells are ab, b-d with ab = {(1, 2), (5, 6)} and b-d = {(2, 4), (6, 9)}\n\nIteration phase\nLet B be the collection of cells. If m = Extract(B), then m ∈ B and there does not exist m' ∈ B such that m' ∗ m holds: m1 ∗ m2 if one of the following holds: (1) m1 has only solid characters and m2 has at least one non-solid character (2) m2 has the \"-\" character and m1 does not, and, (3) m1 and m2 have d1, d2 \u003e 0 dot characters respectively and d1 \u003cd2.\nFurther, m1 is ~-compatible with m2 if the last solid character of m1 is the same as the first solid character of m2. Further if m1 is ~-compatible with m2, then m = m1 ~ m2 is the concatenation of m1 and m2 with an overlap at the common end and start character and m = {(x, y)|(x, l) ∈ }. For example if m1 = ab and m2 = b.d then m1 is ~-compatible with m2 and m1 ~ m2 = ab.d. However, m2 is not ~-compatible with m1.\nNodeInconsistent(m) is a routine that checks if the new motif m is non-maximal w.r.t. earlier non-ancestral nodes by checking the location lists. The procedure is best described by the pseudocode shown in Table 1. Steps G:18–19 detect the suffix motifs of already detected maximal motifs. Result is the collection of all the maximal extensible patterns.\nA tight time complexity for the procedure is not easy to come by, however, if we consider M to be the number of extensible maximal motifs and S to be the size of the output – i.e. the sum of the sizes of the motifs and the sizes of the corresponding location lists – then the time taken by the algorithm is O(SM log M). In experiments of the kind described later in the paper, at 3 GHz clock, time ranged typically from few minutes to half an hour.","divisions":[{"label":"title","span":{"begin":0,"end":24}},{"label":"p","span":{"begin":25,"end":245}},{"label":"p","span":{"begin":246,"end":699}},{"label":"p","span":{"begin":700,"end":1006}},{"label":"sec","span":{"begin":1008,"end":2321}},{"label":"title","span":{"begin":1008,"end":1028}},{"label":"p","span":{"begin":1029,"end":1300}},{"label":"p","span":{"begin":1301,"end":1847}},{"label":"p","span":{"begin":1848,"end":2321}},{"label":"title","span":{"begin":2323,"end":2338}},{"label":"p","span":{"begin":2339,"end":2695}},{"label":"p","span":{"begin":2696,"end":3114}},{"label":"p","span":{"begin":3115,"end":3468}}],"tracks":[]}