PMC:1679804 / 20532-25341 JSONTXT

Annnotations TAB JSON ListView MergeView

    2_test

    {"project":"2_test","denotations":[{"id":"17118189-8324630-1688067","span":{"begin":541,"end":542},"obj":"8324630"},{"id":"17118189-8697238-1688068","span":{"begin":543,"end":544},"obj":"8697238"},{"id":"17118189-14980017-1688069","span":{"begin":1271,"end":1272},"obj":"14980017"},{"id":"17118189-12824369-1688070","span":{"begin":3249,"end":3251},"obj":"12824369"},{"id":"17118189-15980505-1688071","span":{"begin":3263,"end":3265},"obj":"15980505"},{"id":"17118189-8532532-1688072","span":{"begin":3286,"end":3288},"obj":"8532532"},{"id":"17118189-15860560-1688073","span":{"begin":3289,"end":3291},"obj":"15860560"},{"id":"17118189-12520026-1688074","span":{"begin":3383,"end":3385},"obj":"12520026"},{"id":"17118189-8324630-1688075","span":{"begin":4441,"end":4442},"obj":"8324630"},{"id":"17118189-8697238-1688076","span":{"begin":4443,"end":4444},"obj":"8697238"}],"text":"Related work\nMany existing pattern matching algorithms [8,11-18] can be used to solve the simple pattern search problem. Given the sequence length, n, and the pattern length, m, exact matching algorithm can run in O (n + m) [11]; approximate matching algorithm can run in O (rn) [16,17] or O (nm/w) [18], where r is the error threshold and w is the size of a computer word. The space complexity is O (n) for both exact matching and approximate matching.\nSeveral previous efforts have focused on the structured pattern search problem. Anrep [3,4] provides a unified biosequence pattern representation by using network expressions with spacers, where a network expression is a regular expression without Kleene closure. With network expressions, one can specify the scoring scheme and the threshold of approximate matching for each simple motif separately, the positional weights which express the relative importance of different parts of a motif, and whether a simple motif is optional. Anrep introduces a two-step approach: first it searches for simple motifs by a threshold-sensitive motif matching algorithm and then it finds the structured motif by an optimized backtracking matching algorithm. However, as compared by [6], Anrep is much slower than SMARTFINDER.\nIn [5], the structured motif is called a Classes of Characters and Bounded Gaps (CBG) expression and is represented by a non-deterministic ε-automaton with bit parallelism. Two algorithms are proposed for CBG expression search: forward search and backward search. Bit parallelism speeds up the search, but is adequate only for a pattern whose maximum span is smaller than the length of the computer word. Also the implementation of CBG can only handle such pattern. This limits the application of CBG to searching for patterns with small number of symbols and gaps.\nSMARTFINDER [6,7], which is currently the most efficient method for structured motif search, is also a two-step approach. In the first step each simple motif is searched separately by building a suffix tree for the sequence. This step outputs the ordered occurrence lists of all simple motifs. The second step solves a constraint satisfaction problem by considering constraints individually in three sub-steps. First it considers the gap constraints and builds a constraint graph whose nodes are the simple motif occurrences and edges connect all possible pairs of nodes that locally satisfy the gap constraints. It then considers the constraint for the maximum number of missing components and shrinks the graph to contain only the nodes that can be in the structured motif occurrences. Finally it enumerates all the valid occurrences by a depth first search (DFS). Notable differences in SMOTIF and SMARTFINDER are as follows: we search patterns directly by positional joins over an inverted index, we consider variable gap constraints during the positional joins as opposed to building a constraint graph, and we handle missing components more efficiently by considering them over patterns instead of over each occurrence as in SMARTFINDER. Note also that like Anrep and SMARTFINDER, SMOTIF can also mine approximate patterns, when using the two-step approach, which we describe later.\nFor profile search, MATCH [19], P-Match [20], and MatInspector [21,22] search DNA sequences against a position weight matrix library (such as TRANSFAC database [23]) and report the occurrences that satisfy given score thresholds. They compute the matrix score by multiplying the base frequency with the information content value at each position, in order to emphasize the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions. Besides the matrix score, they define a core region, which is usually the first 4–5 most conserved consecutive positions of the matrix, and perform the core score threshold check. Then they align the matrix to each position of the sequence and calculate the core score and matrix score. However, these algorithm don't consider the prior probability of each base when calculating the matrix (or core) score, and the core region is required to be consecutive. They need to check all positions of each subsequence (at least all the core positions) in order to calculate the matrix (core) score. Moreover, these algorithms only work on simple profile with one single matrix component. For structured profile search, only Anrep [3,4] provides the capability to model structured profiles, with its general network expressions. However, Anrep doesn't give a solution on score calculation and fast search for structured profiles. Moreover, its implementation doesn't support structured profile search. To our knowledge, SMOTIF is the only implemented method that can handle structured profile search."}