Pseudo-periodic Partition
Let s = a1a2...an be a string of length n. A partition π(s) = {s1, s2, ..., sk} of s is a set of substrings of s such that s = s1s2...sk (si's are also called repeats). When s is clear, we use π instead of π(s). Π(s) denotes the set of all partitions of s. Let si and si+1 be two strings. The similarity measure μ(si, si+1) between si and si+1 is the maximum alignment value for si and si+1. For any two letters (possibly spaces) x and y, μ(x, y) is the similarity score between the two letters. For example, one can use the following score scheme I: a match costs 1, a mismatch costs -1, and an insertion or deletion costs -1. Here we choose to use maximization version since for protein sequences, there are popular similarity matrices, e.g., PAM matrix.
Let c be a negative constant. We call  the granularity factor. Let Δ denotes a space in an alignment. In this paper, we assume that  >μ(x, Δ) for any letter x in the given sequence.
Let us consider the following example. sA = s1s2s3s4s5 and π(sA) = {s1, s2, s3, s4, s5}, where s1 = aaa, s2 = aat, s3 = att, s4 = ttt and s5 = tta. The self-alignment of this partition is show in Figure 1. The value of the self-alignment is , where |s1| and |s5| are the penalty scores for the two segments s1 and s5 aligned to spaces.
Figure 1  The alignment for π(sA). Note that the score for insertion and deletion would be different from the granularity factor . If there is a gap at the right end of the alignment between s4 and s5, there is ambiguity in the calculation of the self-alignment value. Therefore, we need a more precise definition for the value of the self-alignment corresponding to a partition.
Let s = s1s2...sk be the string and π(s) = {s1, s2, ..., sk}. |s| denotes the length of the string. pre(s, i) is the length-i prefix of s and suf(s, i) is the length-i suffix of s. Note that the gap at the right end of the self-alignment of s only appears in the segments sk-1 and sk. Denote by se the suffix sk-1sk of s. We can designate that only the last i letters in se are mapped to spaces with score  each, for 1 ≤ i ≤ |se|. Now let us consider the remain part pre(se, |se| - i) of se. There are two cases: (1) If i ≥ |sk|, pre(se, |se| - i) is a prefix of sk-1 and is optimally aligned with sk. (2) If i < |sk|, pre(se, |se| - i) contains sk-1 and a prefix of sk. In this case, sk-1 is optimally aligned with sk and the letters in the prefix of sk are scored as μ(x, Δ) each.
For a partition π of s and a fixed i, 1 ≤ i ≤ |se|, let V(π, c, i) be the value of the self-alignment such that s1 is mapped to spaces with score  each, sj is optimally aligned with sj+1 for j = 1, 2, ..., k -2, pre(se, |se| - i) is scored according to the above two cases, and the last i letters in se are mapped to spaces with score  each. We have
In V(π, c, i), the alignment between s1s2...sk-2pre(se, |se| - i) and s2s3...sk is called the middle alignment. The value of the self-alignment of π is defined as .
For example, let sB = s1s2s3 and π(sB) = {s1, s2, s3}, where s1 = aaaa, s2 = aaat and s3 = aaa. We use score scheme I and c = -1. The valid value of i is 1, 2, ..., 7 since |s2s3| = 7. For i = 5 ≥ |s3|, pre(s2, 2) is optimally aligned with s3, s1 and suf(s2s3, 5) is scored as  (Figure 2(a)). So V(π(sB), c, 5) = μ(s1, s2) + μ(pre(s2, 2), s3) +  × (|s1| + 5) = . For i = 2 < |s3|, s2 is optimally aligned with s3, pre(s3, 1) is scored as μ(x, Δ) and suf(s3, 2) is scored as  (Figure 2(b)). In this case, V(π(sB), c, 2) = μ(s1, s2) + μ(s2, s3) + μ(x, Δ) × 1 +  × (|s1| + 2) = 0. For i = 4, at the right ends of the optimal self-alignment of π(sB) (Figure 2(c)), there are 4 letters that match spaces. The last letter t in s2 matches a space at the right end of the alignment. The assumption that  >μ(x, Δ) forces this column to have score  instead of μ(t, Δ) to maximize V(π, c). We have V(π(sB), c) = V(π(sB), c, 4) = 1. For i = 1, 3, 6, 7, the values are lower than V(π(sB), c) = V(π(sB), c, 4) = 1.
Figure 2  The alignment for π(sB). Let Π(s) be the set of all possible partitions of s. Bc(s) = maxπ∈Π(s) V(π, c) is the optimal V(·) value of partitions. A partition πq = {s1, s2, ..., sk} of s is called the pseudo-periodic partition of s if Bc(s) = V(πq, c). In Li et al. [1], it was demonstrated that the numerical measure Bc(s) (in fact, the minimization version) is sensitive for partitioning s into repeats that allow the gradual changes of patterns and changes of pattern lengths.
In practice, we are given a long string s. We want to find a region (substring) t of s that contains pseudo-periodic repeats. Once the region t is found, we want to get the pseudo-periodic partition of t. The mathematical problem is defined as follows: