Background
Finding pseudo-periodic repeats (or tandem repeats) is an important task in biological sequence analysis [1-3]. The genomes of many species are dominated by short sequences repeated consecutively. It is estimated that over 10% of the human genome consists of tandemly repeated sequences. About 10–25% of all known proteins have some form of repeated structure ranging from simple homopolymers to multiple duplications of entire globular domains. An instance (originally from Jaitly et al. [2]) of a human tandem repeat appears below (Genbank:10120313):
CCTCCTCCTCCACCTCCTCCTCCTCCTCCTCCTCCTCCGCCTTCTCATCCTCCTCCACTT
CCTCCTCCTCCTCCTCCTCCCCTTCTCATCCTCCTCCTCTTCATCTACCC
This tandem repeat consists of 35 approximate copies of the repeated pattern CCT.
Variation in the pseudo-periodic repeats demonstrates biologically important information. Sensitive tools for finding those regions containing pseudo-periodic repeats are required in practice. Repeats occur frequently in biological sequences, but they may not be exact in many cases. If the repeats are exact, the problem can be easily solved from computation point of view. However, repeats are seldom exact in biological sequences. The errors in those repeats make it difficult to find regions of those repeats. Many measures and algorithms have been proposed.
Landau and Schmidt [4] studied the problem of finding the two consecutive copies in a sequence of length n such that the edit distance (a match costs 0 and a mismatch/indel costs 1) between the two copies is at most k. The running time of the algorithm is O(kn log k log(n/kL)). Schmidt [5] used weighted grid digraphs for finding all non-overlapping pairs of substrings (not necessarily consecutive) with the highest scores in a given string of length n. The algorithm can handle any score scheme. It requires O(n2 log n) time and Θ(n2) space. In both [4] and [5], only two copies of the pattern are considered.

Measures for finding repeats
Three measures can be used to give partitions of repeated regions.

Quasiperiodicty
Wan and Song proposed a measure in which all the repeated copies (except the last one) have the same length [6]. For this measure, a linear time and space algorithm was given [6].

Approximate periods
Sim et al. [7] introduced a notion of approximation periods (approximate period) using edit distance or relative edit distance. The problem in general is defined as follows: given a string x, find a repeated pattern p such that x can be partitioned as x = p1p2...pk and  is minimized. Here d(p, pl) is the relative edit distance which is  the edit distance, where L = (|p| + |pl|)/2 is the average length of the two strings p and pl. Note that, the normalization of the edit distance is important for finding repeated patterns since otherwise, one can give a partition in which each pattern has one letter and the edit distance is at most 1 (small). The problem in general is NP-hard [7]. When the repeated pattern p is assumed to be a substring of x, The problem can be solved in O(|x|4) time. Note that the second measure is more general than the first since it allows insertions and deletions. Both measures in [7] and [6] use the bottleneck function that finds the repeated pattern p and assumes that each copy pi in the long string is close to the repeated pattern p, i.e., d(pi, p) ≤ δ and δ is minimized. However, in biological sequences, copies of the repeated patterns may change gradually so that some repeats in the region may have very little in common. For example, it is well-known that the N-terminal non-globular region of Thermus thermophilus seryl-tRNA synthetase (PDB:1SRY) [1,8] has weak 7-residue repeats. See Table 1. The similarity score between two consecutive patterns is calculated using Blosum62 matrix and the gap penalty is set to be -4. The repeated patterns gradually changes from the 4-th unit LDLEALLA to the 13-th unit KEARLE. The average similarity score for the nine pairs of consecutive patterns is 4.56. But the similarity score between the 4-th unit and the 8-th unit is -11. In this case, the algorithms based on the bottleneck function may fail to find the multiple repeats.
Table 1  Pseudo periodic repeats of 1SRY (matrix:blosum62, gap penalty: -4)
Unit  Pseudo-periodic unit  Length  Similarity with previous unit
1  MVDLKRLR  8
2  QEPEVFHR  8  -5
3  AIREKGVA  8  -10
4  LDLEALLA  8  -1
5  LDREVQEL  8  7
6  KKRLQEVQ  8  6
7  TERNQVA  7  6
8  KRVPKAP  7  -4
9  PEEKEAL  7  -2
10  IARGKAL  7  3
11  GEEAKRL  7  3
12  EEALRE  6  10
13  KEARLE  6  12
14  ALLLQV  6  -6
15  PLPP  4  -8

Pseudo-periodic repeats
Li et al. [1] gave the first measure that allows gradual changes of patterns and changes of pattern lengths in the region. The repeats they defined are called the pseudo-periodic repeats. Given a repeated region (a string) x and a partition X = s1s2...sk, the pseudo periodic score is
where d(·) is the edit distance, |si| is the length of si, and c is a factor that control the penalty of the two ends of the partition. Li et al. [1] gave a O(|x|2) algorithm to compute an optimal partition of a given repeated region x. It was shown that the pseudo-periodic score can accurately give partitions for tandem repeated regions, where the repeated patterns are weakly similar.
Example: The example is from [1]. The sequence of the LbH domain of members of the LpxA family consists of the imperfect tandem repetition of hexapeptide units [9-11]. These imperfect tandem repeats (partitions) have been accurately detected by the algorithm using the pseudo periodic score [1]. (See Table 2).
Table 2  Pseudo periodic repeats of LPXA_ECOLI (matrix:blosum62, gap penalty: -4)
Unit  Pseudo-periodic unit  Length  Similarity with previous unit
1  MIDKSAFVHPTAIVEEGA  18
2  SIGANAHIGPFCIVGPHV  18  14
3  EIGEGTVLKSHVVVNGHT  18  16
4  KIGRDNEIYQFASI  14  -7
5  GEVNQDLKYAGEPTR  15  -3
6  VEIGDRNRIRESVTI  15  2
7  HRGTVQGGGL  10  -14
8  TKVGSDNLLMINAHIAHD  18  -24
9  CTVGNRCILANNATLAGH  18  20
10  VSVDDFAIIGGMTAVHQF  18  4
11  CIIGAHVMVGGCSGV  15  4 In sequence analysis, we may have a long sequence s and only a substring t (or a few substrings) of s contains the consecutive repeats. The problem here is to find out the substring t and give an optimal pseudo-periodic partition. We call this problem the local pseudo-periodic problem. In this paper, we define the maximization version of the pseudo-periodic partition and develop an algorithm that solves the local pseudo-periodic problem in O(n2) time, where n is the length of the input sequence s.