Measures for finding repeats Three measures can be used to give partitions of repeated regions. Quasiperiodicty Wan and Song proposed a measure in which all the repeated copies (except the last one) have the same length [6]. For this measure, a linear time and space algorithm was given [6]. Approximate periods Sim et al. [7] introduced a notion of approximation periods (approximate period) using edit distance or relative edit distance. The problem in general is defined as follows: given a string x, find a repeated pattern p such that x can be partitioned as x = p1p2...pk and is minimized. Here d(p, pl) is the relative edit distance which is the edit distance, where L = (|p| + |pl|)/2 is the average length of the two strings p and pl. Note that, the normalization of the edit distance is important for finding repeated patterns since otherwise, one can give a partition in which each pattern has one letter and the edit distance is at most 1 (small). The problem in general is NP-hard [7]. When the repeated pattern p is assumed to be a substring of x, The problem can be solved in O(|x|4) time. Note that the second measure is more general than the first since it allows insertions and deletions. Both measures in [7] and [6] use the bottleneck function that finds the repeated pattern p and assumes that each copy pi in the long string is close to the repeated pattern p, i.e., d(pi, p) ≤ δ and δ is minimized. However, in biological sequences, copies of the repeated patterns may change gradually so that some repeats in the region may have very little in common. For example, it is well-known that the N-terminal non-globular region of Thermus thermophilus seryl-tRNA synthetase (PDB:1SRY) [1,8] has weak 7-residue repeats. See Table 1. The similarity score between two consecutive patterns is calculated using Blosum62 matrix and the gap penalty is set to be -4. The repeated patterns gradually changes from the 4-th unit LDLEALLA to the 13-th unit KEARLE. The average similarity score for the nine pairs of consecutive patterns is 4.56. But the similarity score between the 4-th unit and the 8-th unit is -11. In this case, the algorithms based on the bottleneck function may fail to find the multiple repeats. Table 1 Pseudo periodic repeats of 1SRY (matrix:blosum62, gap penalty: -4) Unit Pseudo-periodic unit Length Similarity with previous unit 1 MVDLKRLR 8 2 QEPEVFHR 8 -5 3 AIREKGVA 8 -10 4 LDLEALLA 8 -1 5 LDREVQEL 8 7 6 KKRLQEVQ 8 6 7 TERNQVA 7 6 8 KRVPKAP 7 -4 9 PEEKEAL 7 -2 10 IARGKAL 7 3 11 GEEAKRL 7 3 12 EEALRE 6 10 13 KEARLE 6 12 14 ALLLQV 6 -6 15 PLPP 4 -8 Pseudo-periodic repeats Li et al. [1] gave the first measure that allows gradual changes of patterns and changes of pattern lengths in the region. The repeats they defined are called the pseudo-periodic repeats. Given a repeated region (a string) x and a partition X = s1s2...sk, the pseudo periodic score is where d(·) is the edit distance, |si| is the length of si, and c is a factor that control the penalty of the two ends of the partition. Li et al. [1] gave a O(|x|2) algorithm to compute an optimal partition of a given repeated region x. It was shown that the pseudo-periodic score can accurately give partitions for tandem repeated regions, where the repeated patterns are weakly similar. Example: The example is from [1]. The sequence of the LbH domain of members of the LpxA family consists of the imperfect tandem repetition of hexapeptide units [9-11]. These imperfect tandem repeats (partitions) have been accurately detected by the algorithm using the pseudo periodic score [1]. (See Table 2). Table 2 Pseudo periodic repeats of LPXA_ECOLI (matrix:blosum62, gap penalty: -4) Unit Pseudo-periodic unit Length Similarity with previous unit 1 MIDKSAFVHPTAIVEEGA 18 2 SIGANAHIGPFCIVGPHV 18 14 3 EIGEGTVLKSHVVVNGHT 18 16 4 KIGRDNEIYQFASI 14 -7 5 GEVNQDLKYAGEPTR 15 -3 6 VEIGDRNRIRESVTI 15 2 7 HRGTVQGGGL 10 -14 8 TKVGSDNLLMINAHIAHD 18 -24 9 CTVGNRCILANNATLAGH 18 20 10 VSVDDFAIIGGMTAVHQF 18 4 11 CIIGAHVMVGGCSGV 15 4 In sequence analysis, we may have a long sequence s and only a substring t (or a few substrings) of s contains the consecutive repeats. The problem here is to find out the substring t and give an optimal pseudo-periodic partition. We call this problem the local pseudo-periodic problem. In this paper, we define the maximization version of the pseudo-periodic partition and develop an algorithm that solves the local pseudo-periodic problem in O(n2) time, where n is the length of the input sequence s.