A Heuristic Algorithm to Find All Normalized Local Alignments Above Threshold. Local alignment is an important task in molecular biology to see if two sequences contain regions that are similar. The most popular approach to local alignment is the use of dynamic programming due to Smith and Waterman, but the alignment reported by the Smith-Waterman algorithm has some undesirable properties. The recent approach to fix these problems is to use the notion of normalized scores for local alignments by Arslan, Egecioglu and Pevzner. In this paper we consider the problem of finding all local alignments whose normalized scores are above a given threshold, and present a fast heuristic algo­ rithm. Our algorithm is 180-330 times faster than Arslan et al.’s for sequences of length about 120 kbp and about 40-50 times faster for sequences of length about 30 kbp. Sequence alignment is a fundamental task in molecular biology to see if two sequences are related. The local alignment problem for two sequences is to find a pair of similar regions, one from each sequence. In many biological applications local alignment is more useful than global alignment because two sequences may not be similar as a whole, but they can contain small regions that are very similar. A most popular approach to local alignment is the use of dynamic programming due to Smith and Waterman (Smith and Waterman, 1981; Waterman et al., 1995). The Smith- Waterman algorithm finds a pair of regions whose alignment score is the highest (which is called the optimal local alignment). However, the alignment reported by the Smith-Waterman algorithm has some undesirable properties. The alignment may include regions whose similarity is very poor (Arslan et al., 2001; Zhang et al., 1999). Also, the Smith-Waterman algorithm does not consider the lengths of local alignments in computing scores, and therefore a biologically important but short alignment may not be detected(Arslan and Pevzner et al., 2001). There have been several approaches to fix the problems of the Smith-Waterman algorithm (Goad and Kanehisa, 1982; Sellers, 1984; Zhang eta/., 1998; Zhang etal., 1999; Arslan and Pevzner et a/., 2001). For example, Zhang et al. (Zhang et al., 1998; Zhang and Miller et a/., 1999) proposed an approach based on the notion of X- drop, a region that scores below -X for some fixed X>0. They consider only alignments that contain no X-drop. A more recent approach to fix the problems is the use of normalized scores for local alignments (Marzal and Vidal 1993; Arslan and 6. Egecioglu (1999); Arslan et a/., 2001). Arslan eta/. (2001) used fractional programming to obtain the optimal normalized local alignment of two sequences. They also extended their solution to find all local alignments whose normalized scores are larger than a given threshold by running their solution repeatedly after masking those alignments that are already found. In this paper we consider the problem of finding all local alignments whose normalized scores are above a given threshold. A local alignment whose normalized score is above a threshold will be called a TNL alignment. Since small regions are reported in the local alignment problem, finding all of TNL alignments can be more appropriate than just finding a single (optimal) local alignment in applications such as gene prediction (Gusfield, 1997). We present a fast algorithm for the problem of finding all the TNL alignments. Our algorithm is based on fractional programming and the banded Smith-Waterman algorithm. The fractional programming approach uses the Smith- Waterman algorithm repeatedly to obtain one local alignment. Arslan et al. (Arslan et al., 2001) use this solution repeatedly to find all the TNL alignments. Hence, Arslan et al.' s algorithm makes a double repetition of the Smith-Waterman algorithm, which makes it very slow in practice. Our algorithm first makes a careful selection of bands based on several parameters so that the selected bands can only contain TNL alignments with high probability. Then it runs the Smith-Waterman algorithm on the selected bands in such a way that the number of applications of the Smith-Waterman algorithm on each band is much smaller than that of Arslan etal.' s. Experiments show that our algorithm is about 180-330 times faster than Arslan et al.' s for the data set of human chromosome X cosmids Qc8D3 (GenBank Acc. No. AF030876 (113 kbp)) and mouse chromosome X clone BAC B22804 (GenBank Acc. No. AF121351 (123 kbp)) and it is about 40-50 times faster for the data set of human (GenBank Acc. No. L10347 (31 kbp)) and mouse (GenBank Acc. No. M65161 (32.4 kbp)) pro-alpha1 type the benefit of II collagen, while finding all TNL alignments. In addition to extremely fast speed, our algorithm can change the parameters for selecting bands to adjust the time and accuracy of our algorithm. A string or a sequence is concatenations of zero or more characters from an alphabet X. A space is denoted by J £ 2; we regard J as a character for convenience. The length of a string a is denoted by |a|. Let a, denote /th character of a string a and a(i, j) denote a substring aa+1 ■■■aj of a. When a sequence a is a substring of a sequence a, we denote it by «Ts. A local alignment whose normalized alignment score is above a given threshold will be called a TNL alignment. Therefore, the TNLA problem is to find all the TNL alignments. For the TNLA problem, Arslan et al. (2001) first solve the NLA problem and then mask the solution to repeatedly find the next optimal solution of the NLA problem. In general, it is highly likely that an optimal local alignment has a long part of exact matches in it. The banded Smith- Waterman algorithm uses this phenomenon and finds a solution very quickly for the LA problem with high probability (Setubal and Meidanis, 1997). Many biological systems such as Phrap (Green), STROLL (Chen and Skiena, 1997) and FASTA (Lipman and Pearson, 1998) are based on this algorithm. The banded SW algorithm consists of the following two steps: 1. Determine bands Find all the bands that can have local alignments with high similarity. Usually, a band is defined by the information of exact matches and if two or more bands overlap, they are merged into one band. 2. Run the SW algorithm Run the SW algorithm on the defined bands. The entries of Hi.j outside the bands are assumed 0 and they are not computed. We first present an algorithm for finding an optimally normalized local alignment (the NLA problem) and then present algorithms for finding all local alignments whose normalized alignment scores are above a given threshold Ts (the TNLA problem). Our algorithm for the NLA problem is used as a subroutine in our algorithms for the TNLA problem. We present an algorithm for the NLA problem of two strings a(\a\=m) and b(\b\=ri). Our algorithm is based on Dinkelbach’s algorithm which is a general algorithm to solve the NLA problem. We use the banded SW algorithm to solve the LA problem in Dinkelbach’s algorithm. Since we already described Dinkelbach’s algorithm and the banded SW algorithm, we concentrate on describing how to determine bands. We first introduce some definitions before we describe how to determine the bands. Consider the H table generated by the SW algorithm in computing SLfa, b). The fth-diagonal (or diagonal i), Q for 1 150 and Ti< 150. We have experimented with various values of the three parameters Ti, T> and w, and here we show nine cases for three values of Ti and three values of w and we have chosen an appropriate value of TI for each T. Note that we decrease T b as Ti increases. Otherwise, good bands may be discarded since longer exact matches are fewer than shorter exact matches. Table 2 and Table 3 show execution times of the three algorithms for the TNLA problem. We first analyze the ratios of execution times of the three algorithms. Assume that the widths of all bands are the same and the TNL alignments are equally distributed in the bands which include TNL alignments. Let k be the number of the bands that our algorithms determine, and k’ be the number of the bands which include TNL alignments. Also, let t be the average number of TNL alignments in a band. Then, the equation (1) can be rewritten as follows: kk't k't+k-k' Hence, our improved algorithm is faster than Arslan et al’s by the following factor: where 5 is the speedup of our simple algorithm over Arslan et al.’ s. Consider the experimental results for the first data set in Table 2. When w=150, 7/=15 and T>=50, we have 5=80 and k=5, k’=5, t=\/3 (k, k’ are not shown in Table 2 since the number of TNL alignments is 15 as shown in Table 4). By the above formula, our improved algorithm should be approximately 400 times faster than Arslan et a/.’s and the result (631603/1895=330) is similar to our expectation. (Note that the time of the improved algorithm can vary depending on the value of 71 in each case.) For another example, when w=200, T<=12 and 71=80, we get 5=18, £=24, £ =5 and r=1/3. Thus, our improved algorithm should be about 200 times faster than Arslan etal’ s and the result (631603/3492=180) shows an approximately same ratio. For the second data set, our algorithm is about 40 times faster than Arslan et al’ s when w=150, 77=15 and 71=50, and about 50 times faster when w=200, 77=12 and 71=80. Table 4 shows the accuracy of our algorithms. For example, when w=100, 77=20 and 71=30 in Table 4, our algorithms find 12 out of 15 TNL alignments. But as w increases and T decreases, our algorithms are getting more accurate. Note that our algorithms find all TNL alignments when w> 150 and T< 15. In addition to extremely fast speeds, another advantage of our algorithms is that we can change the parameters w, Ti and 71 to adjust the time and accuracy of our algorithms.