@ewha-bio:11
|
A Heuristic Algorithm to Find All Normalized Local Alignments Above Threshold.
Local alignment is an important task in molecular
biology to see if two sequences contain regions that are similar. The most popular approach to local alignment is the use of dynamic programming due to Smith and Waterman, but the alignment reported by the Smith-Waterman algorithm has some undesirable properties. The recent approach to fix these problems is to use the notion of normalized scores for local alignments by Arslan, Egecioglu and Pevzner. In this paper we consider the problem of finding all local alignments whose normalized scores are above a given threshold, and present a fast heuristic algo rithm. Our algorithm is 180-330 times faster than Arslan et al.’s for sequences of length about 120 kbp and about 40-50 times faster for sequences of length about 30 kbp.
Sequence alignment is a fundamental task in molecular
biology to see if two sequences are related. The local alignment problem for two sequences is to find a pair of similar regions, one from each sequence. In many biological applications local alignment is more useful than global alignment because two sequences may not be similar as a whole, but they can contain small regions that are very similar.
A most popular approach to local alignment is the use of dynamic programming due to Smith and Waterman (Smith and Waterman, 1981; Waterman et al., 1995). The Smith- Waterman algorithm finds a pair of regions whose alignment score is the highest (which is called the optimal local alignment). However, the alignment reported by the Smith-Waterman algorithm has some undesirable properties. The alignment may include regions whose similarity is very poor (Arslan et al., 2001; Zhang et al., 1999). Also, the Smith-Waterman algorithm does not consider the lengths of local alignments in computing scores, and therefore a biologically important but short alignment may not be detected(Arslan and Pevzner et al., 2001).
There have been several approaches to fix the problems of the Smith-Waterman algorithm (Goad and Kanehisa, 1982; Sellers, 1984; Zhang eta/., 1998; Zhang etal., 1999; Arslan and Pevzner et a/., 2001). For example, Zhang et al. (Zhang et al., 1998; Zhang and Miller et a/., 1999) proposed an approach based on the notion of X- drop, a region that scores below -X for some fixed X>0. They consider only alignments that contain no X-drop. A more recent approach to fix the problems is the use of normalized scores for local alignments (Marzal and Vidal 1993; Arslan and 6. Egecioglu (1999); Arslan et a/., 2001). Arslan eta/. (2001) used fractional programming to obtain the optimal normalized local alignment of two sequences. They also extended their solution to find all local alignments whose normalized scores are larger than a given threshold by running their solution repeatedly after masking those alignments that are already found.
In this paper we consider the problem of finding all local alignments whose normalized scores are above a given threshold. A local alignment whose normalized score is above a threshold will be called a TNL alignment. Since small regions are reported in the local alignment problem, finding all of TNL alignments can be more appropriate than just finding a single (optimal) local alignment in applications such as gene prediction (Gusfield, 1997). We present a fast algorithm for the problem of finding all the TNL
alignments. Our algorithm is based on fractional programming and the banded Smith-Waterman algorithm. The fractional programming approach uses the Smith- Waterman algorithm repeatedly to obtain one local alignment. Arslan et al. (Arslan et al., 2001) use this solution repeatedly to find all the TNL alignments. Hence, Arslan et al.' s algorithm makes a double repetition of the Smith-Waterman algorithm, which makes it very slow in practice. Our algorithm first makes a careful selection of bands based on several parameters so that the selected bands can only contain TNL alignments with high probability. Then it runs the Smith-Waterman algorithm on the selected bands in such a way that the number of applications of the Smith-Waterman algorithm on each band is much smaller than that of Arslan etal.' s.
Experiments show that our algorithm is about 180-330 times faster than Arslan et al.' s for the data set of human chromosome X cosmids Qc8D3 (GenBank Acc. No. AF030876 (113 kbp)) and mouse chromosome X clone BAC B22804 (GenBank Acc. No. AF121351 (123 kbp)) and it is about 40-50 times faster for the data set of human (GenBank Acc. No. L10347 (31 kbp)) and mouse (GenBank Acc. No. M65161 (32.4 kbp)) pro-alpha1 type the benefit of II collagen, while finding all TNL alignments. In addition to extremely fast speed, our algorithm can change the parameters for selecting bands to adjust the time and accuracy of our algorithm.
A string or a sequence is concatenations of zero or more characters from an alphabet X. A space is denoted by J £ 2; we regard J as a character for convenience. The length of a string a is denoted by |a|. Let a, denote /th character of a string a and a(i, j) denote a substring aa+1 ■■■aj of a. When a sequence a is a substring of a sequence
a, we denote it by «<a. Given two strings a=aici2---a,, and b=bib2 --bn, an alignment of a and b is a*=a2*a2*--a* and b*=hi*fe*- fo* constructed by inserting zero or more J's into a and b so that each a* maps to fo* for 1 </<I. There are three kinds of mappings in a* and b* according to the characters of and a*and b*.
■ match : a*=bi* d,
• mismatch : (a* =Ab,*) and (a* bi* =/= J),
• insertion or deletion (indel for short): either a* or b* is J.
Note that we do not allow the case of «*=&*= J.
If there exists an alignment of a and b whose number of matches, mismatches and indels are x,y and z, respectively, then we call the triplet (x, y, z) an alignment vector of a and
b. We define the alignment score of an alignment a* and b* with alignment vector (x, y, z) by score(x, y, z) =x-8y ~nz that is, we assume that a match score is 1, a mismatch
penalty is 8 and an indel penalty is y. We define the similarity of two sequences a and b, denoted by S(a, b), to be the highest alignment score of all possible alignments of a and b.
Given two sequences a and b, a local alignment of a and b is an alignment of two strings a and p such a(a and ft{b, and the optimal local alignment of a and b is the local alignment of a and b that has the highest similarity among all local alignments of a and b. We denote the similarity of an optimal local alignment by SL(a, b).
A most well-known algorithm to find an optimal local alignment was given by Smith and Waterman, which is known as the Smith-Waterman algorithm (SW algorithm for short) (Smith and Waterman, 1981; Waterman, 1995). Given two sequences a(\a\=m) and b(\b\=n), the SW algorithm computes SL(a, b) using a dynamic programming table (called the H table) of size (m+1)(n+1). Let H., for ()<i <m and 0<j<n denote SL(a(l, i), b(l,j)). Then, can be
computed by the following recurrence:
Among all Hu for 0<i<m and 0<j<n, the highest value is SL(a, b), and the SW algorithm takes 0(mn) time to compute it.
Since the SW algorithm does not consider the lengths of local alignments, the solution of the SW algorithm may include some regions with very low similarity. Moreover, a biologically important short alignment may not be detected (Arslan and Pevzner et al., 2001; Alexandrov and Solovyev, 1998; Zhang and Miller etal., 1999). To handle these problems, the notion of normalization for similarity has been introduced (Arslan 6. Egecioglu (1999), Arslan et al., 2001; Egecioglu and Ibel, 1996; Marzal and Vidal,
1993; Alexandrov and Solovyev, 1998).
Given two sequences a and b, we define the normalized similarity of a and b by, S(a, b)/(\aMb\+L), where L is a constant. From this definition we can define a normalized alignment score for local alignment: the normalized alignment score of an alignment a* and b* with alignment vector (x, y, z) is defined by nscore (x, y, z) =
\+2 y +z+L (Arslan and Pevzner etal., 2001). The opti
ma/ normalized local alignment of a and b is the local alignment of a and b that has the highest normalized alignment score.
Here we consider two problems related to normalized local alignment: one is to find the optimal normalized local alignment of a and fr, the other is to find all normalized local alignments of a and b above some threshold Ts.
Given two sequences a and b, let AV(a, b) be the set of all possible alignment vectors of a{a that ft{b. The problem of finding an optimal local alignment (LA problem for short) and that of finding an optimal normalized local alignment (NLA problem for short) are defined as follows:
LA problem : Find (x, y, z) e AV(a, b) such that score(x, y,
z) is greatest inAVfaz, b).
NLA problem : Find (x, y, z) e AV(a, b) such that
nscore(x, y, z) is greatest in AV(a, b).
Note that we can find the score of an optimal local alignment using the SW algorithm but with a simple modification of storing the starting position of each score, we can also find the location of the optimal local alignment. For a given A, we define the parametric local alignment
problem as follows:
LA(A) problem : For all (x, y, z)EAV(a, b), find the
maximal value ofx-J* y - //z-^(2x+2y+z+L)
A parametric local alignment problem can be converted into a local alignment problem because the optimal solution of LA(A) involves solving a local alignment problem and performing some simple computations (Arslan and Pevzner etal., 2001).
We can use Dinkelbach’s algorithm to solve the NLA problem. Dinkelbach developed a general algorithm that uses fractional programming (Dinkelbach, 1967). The optimal solution to NLA can be obtained via a series of optimal solutions of LA(A) for different A’s using the following result:
A is the optimal solution for NLA if and only if LA(A)=0. Dinkelbach’s algorithm (Dinkelbach, 1967) (See Algorithm 1) starts with an initial value for A and repeatedly solves LA(A), Since the LA(A) problem can be converted into an LA problem, it solves the LA problem and obtains an optimal alignment vector (x, y, z). Next, it sets A= nscore(x, v, z) and repeats this process until A is not changed any more. The convergence to the optimal solution of the NLA problem is guaranteed (Arslan et al., 2001).
In some biological applications, we need to find all local alignments whose normalized alignment scores are above a given threshold. Formally, we define this problem as follows:
TNLA problem : For given a threshold value Ts, find all (x,
y, z) e AV(a, b) such that nscore(x, y, z)
>Ts.
A local alignment whose normalized alignment score is above a given threshold will be called a TNL alignment. Therefore, the TNLA problem is to find all the TNL alignments.
For the TNLA problem, Arslan et al. (2001) first solve the NLA problem and then mask the solution to repeatedly find the next optimal solution of the NLA problem.
In general, it is highly likely that an optimal local alignment has a long part of exact matches in it. The banded Smith- Waterman algorithm uses this phenomenon and finds a solution very quickly for the LA problem with high probability (Setubal and Meidanis, 1997). Many biological systems such as Phrap (Green), STROLL (Chen and Skiena, 1997) and FASTA (Lipman and Pearson, 1998) are based on this algorithm.
The banded SW algorithm consists of the following two steps:
1. Determine bands
Find all the bands that can have local alignments with high similarity. Usually, a band is defined by the information of exact matches and if two or more bands overlap, they are merged into one band.
2. Run the SW algorithm
Run the SW algorithm on the defined bands. The entries of Hi.j outside the bands are assumed 0 and they are not computed.
We first present an algorithm for finding an optimally normalized local alignment (the NLA problem) and then present algorithms for finding all local alignments whose normalized alignment scores are above a given threshold Ts (the TNLA problem). Our algorithm for the NLA problem is used as a subroutine in our algorithms for the TNLA
problem.
We present an algorithm for the NLA problem of two strings a(\a\=m) and b(\b\=ri). Our algorithm is based on Dinkelbach’s algorithm which is a general algorithm to solve the NLA problem. We use the banded SW algorithm to solve the LA problem in Dinkelbach’s algorithm. Since we already described Dinkelbach’s algorithm and the banded SW algorithm, we concentrate on describing how to determine bands.
We first introduce some definitions before we describe how to determine the bands. Consider the H table generated by the SW algorithm in computing SLfa, b). The fth-diagonal (or diagonal i), Q<i<m(resp. -n<f<-1), represents a diagonal that passes through (0, i) (resp. (-i, 0)) element of the H table. A band is a set of diagonals and the width of a band is the number of diagonals in the band. A band i of width k is a set of diagonals from i through i+k-
1. We say that an exact match (i,j) occurs in diagonal k if a(i,j)= b(i+k,j+k). Exact match (i,j) is maximal if a-i^bi+k-i and ai+i=Abj+M. The length of exact match (i, j) is j-i+1.
We show how to determine the bands with given parameters 7), w, and 7k The parameter T is a threshold on the length of an exact match, w is the width of a band, and Tb is a threshold on the weight of a band. Determining the bands is composed of the following four steps:
1. Find all maximal exact matches longer than T in each diagonal. We first generate all suffixes of a and b and sort them in lexicographic order. Once all the suffixes of a and b are sorted, it is easy to find all maximal exact matches longer than T.
2. Compute the weight of each diagonal, where the weight of a diagonal is the sum of the lengths of all maximal exact matches longer than T in the diagonal.
3. Select every band b of width w whose weight is above Tb, where the weight of a band is the sum of the weights of all diagonals in the band.
4. Merge two bands Bi={di, dj} and B 2 ={dk, — di} into a
single band B={min(d, dk), •••, mwcfd, di)}, if they overlap (i.e., i<k<j or i<l<j). Repeat merging until no two bands overlap. (Fig. 1)
Parameters T, w, and Tb should be chosen appropriately so that an optimal normalized local alignment does not lie out of the bands. In Section 4, we suggest appropriate values for Ti, w, and Tb through experiments.
Our algorithm for the NLA problem first determines the bands as explained above and then runs the SW algorithm on the selected bands. The speedup achieved by our algorithm over Arslan et al.' s is due to the following two improvements.
■ We uses the banded SW algorithm to solve the LA
problem of Dinkelbach’s algorithm while Arslan et al. used the SW algorithm. Thus, the speedup achieved by the banded SW algorithm over the SW algorithm is reflected in our algorithm.
We perform the step of determining bands in the banded SW algorithm only once while we perform the banded SW algorithm several times in Dinkelbach’s algorithm. Since the bands are the sets of diagonals where long exact matches occur, the bands in all repetitions of the banded SW algorithm are the same. Hence, we need to determine the bands only once.
We now present algorithms for the TNLA problem of two strings a and b. We first give a simple algorithm that finds all normalized local alignments whose alignment scores are above a given threshold L (the TNL alignments) in such a way that a TNL alignment of larger alignment score is found earlier than a TNL alignment of smaller alignment score. Arslan etal. (Arslan etal., 2001) also found the TNL alignments in the same decreasing order. Then, we improve the simple algorithm by finding the TNL alignments in a somewhat different order.
The simple algorithm consists of two steps. Step 1 is performed once and step 2 is repeated until all TNL alignments are found.
Step 1. Find all maximal exact matches of a and b longer
than Ti and determine the bands as we did in Section 3.1.
Step 2. Find an optimal normalized local alignment of a and
b using the algorithm presented in Section 3.1. If the alignment score of the found alignment is above T, mask the found alignment and repeat
step 2. Otherwise, terminate.
The simple algorithm finds the normalized local alignments in decreasing order of alignment scores. Hence, it is guaranteed that every TNL alignment has been found when the simple algorithm terminates.
We consider the speedup achieved by the simple algorithm over Arslan et a/.’ s algorithm. Let p denote the number of the TNL alignments. Step 2 iterates just p times. Since the time for masking an alignment is negligible and our algorithm performs step 1 only once, our simple algorithm for the TNLA problem takes about p times our algorithm for the NLA problem. Also, Arslan et a/.’ s algorithm for the TNLA problem takes p times their algorithm for the NLA problem. Therefore, the speedup achieved by our simple algorithm over Arslan et al’ s algorithm for the TNLA problem is the speedup achieved by our algorithm over Arslan et a/.’s algorithm for the NLA problem.
As long as all the TNL alignments are guaranteed to be found, the order of the alignments that are found may not be important. If we find the TNL alignments in a different order, we can find them more efficiently. The improved algorithm is as follows.
Step 1. Find all maximal exact matches of a and b longer
than T/ and determine the bands as in Section 3.1. Let Bi, B2, “,Bk denote the bands.
Step 2. For each band B> for 1 <i<k, perform the following.
Find an optimal normalized local alignment within Bi by running the SW algorithm only in Bi. If the normalized alignment score of the found alignment is above L, mask the found alignment and repeat. Otherwise, we are done with Bi.
It is easy to see that all the TNL alignments have been
found when the improved algorithm terminates.
We consider the speedup achieved by the improved algorithm over the simple algorithm. Let C, l<i<k, denote the number of the TNL alignments included in band B. and
W the width of B,. In the improved algorithm, band B, accessed when we find each of C TNL alignments while
the simple algorithm, all bands Bi, B2, •••, B k are accessed
when we find each of G+C2+—+G TNL alignments. Note that the improved algorithm accesses B, at least once even if there is no TNL alignment in it. Hence, the speedup achieved by the improved algorithm over the simple algorithm is
Remark: We can consider affine gaps in both problems of
NLA and TNLA. The affine gap penalty for a gap is defined as 7+/*k, where 7 is a gap open penalty, p is an indel
penalty and k is the length of a gap. We can include the affine gap penalty in our algorithms for NLA and TNLA by a simple modification of the recurrence of the SW algorithm (Gotoh 1982; Waterman 1995).
We implemented and compared two algorithms for the NLA problem: Arslan et al.' s and ours, and three algorithms for the TNLA problem: Arslan et al’.s algorithm, our simple algorithm and our improved algorithm. In all our implementations, we considered the affine gaps.
We have chosen two data sets to test the efficiency of the algorithms: (i) human chromosome X cosmids Qc8D3 (GenBank Acc. No. AF030876 (113 kbp)) and mouse chromosome X clone BAC B22804 (GenBank Acc. No. AF121351 (123 kbp)) and (ii) human (GenBank Acc. No. L10347 (31 kbp)) and mouse (GenBank Acc. No. M65161 (32.4 kbp)) pro-alpha1 type II collagen. Since the repeats which are biologically uninteresting regions may have high normalized scores, we masked the repeats by RepeatMasker (http:// repeatmasker, genome.washington. edu/) before running all the algorithms.
We implemented all the algorithms in C++ language. The programs were run on Pentium III 866MHzx2 work station with 768MB main memory. The parameter values used in our algorithms are as follows: L=2000 (a constant in defining normalized scores), £=1 (mismatch penalty), 7 =6 (gap open penalty), /z=0.2 (indel penalty), and T=0.035 (threshold of normalized score).
Our algorithm finds the same optimal normalized local alignment as Arslan et al.’s algorithm does even when w=100 for both data sets. As shown in Table 1, our algorithm is about 20 times faster than Arslan et al.' s for both data sets when w> 150 and Ti< 150.
We have experimented with various values of the three parameters Ti, T> and w, and here we show nine cases for three values of Ti and three values of w and we have chosen an appropriate value of TI for each T. Note that we
decrease T b as Ti increases. Otherwise, good bands may
be discarded since longer exact matches are fewer than shorter exact matches.
Table 2 and Table 3 show execution times of the three algorithms for the TNLA problem. We first analyze the ratios of execution times of the three algorithms. Assume that the widths of all bands are the same and the TNL alignments are equally distributed in the bands which include TNL alignments. Let k be the number of the bands
that our algorithms determine, and k’ be the number of the bands which include TNL alignments. Also, let t be the average number of TNL alignments in a band. Then, the equation (1) can be rewritten as follows:
kk't
k't+k-k'
Hence, our improved algorithm is faster than Arslan et al’s by the following factor:
where 5 is the speedup of our simple algorithm over Arslan et al.’ s.
Consider the experimental results for the first data set in Table 2. When w=150, 7/=15 and T>=50, we have 5=80 and k=5, k’=5, t=\/3 (k, k’ are not shown in Table 2 since the number of TNL alignments is 15 as shown in Table 4). By the above formula, our improved algorithm should be approximately 400 times faster than Arslan et a/.’s and the result (631603/1895=330) is similar to our expectation. (Note that the time of the improved algorithm can vary depending on the value of 71 in each case.) For another example, when w=200, T<=12 and 71=80, we get 5=18, £=24, £ =5 and r=1/3. Thus, our improved algorithm should be about 200 times faster than Arslan etal’ s and the result (631603/3492=180) shows an approximately same ratio.
For the second data set, our algorithm is about 40 times faster than Arslan et al’ s when w=150, 77=15 and 71=50, and about 50 times faster when w=200, 77=12 and 71=80.
Table 4 shows the accuracy of our algorithms. For example, when w=100, 77=20 and 71=30 in Table 4, our algorithms find 12 out of 15 TNL alignments. But as w increases and T decreases, our algorithms are getting more accurate. Note that our algorithms find all TNL alignments when w> 150 and T< 15.
In addition to extremely fast speeds, another advantage of our algorithms is that we can change the parameters w, Ti and 71 to adjust the time and accuracy of our algorithms.
|
Annnotations
- Denotations: 0
- Blocks: 0
- Relations: 0