Alignment of tandem duplications
There are many situations where automated alignment procedures can produce biologically incorrect aligments. An obvious challenge are distantly related input sequences where homologies at the primary sequence level may be obscured by spurious random similarities. Another notorious challenge for alignment programs are duplications within the input sequences. Here, tandem duplications are particularly hard to align, see e.g. [22]. Specialised software tools have been developed to cope with the problems caused by sequence duplications [23]. For the segment-based alignment program DIALIGN, the situation is as follows. As described in previous publications, the program constructs pairwise and multiple alignments from pairwise local sequence similarities, so-called fragment alignments or fragments [17,16]. A fragment is defined as an un-gapped pair of equal-length segments from two of the input sequences. Based on statistical considerations, the program assigns a weight score to each possible fragment and tries to find a consistent collection of fragments with maximum total score. For pairwise alignment, a chain of fragments with maximum score can be identified [24]. For multiple sequence sets, all possible pairwise alignments are performed and fragments contained in these pairwise alignments are integrated greedily into a resulting multiple alignment.
As indicated in Figure 1, tandem duplications can create various problems for the above outlined alignment approach. In the following, we discuss two simple examples where duplications can confuse the segment-based alignment algorithm. Let us consider a motif that is duplicated in one or several of the input sequences S1,..., Sk. For simplicity, let us assume that our sequences do not share any significant similarity outside the motif. Moreover, we assume that the degree of similarity among all instances of the motif is roughly comparable. There are no difficulties if two sequences are to be aligned and the motif is duplicated in both sequences, i.e if one has instances  and  of the motif in sequence S1 and instances  and  of the same motif in sequence S2 as in Figure 1 (A). In such a situation, our alignment approach will correctly align  to  and  to  since, for pairwise alignment, our algorithm returns a chain of fragments with maximum total score.
Figure 1  Possible mis-alignments caused by tandem duplications in the segment-based alignment approach (DIALIGN). We assume that various instances of a motif are contained in the input sequence set and that the degree of similarity among the different instances is approximately equal. For simplicity, we also assume that the sequences do not share any similarity outside the conserved motif. Lines connecting the sequences denote fragments identified by DIALIGN in the respective pairwise alignment procedures. (A) If a tandem duplication occurs in two sequences, the correct alignment will be found since the algorithm identifies a chain of local alignments with maximum total score. (B) If a motif is duplicated in one sequence but only one instance M2 is contained in the second sequence, it may happen that M2 is split up and aligned to different instances of the motif in the first sequence. (C) If the motif is duplicated in the first sequence but only one instance of it is contained in sequences two and three, respectively, consistency conflicts can occur. In this case, local similarities identified in the respective pairwise alignments cannot be integrated into one single output alignment. To select a consistent subset of these pairwise similarities, DIALIGN uses a greedy heuristic. Depending on the degree of similarity among the instances of the motif, the greedy approach may lead to serious mis-alignments (D). Note that a strictly greedy algorithm could be confused by this situation and could align, for example,  to  in Figure 1 if the similarity among these two instances of the motif happens to be slightly stronger than the similarity among  and , and among  and , respectively. However, DIALIGN uses a greedy approach only for multiple alignment where an exact solution is not feasible, but for pairwise alignment, the program returns an optimal alignment with respect to the underlying objective function. Thus, under the above assumtion, a meaningful alignment will be produced even if  exhibits stronger similarity to  than to .
The trouble starts if a tandem duplication ,  occurs in S1 but only one instance of the motif, M2, is present in S2. Here, it can happen that the beginning of M2 is aligned to the beginning of  and the end of M2 is aligned to the end of  as in Figure 1 (B). DIALIGN is particularly susceptible to this type of errors since it does not use gap penalties. The situation is even more problematic for multiple alignment. Consider, for example, the three sequences S1, S1, S3 in Figure 1 (C), where two instances ,  of a motif occur in S1 while S2 and S3 each contain only one instance of the motif M2 and M3, respectively. Under the above assumptions, a biologically meaningful alignment of these sequences would certainly align S2 to S3, and both motifs would be aligned either to  or to  – depending on the degree of similarity of S2 and S3 to  and , respectively. Note that such an alignment would also receive a high numerical score since it would involve three pairwise alignments of the conserved motif. However, since the pairwise alignments are carried out independently for each sequence pair, it may happen that the first instance of the motif in sequence S1,  is aligned to M2 but the second instance, , is aligned to M3 in the respective pairwise alignments as in Figure 1 (C). Thus, the correct alignment of M2 and M3 will be inconsistent with the first two pairwise alignments. Depending on the degree of similarity among the motifs, alignment of  and M3 may be rejected in the greedy algorithm, so these motifs may not be aligned in the resulting multiple alignment. It is easy to see that the resulting multiple alignment would not only be biologically questionable, but it would also obtain a numerically lower score as it would involve only two pairwise alignments of the motif.