Anchored protein alignments
BAliBASE is a benchmark database for evaluating the performance of software programs for multiple protein alignment [37]. The database consists of a large number of protein families with known 3D structure. These structures are used to define so-called core blocks for which 'biologically correct' alignments are known. There are two scoring systems to evaluate the accuracy of multiple alignments on BAliBASE protein families. The BAliBASE sum-of-pairs score measures the percentage of correctly aligned pairs of amino acid residues within the core blocks. By contrast, the column score measures the percentage of correctly aligned columns in the core blocks; see [38,10] for more details. These BAliBASE scoring functions are not to be confused with the objective functions used by different alignment algorithms.
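To make the difference between the two scoring systems concrete, the following is a minimal sketch and not the official BAliBASE scoring program. It assumes a simplified representation in which an alignment is a list of columns and each column holds, for every sequence, either the position of a residue in that sequence or None for a gap. The function names are our own illustrative choices.

```python
from itertools import combinations

def aligned_pairs(columns):
    """Return the set of residue pairs that share a column.

    A residue is identified as (sequence index, position in sequence);
    gaps (None) are ignored.
    """
    pairs = set()
    for col in columns:
        residues = [(seq, pos) for seq, pos in enumerate(col) if pos is not None]
        pairs.update(combinations(residues, 2))
    return pairs

def sum_of_pairs_score(test_columns, core_columns):
    """Fraction of residue pairs in the core blocks that the test alignment reproduces."""
    reference = aligned_pairs(core_columns)
    return len(reference & aligned_pairs(test_columns)) / len(reference)

def column_score(test_columns, core_columns):
    """Fraction of core-block columns that appear unchanged in the test alignment."""
    test = {tuple(col) for col in test_columns}
    hits = sum(1 for col in core_columns if tuple(col) in test)
    return hits / len(core_columns)

# Toy example (assumed data): three sequences, two core-block columns.
core = [(0, 0, 0), (1, 1, 1)]
# In the test alignment the third sequence is shifted by one position.
test = [(0, 0, None), (1, 1, 0), (2, 2, 1)]

sp = sum_of_pairs_score(test, core)
cs = column_score(test, core)
```

In this toy case the first two sequences are still aligned correctly to each other, so some residue pairs count towards the sum-of-pairs score, whereas no complete column is reproduced and the column score is zero. This illustrates why the column score is the stricter of the two measures.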
Thus, alignment programs can be evaluated by their ability to correctly align these core blocks. BAliBASE covers various alignment situations, e.g. protein families with global similarity or protein families with large internal or terminal insertions or deletions. However, it is important to mention that most sequences in the standard version of BAliBASE are not real-world sequences, but have been artificially truncated by the database authors, who simply removed non-homologous C-terminal or N-terminal parts of the sequences. Only the most recent version of BAliBASE provides the original full-length sequence sets together with the previous truncated data. Therefore, most studies based on BAliBASE have a strong bias in favour of global alignment programs such as CLUSTAL W [1]; these programs perform much better on the BAliBASE data than they would perform on realistic full-length protein sequences. The performance of programs that are based on local sequence similarities, on the other hand, is systematically underestimated by BAliBASE. Despite this systematic error, test runs on BAliBASE can give a rough impression of the performance of multiple-alignment programs in different situations.
DIALIGN has been shown to perform well on those data sets in BAliBASE that contain large insertions and deletions. On the other hand, it is often outperformed by global alignment methods on those data sets where homology extends over the entire sequence length but similarity is low at the primary-sequence level. For the further development and improvement of the program, it is crucial to find out which components of DIALIGN are to blame for the inferiority of the program on this type of sequence families. One possibility is that biologically meaningful alignments on BAliBASE would have high numerical scores, but the greedy heuristic used by DIALIGN is inefficient and returns low-scoring alignments that do not align the core blocks correctly. In this case, one would use more efficient optimisation strategies to improve the performance of DIALIGN on BAliBASE. On the other hand, it is possible that the scoring function used in DIALIGN assigns the highest scores to biologically wrong alignments. In this case, an improved optimisation algorithm would not lead to any improvement in the biological quality of the output alignments, and it would be necessary to improve the objective function used by the program.
To find out which component of DIALIGN is to blame for its unsatisfactory performance on some of the BAliBASE data, we applied our program to BAliBASE (a) using the non-anchored default version of the program and (b) using the core blocks as anchor points in order to enforce biologically correct alignments of the sequences. We then compared the numerical DIALIGN scores of the anchored alignments to those of the non-anchored default alignments. The results of these program runs are summarised in Table 3. Summed over all reference sets, the numerical alignment scores of the (biologically correct) anchored alignments turned out to be slightly below the scores of the non-anchored default alignments (ratio 0.995); only for reference sets Ref4 and Ref5 were the summed anchored scores marginally higher.
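The comparison can be retraced with a few lines of arithmetic. The sketch below recomputes the anchored/non-anchored score ratios from the summed scores reported in Table 3; the variable and function names are illustrative assumptions and not part of DIALIGN. A ratio below 1 means that the default heuristic already found an alignment with a higher numerical score than the biologically correct anchored alignment, which points to the objective function rather than to the optimisation procedure.

```python
# Summed DIALIGN scores per BAliBASE reference set, copied from Table 3.
non_anchored = {"Ref1": 53613, "Ref2": 269009, "Ref3": 283273,
                "Ref4": 36515, "Ref5": 29214}
anchored = {"Ref1": 53417, "Ref2": 265966, "Ref3": 283136,
            "Ref4": 36611, "Ref5": 29257}

def score_ratio(anchored_score, default_score):
    """Ratio of the anchored score to the non-anchored (default) score."""
    return anchored_score / default_score

# Ratio for each reference set and for the totals over all sets.
ratios = {ref: score_ratio(anchored[ref], non_anchored[ref])
          for ref in non_anchored}
total_ratio = score_ratio(sum(anchored.values()), sum(non_anchored.values()))

for ref, ratio in ratios.items():
    verdict = "anchored higher" if ratio > 1 else "default higher"
    print(f"{ref}: {ratio:.4f} ({verdict})")
print(f"Total: {total_ratio:.4f}")
```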
Table 3  DIALIGN alignment scores for anchored and non-anchored alignment of five reference test sets from BAliBASE. As anchor points, we used the so-called core blocks in BAliBASE, thereby enforcing biologically correct alignments of the input sequences. The figures in the first and second rows refer to the sum of DIALIGN alignment scores of all protein families in the respective reference set. The third row gives the ratio of the anchored to the non-anchored score. The fourth row contains the number of sequence sets where the anchoring improved the alignment score together with the total number of sequence sets in this reference set. Our test runs show that on these test data, biologically meaningful alignments do not have higher DIALIGN scores than alignments produced by the default version of our program.
Alignment scores
                Ref1    Ref2     Ref3     Ref4    Ref5    Total
non-anchored    53,613  269,009  283,273  36,515  29,214  671,624
anchored        53,417  265,966  283,136  36,611  29,257  668,387
ratio           0.996   0.988    0.999    1.002   1.001   0.995
score improved  23/82   13/23    4/23     6/16    4/12    50/156

As an example, Figure 4 shows an alignment calculated by the non-anchored default version of DIALIGN for BAliBASE reference set lr69. This sequence set consists of four DNA-binding proteins and is a challenging alignment example as there is only weak similarity at the primary sequence level. These proteins contain three core blocks for which a reliable multiple alignment is known based on 3D-structure information. As shown in Figure 4, most of the core blocks are misaligned by DIALIGN because of the low level of sequence similarity. With the BAliBASE scoring system for multiple alignments, the default alignment produced by DIALIGN has a sum-of-pairs score of only 33%, i.e. 33% of the amino-acid pairs in the core blocks are correctly aligned. The column score of this alignment is 0%, i.e. not a single column of the core blocks is correctly aligned.
Figure 4  Anchored and non-anchored alignment of a set of protein sequences with known 3D structure (data set lr69 from BAliBASE [38]). Three core blocks for which the 'correct' alignment is known are shown in red, blue and green. (A) Alignment calculated by DIALIGN with default options. Most of the core blocks are misaligned. (B) Alignment calculated by DIALIGN with anchoring option. The first position of the third block has been used as anchor point, i.e. the program has been forced to align this column correctly. The remainder of the sequences is automatically aligned by DIALIGN given the constraints defined by this anchor point. Although only one single column has been used for anchoring, the three blocks are almost perfectly aligned.

We investigated how many anchor points were necessary to enforce a correct alignment of the three core blocks in this test example. As it turned out, it was sufficient to use one single column of the core blocks as anchor points, namely the first column of the third motif. Technically, this can be done by using three anchor points of length one each: one anchor point connecting the first position of this core block in sequence 1 with the corresponding position in sequence 2, another anchor connecting sequence 1 with sequence 3 and a third anchor connecting sequence 1 with sequence 4. Although our anchor points enforced the correct alignment only for a single column, most parts of the core blocks were correctly aligned, as shown in Figure 4. The BAliBASE sum-of-pairs score of the resulting alignment was 91%, while the column score was 90%, as 18 out of 20 columns of the core blocks were correctly aligned. As was generally the case for BAliBASE, the DIALIGN score of the (biologically meaningful) anchored alignment was lower than the score of the (biologically wrong) default alignment. The DIALIGN score of the anchored alignment was 9.82 compared with 11.99 for the non-anchored alignment, so here the score of the anchored alignment was around 18 percent below the score of the non-anchored alignment.
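The construction of the three length-one anchor points described above can be sketched as follows. This is only an illustration: the AnchorPoint record and its field layout are our own assumptions and are not meant to reproduce DIALIGN's anchor input format, and the residue positions are hypothetical placeholders, since the actual positions in the lr69 sequences are not given here.

```python
from typing import NamedTuple

class AnchorPoint(NamedTuple):
    """Illustrative record for one pairwise anchor (field layout is an assumption)."""
    seq_a: int    # number of the first sequence (1-based)
    seq_b: int    # number of the second sequence (1-based)
    start_a: int  # position of the anchored residue in the first sequence
    start_b: int  # position of the anchored residue in the second sequence
    length: int   # number of consecutive anchored positions

def single_column_anchors(column_positions, reference_seq=1):
    """Anchor one alignment column by linking the reference sequence to every other one.

    column_positions maps a sequence number to the position of the residue
    that belongs to the anchored column in that sequence.
    """
    ref_pos = column_positions[reference_seq]
    return [
        AnchorPoint(reference_seq, seq, ref_pos, pos, 1)
        for seq, pos in sorted(column_positions.items())
        if seq != reference_seq
    ]

# Hypothetical positions of the first column of the third core block
# in sequences 1 to 4 (placeholder values, not taken from BAliBASE).
positions = {1: 17, 2: 21, 3: 15, 4: 19}
anchors = single_column_anchors(positions)

for anchor in anchors:
    print(anchor)
```

Linking sequence 1 to each of the other three sequences is sufficient to fix the whole column: any alignment that is consistent with all three pairwise anchors places the four anchored residues in the same column, so no anchors between sequences 2, 3 and 4 are required.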