Background
Multiple sequence alignment is a crucial prerequisite for biological sequence data analysis, and a large number of multi-alignment programs have been developed during the last twenty years. Standard methods for multiple DNA or protein alignment are, for example, CLUSTAL W [1], DIALIGN [2] and T-COFFEE [3]; an overview of these tools and other established methods is given in [4]. Recently, some new alignment approaches have been developed, such as POA [5], MUSCLE [6] and PROBCONS [7]. These programs are often superior to previously developed methods in terms of alignment quality and computational cost. The performance of multi-alignment tools has been studied extensively using various sets of real and simulated benchmark data [8-10].
All of the above-mentioned alignment methods are fully automated, i.e., they construct alignments following a fixed set of algorithmic rules. Most methods use a well-defined objective function that assigns a numerical quality score to every possible output alignment of an input sequence set, and they try to find an optimal or near-optimal alignment according to this objective function. In this process, a number of program parameters such as gap penalties can be adjusted. While the overall influence of these parameters is quite obvious, there is usually no direct way of influencing the outcome of an alignment program.
Automated alignment methods are clearly necessary and useful where large amounts of data are to be processed or in situations where no additional expert information is available. However, if a researcher is familiar with a specific sequence family under study, he or she may already know certain parts of the sequences that are functionally, structurally or phylogenetically related and should therefore be aligned to each other. In situations where automated programs fail to align these regions correctly, it is desirable to have an alignment method that would accept such user-defined homology information and would then align the remainder of the sequences automatically, respecting these user-specified constraints.
The interactive program MACAW [11] can be used for semi-automatic alignment with user-defined constraints; similarly, the program OWEN [12,13] accepts anchor points for pairwise alignment. Multiple-alignment methods accepting pre-defined constraints have also been proposed by Myers et al. [14] and Sammeth et al. [15]. The multi-alignment program DIALIGN [16,17] has an option that can be used to calculate alignments under user-specified constraints. Originally, this program feature was introduced to reduce the alignment search space and program running time for large genomic sequences [18,19]; see also [20]. At the Göttingen Bioinformatics Compute Server (GOBICS), we provide a user-friendly web interface where anchor points can be used to guide the multiple alignment procedure [21]. Herein, we describe our anchored-alignment approach in detail using a previously introduced set-theoretical alignment concept. We apply our method to genomic sequences of the Hox gene clusters. For these sequences, the default version of DIALIGN produces serious mis-alignments where entire genes are incorrectly aligned, but meaningful alignments can be obtained if the known gene boundaries are used as anchor points.
In addition, our anchoring procedure can be used to obtain information for the further development of alignment algorithms. To improve the performance of automatic alignment methods, it is important to know what exactly goes wrong in those situations where these methods fail to produce biologically reasonable alignments. In principle, there are two possible reasons for failures of alignment programs. It is possible that the underlying objective function is 'wrong', i.e. that it assigns high numerical scores to biologically meaningless alignments. But it is also possible that the objective function is 'correct' (i.e. biologically correct alignments have numerically optimal scores) and that the employed heuristic optimisation algorithm fails to return mathematically optimal or near-optimal alignments. The anchoring approach that we implemented can help to find out which component of our alignment program is to blame if automatically produced alignments are biologically incorrect.
One result of our study is that anchor points can not only improve the biological quality of the output alignments but can, in certain situations, lead to alignments with significantly higher numerical scores. This demonstrates that the heuristic optimisation procedure used in DIALIGN may produce output alignments with scores far below the optimum for the respective data set. The latter result has important consequences for the further development of our alignment approach: it seems worthwhile to develop more efficient algorithms for the optimisation problem that arises in the context of the DIALIGN algorithm. In other situations, the numerical scores of biologically correct alignments turned out to be below the scores of biologically wrong alignments returned by the non-anchored version of our program. Here, improved optimisation algorithms will not lead to biologically more meaningful alignments. It is therefore also promising to develop improved objective functions for our alignment approach.
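The diagnostic reasoning of the two preceding paragraphs can be summarised in a short sketch. It is only an illustration under stated assumptions, not part of DIALIGN: the function name `diagnose_failure`, its parameters and the returned labels are hypothetical, and the sketch presupposes that the anchored alignment is biologically correct, that the non-anchored alignment is biologically wrong, and that both were scored with the same objective function.

```python
def diagnose_failure(score_unanchored: float,
                     score_anchored: float,
                     tolerance: float = 0.0) -> str:
    """Suggest which component is responsible for a biologically wrong alignment.

    Assumptions (hypothetical helper, not part of DIALIGN):
      - score_unanchored: score of the biologically wrong alignment returned
        by the non-anchored program run;
      - score_anchored: score of the biologically correct alignment obtained
        with anchor points, under the same objective function;
      - tolerance: score differences up to this size are treated as ties.
    """
    if score_anchored > score_unanchored + tolerance:
        # A better-scoring alignment exists that the heuristic did not find
        # on its own: the optimisation procedure is (at least partly) at fault.
        return "optimisation heuristic"
    if score_anchored < score_unanchored - tolerance:
        # The objective function prefers the biologically wrong alignment:
        # a better optimiser alone would not help here.
        return "objective function"
    # Scores are indistinguishable; the comparison does not decide the question.
    return "inconclusive"


if __name__ == "__main__":
    # Invented scores, for illustration only.
    print(diagnose_failure(score_unanchored=100.0, score_anchored=150.0))
    print(diagnose_failure(score_unanchored=100.0, score_anchored=80.0))
```

The scores in the usage lines are invented for illustration and do not correspond to any result reported here.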