Applications to Hox gene clusters

As explained above, tandem duplications pose a hard problem for automatic alignment algorithms. Clusters of such paralogous genes are therefore particularly hard to align. As a real-life example we consider here the Hox gene clusters of vertebrates. Hox genes code for homeodomain transcription factors that regulate the anterior/posterior patterning in most bilaterian animals [26,27]. This group of genes, together with the so-called ParaHox genes, arose early in metazoan history from a single ancestral "UrHox gene" [28]. Their early evolution was dominated by a series of tandem duplications. As a consequence, most bilaterians share at least eight distinct types (eight in arthropods, 13 or 14 in chordates), usually referred to as paralogy groups. These Hox genes are usually organised in tightly linked clusters such that the genes at the 5' end (paralogy groups 9–13) determine features at the posterior part of the animal while the genes at the 3' end (paralogy groups 1–3) determine the anterior patterns.

In contrast to all known invertebrates, all vertebrate lineages investigated so far exhibit multiple copies of Hox clusters that presumably arose through genome duplications in early vertebrate evolution and later in the actinopterygian (ray-finned fish) lineage [29-33]. These duplication events were followed by massive loss of the duplicated genes in different lineages; see e.g. [34] for a recent review of the situation in teleost fishes. The individual Hox clusters of gnathostomes have a length of some 100,000 nt and share, besides a set of homologous genes, a substantial amount of conserved non-coding DNA [35] that predominantly consists of transcription factor binding sites. Most recently, however, some of these "phylogenetic footprints" were identified as microRNAs [36]. Figures 2 and 3 show four of the seven Hox clusters of the pufferfish Takifugu rubripes.
Although the Hox genes within a paralogy group are significantly more similar to each other than to members of other paralogy groups, several features make this dataset particularly difficult and tend to mislead automatic alignment procedures: (1) None of the 13 Hox paralogy groups, nor the Evx gene, is present in all four sequences. (2) Two genes, HoxC8a and HoxA2a, are present in only a single sequence. (3) The clusters have different sizes and numbers of genes (33481 nt to 125385 nt, 4 to 10 genes).

Figure 2 The pufferfish Takifugu rubripes has seven Hox clusters, of which we use four in our computational example. The Evx gene, another homeodomain transcription factor, is usually linked with the Hox genes and can be considered part of the Hox cluster. The paralogy groups are indicated. Filled boxes indicate intact Hox genes; the open box indicates a HoxA7a pseudogene [45].

Figure 3 Result of a DIALIGN run on the Hox sequences from Figure 2 without anchoring. The diagram represents sequences and gene positions to scale. All incorrectly aligned segments (defined as parts of a gene that are aligned with parts of a gene from a different paralogy group) are indicated by lines between the sequences.

We observe that, without anchoring, DIALIGN mis-aligns many of the Hox genes in this example by matching blocks from one Hox gene with parts of a Hox gene from a different paralogy group. As a consequence, genes that should be aligned, such as HoxA10a and HoxD10a, are not aligned with each other. Anchoring the alignment, perhaps surprisingly, increases the number of columns that contain aligned sequence positions from 3870 to 4960, i.e., by about 28%; see Table 2. At the same time, the CPU time is reduced by almost a factor of 3. We investigated not only the biological quality of the anchored and non-anchored alignments but also looked at their numerical scores.
Note that in DIALIGN, the score of an alignment is defined as the sum of the weight scores of the fragments it is composed of [17]. For some sequence sets we found that the score of the anchored alignment was above that of the non-anchored alignment, while for other sequences the non-anchored score exceeded the anchored one. For example, with the sequence set shown in Figure 2, the score of the anchored alignment, which is biologically more meaningful, was more than 13% below that of the non-anchored alignment (see Table 2). In contrast, another sequence set with five HoxA cluster sequences (TrAa, TnAa, DrAb, TrAb, TnAb) from three teleost fishes (Takifugu rubripes, Tr; Tetraodon nigroviridis, Tn; Danio rerio, Dr) yields an anchored alignment score that is some 15% above the non-anchored score.

Table 1 Effect of different anchors in the Fugu example of Figure 2. We consider aligned sequence positions in intergenic regions (i.e., outside the coding regions and introns) only. Column 2 gives the number of sequence positions for which DIALIGN added at least one additional sequence that was not represented in the original TRACKER footprint. Column 3 lists the total number of nucleotides in footprints that were not detected by TRACKER but were aligned by anchored DIALIGN.

anchor                   nt positions in footprints
                         total    expanding    new
none                     1546     0            618
genes                    1686     39           694
genes and BLASTZ hits    2433     39           841

Table 2 Aligned sequence positions that result from fragment alignments in the Fugu Hox cluster example. To compare these alignments, we counted the number of columns where two, three or four residues are aligned, respectively. Here, we counted only upper-case residues in the DIALIGN output since lower-case residues are not considered to be aligned by DIALIGN. The number of columns in which two or three residues are aligned increases when more anchors are used, while the number of columns in which all sequences are aligned decreases.
This is because in our example no single Hox gene is contained in all four input sequences, see Figure 2. Therefore a biologically correct alignment of these sequences should not contain columns with four residues. CPU times are measured on a PC with two Intel Xeon 2.4 GHz processors and 1 GB of RAM.

anchor         alignment length    aligned sequences         CPU time    score
                                   2       3       4
none           281759              2958    668     244       4:22:07     1166
genes          252346              3674    1091    195       1:18:12     1007
BLASTZ hits    239326              4036    1139    33        0:19:32     742
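The column counts in Table 2 can be reproduced with a short script. The following is a minimal sketch, not the procedure actually used for the paper: it assumes the alignment is available as equal-length rows in which upper-case letters mark aligned residues and lower-case letters or gap characters mark unaligned positions. The function name `count_aligned_columns` and the toy alignment are hypothetical; only the numbers in `table2` and `scores` are taken from Table 2.

```python
from collections import Counter


def count_aligned_columns(rows):
    """Count alignment columns by their number of upper-case residues.

    rows: equal-length strings, one per sequence. Upper-case letters are
    treated as aligned residues; lower-case letters and '-' are ignored.
    Only columns with at least two aligned residues are counted.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    counts = Counter()
    for column in zip(*rows):
        n_aligned = sum(1 for ch in column if ch.isupper())
        if n_aligned >= 2:
            counts[n_aligned] += 1
    return counts


# Toy alignment with invented data (not taken from the Hox clusters).
toy = [
    "ACGTacg--T",
    "ACG-ttgACT",
    "--GTaccAC-",
    "acGTgg--CT",
]
toy_counts = count_aligned_columns(toy)

# Columns with 2, 3 and 4 aligned residues, as reported in Table 2.
table2 = {"none": (2958, 668, 244), "genes": (3674, 1091, 195)}
total_none = sum(table2["none"])
total_genes = sum(table2["genes"])
relative_increase = total_genes / total_none - 1.0

# Alignment scores from Table 2 (no anchors vs. genes as anchors).
scores = {"none": 1166, "genes": 1007}
score_drop = 1.0 - scores["genes"] / scores["none"]

print(dict(toy_counts))
print(total_none, total_genes, f"{relative_increase:.1%}")
print(f"{score_drop:.1%}")
```

Summing the three column classes per row gives the totals quoted in the text (3870 without anchors, 4960 with genes as anchors), and the same two rows give the relative score difference between the anchored and the non-anchored alignment.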