Results and discussion

First, we established an extended RNA alignment database for benchmarking (BRAliBase 2.1) as described in Methods. The datasets are based on the hand-curated seed alignments of 36 RNA families taken from Rfam version 7.0 [24,23]. In total, BRAliBase 2.1 contains 18,990 aligned sets of sequences; the individual sets consist of 2, 3, 5, 7, 10, and 15 sequences, respectively (see Table 1), with 20 % ≤ APSI ≤ 95 %. To test the performance of an alignment program, or the influence of program parameters on performance, we removed all gaps from the datasets, realigned them with the program under test, and scored the new alignments with a modified sum-of-pairs score (SPS') and the structure conservation index (SCI). The SPS' measures the identity between test and reference alignments, whereas the SCI measures the conservation of consensus secondary structure; for details see Methods. The two scores are multiplied to yield the final RNA alignment score, termed BRALISCORE. To rank parameter settings and options of individual programs, or different programs against each other, we used Friedman rank sum and Wilcoxon signed rank tests; for details see Methods. Different program options, or even different programs, produced only small differences in alignment quality on datasets with APSI above 80 %, in accordance with previous benchmark results [22]. Because the alignment problem is almost trivial at these high identities, and to reduce the number of alignments to be computed, we report all results only for datasets with APSI ≤ 80 %.

Table 1 Number of reference alignments and average Structure Conservation Index (SCI) for alignments of k sequences.

              k2           k3     k5           k7     k10    k15    total
  no. aln.    8976 (118)   4835   2405 (481)   1426   845    504    18990
  ∅ SCI       0.95 (1.05)  0.92   0.91 (0.87)  0.90   0.89   0.89   0.93

Values for the previously used data-set1 [22] are given in brackets.
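The combination of the two quality measures described above can be sketched as follows; the function name is ours, not from the benchmark's actual implementation, and both inputs are assumed to be precomputed on a 0–1 scale.

```python
def braliscore(sps_prime: float, sci: float) -> float:
    """Combine sequence-based and structure-based alignment quality.

    sps_prime: modified sum-of-pairs score (identity between test and
               reference alignment).
    sci:       structure conservation index of the test alignment.
    The BRALISCORE is simply their product, so an alignment must do
    well on both criteria to obtain a high value.
    """
    return sps_prime * sci

# An alignment that reproduces the reference well but destroys the
# consensus structure is penalized heavily:
print(round(braliscore(0.9, 0.3), 2))   # 0.27
print(round(braliscore(0.8, 0.85), 2))  # 0.68
```

The multiplicative combination means that neither sequence identity alone nor structure conservation alone can compensate for a failure on the other criterion.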
Optimizing gap costs

With reference alignments compiled specifically for benchmarking RNA alignments, program parameters can now be optimized specifically for RNA. Parameters for MAFFT version 5 [25] were optimized by K. Katoh using BRAliBase II's data-set1 [22]. The gap-cost values of MAFFT version 4 (gap-open penalty op = 0.51 and gap-extension penalty ep = 0.041) turned out to be far too low. Applying the improved values (op = 1.53 and ep = 0.123; the defaults in versions ≥ 5.667) to the new BRAliBase 2.1 datasets results in a dramatic performance gain (exemplified in Figure 1 for alignment sets of five sequences). Similarly, parameters for MUSCLE [16,26] have been optimized by its author.

Figure 1 MAFFT (FFT-NS-2) and ClustalW performance with optimized and old parameters. PROALIGN (earlier identified as a good aligner [22]) is included as a reference. Performance is measured as BRALISCORE vs. reference APSI and exemplified for k = 5 sequences. MAFFT version 5.667 was used with optimized parameters (the defaults in version 5.667) and with the old parameters of version 4, respectively; CLUSTALW was used either with default parameters or with optimized parameters (see Table 2 and text).

Motivated by the successful optimization of the MAFFT and MUSCLE parameters, we searched for optimal gap costs for CLUSTALW [27,28]. We varied the gap-open (go) penalty from 7.5 to 30.0 and the gap-extension (ge) penalty from 0.42 to 9.99 (CLUSTALW's default values for RNA/DNA sequences are go = 15.0 and ge = 6.66). Ranks derived from Friedman tests were averaged over all alignment sets, i.e. sets of 2, 3, 5, 7, 10, and 15 sequences. Table 2 summarizes the results. Alignments created with higher gap-open penalties score significantly better. A combination of go = 22.5 and ge = 0.83 is optimal within the tested parameter range.
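The rank averaging underlying these comparisons can be sketched as follows: for every benchmark alignment, the tested parameter combinations are ranked by their BRALISCORE, and each combination's ranks are then averaged over all alignments. The scores below are invented for illustration, and this sketch produces only the average ranks; the actual benchmark additionally assesses significance with the Friedman rank sum statistic (available, e.g., as scipy.stats.friedmanchisquare).

```python
def rank_desc(values):
    """Rank scores within one alignment: best (highest) score -> rank 1."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

# Rows = benchmark alignments, columns = parameter combinations
# (illustrative BRALISCOREs, not real benchmark data).
scores = [
    [0.71, 0.80, 0.78],
    [0.55, 0.62, 0.60],
    [0.90, 0.91, 0.89],
]

per_alignment_ranks = [rank_desc(row) for row in scores]
n = len(scores)
avg_ranks = [sum(col) / n for col in zip(*per_alignment_ranks)]
print(avg_ranks)  # smaller average rank = consistently better parameters
```

Because ranks are computed per alignment, a parameter combination that wins modestly but consistently ends up with a small average rank even if its absolute score advantage is tiny.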
It should be noted that this performance gain results mainly from a better SCI, whereas the SPS' remains almost unchanged.

Table 2 Averaged ranks derived from Friedman rank sum tests for ClustalW's gap parameter optimization.

  go \ ge   0.42    0.83    1.67   3.33   4.99    6.66    8.32   9.99
  7.5       56.0    55.0    54.0   53.0   51.2    50.0    47.0   42.8
  11.25     47.5    44.0    41.5   37.2   34.5    27.3    28.2   31.5
  15.0      20.8    24.0    20.0   14.5   13.5   *15.5*   22.3   29.3
  18.75     10.8    8.3     8.2    7.5    11.3    20.8    27.5   35.8
  22.5      4.7    *2.8*    3.7    8.8    17.7    27.0    34.5   39.2
  26.25     5.8     5.5     8.8    17.5   31.2    36.7    42.3   46.2
  30.0      15.2    17.2    22.8   32.8   39.3    45.0    49.0   51.5

Ranks (smaller values mean better performance) for each gap-open (go)/gap-extension (ge) penalty combination are based on the BRALISCORE averaged over all alignment sets with k ∈ {2, 3, 5, 7, 10, 15} sequences and APSI ≤ 80 %. CLUSTALW's default and the optimized value combinations are marked with asterisks.

Similarly, we optimized gap values for the recently published PRANK [29]. Averaged ranks are given in Table 3. The default values (go = 0.025 and ge = 0.5) are too high. Owing to computation time we did not test all parameter combinations; the optimal values found so far are ten times lower than the defaults. One should bear in mind that Friedman rank tests do not indicate to what degree a particular program or option performs better, only that it performs better consistently. The actual performance gain can be visualized by plotting BRALISCORE vs. reference APSI (see Figure 1). For MAFFT the new options yield an extreme performance gain, whereas for CLUSTALW gap parameter optimization yields only a modest improvement, indicating that CLUSTALW's default options are already near-optimal. In both cases optimized parameters have their greatest impact at sequence identities ≤ 55 % APSI.

Table 3 Averaged ranks derived from Friedman rank sum tests for PRANK's gap parameter optimization.
  go \ ge    0.05   0.125   0.1875   0.25   0.375    0.5
  0.0025     3.5    2.0     4.8      NA     NA       NA
  0.00625    6.8    3.5     3.2      NA     NA       NA
  0.00938    8.8    6.5     8.0      NA     NA       NA
  0.0125     NA     NA      NA       8.2    11.0     13.5
  0.01875    NA     NA      NA       12.8   12.5     15.8
  0.025      NA     NA      NA       15.8   17.2    *19.0*
  0.03125    NA     NA      NA       20.0   22.0     23.8
  0.0375     NA     NA      NA       25.0   27.0     27.8

Ranks (smaller values mean better performance) for each gap-open (go)/gap-extension (ge) value combination are averaged over all alignment sets with k ∈ {5, 7, 10, 15} sequences and APSI ≤ 80 %. The default option of PRANK version 1508b is marked with asterisks. Values for the k2 and k3 sets are missing because PRANK crashed repeatedly on these sets, and the Friedman tests require complete data.

Choice of substitution matrices

Every alignment program uses a substitution matrix to score character replacements during the alignment process. Traditionally these matrices distinguish between transitions (purine-to-purine and pyrimidine-to-pyrimidine substitutions) and transversions (purine to pyrimidine and vice versa), but more complex matrices have been described in the literature. An example of the latter are the RIBOSUM matrices [30], used by RSEARCH to score alignments of single-stranded regions. To test whether incorporating RIBOSUM matrices yields a significant performance change, we used the RIBOSUM 85–60 4 × 4 matrix as the substitution matrix for CLUSTALW, ALIGN-M, and POA, as these programs allow easy integration of non-default substitution matrices via command-line options. Since gap costs and substitution matrix values are interdependent, we adjusted the original RIBOSUM values to the range of the default values. We applied Wilcoxon tests to determine whether using the RIBOSUM matrix instead of the simpler default matrices yields a statistically significant performance change. The results are summarized in Table 4. POA and ALIGN-M perform significantly better; only CLUSTALW's performance suffers from RIBOSUM utilization.
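The paired comparison behind these Wilcoxon tests can be sketched as follows: for each benchmark alignment, the two matrices yield a pair of BRALISCOREs, and the signed-rank sums are computed from the ranked score differences (zero differences discarded, tied differences given average ranks). This sketch computes only the test statistic, not the p-value (in practice one would use, e.g., scipy.stats.wilcoxon), and the scores below are invented for illustration.

```python
def signed_rank_sums(x, y):
    """Wilcoxon signed-rank sums for paired scores x vs. y.

    Zero differences are discarded; tied |differences| receive average
    ranks. Returns (W+, W-): the rank sums of positive and negative
    differences, i.e. of alignments where x resp. y scored better.
    """
    # Round to suppress floating-point noise so ties are detected.
    diffs = [round(a - b, 9) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the tied ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Invented per-alignment BRALISCOREs for one program with two matrices:
ribosum_scores = [0.63, 0.59, 0.69, 0.66, 0.58]
default_scores = [0.60, 0.55, 0.70, 0.62, 0.58]
print(signed_rank_sums(ribosum_scores, default_scores))
```

A strongly asymmetric pair of rank sums (here W+ = 9, W− = 1) is what drives a significant "+" or "−" verdict once the corresponding p-value is computed.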
The reason for CLUSTALW's performance loss is not obvious to us; it may be that CLUSTALW's dynamic variation of gap penalties in a position- and residue-specific manner [27] works optimally only with CLUSTALW's default matrix. Furthermore, the RIBOSUM 4 × 4 matrix is based on nucleotide substitutions in single-stranded regions, whereas we used it as a general substitution matrix. Other matrices, based on base-paired as well as loop regions from a high-quality alignment of ribosomal RNA [31], nevertheless gave no significantly different results (data not shown).

Table 4 Comparison of default vs. RIBOSUM substitution matrix by Wilcoxon tests.

  Program     k2   k3   k5   k7   k10   k15
  ALIGN-M     /    +    +    +    /     /
  CLUSTALW    -    -    -    -    -     -
  POA         +    +    +    /    /     /

A "+" indicates that the RIBOSUM 85–60 matrix yielded a statistically significant performance increase compared with the default matrix; a "-" indicates that the default matrix scores significantly better; a "/" indicates that no statistically significant difference was found.

Effect of sequence number on performance

A major improvement of the BRAliBase 2.1 datasets over BRAliBase II is the increased range of sequence numbers per set. This allows us, for example, to test the influence of sequence number on the performance of alignment programs. It has already been shown for protein alignments that iterative alignment strategies generally outperform progressive approaches [10]. The same holds for RNA alignments: with increasing sequence number and decreasing sequence identity, iterative programs perform increasingly better than non-iterative approaches. Figure 2 demonstrates this for PRRN, a representative of the iterative alignment approach, and CLUSTALW, the standard progressive, non-iterative alignment program. The effect is again most notable in the low sequence identity range (APSI < 55 %).
In this range, alignment errors occur that can be corrected during the refinement stage of iterative programs. The same can be demonstrated for other iterative vs. non-iterative program pairs, such as MAFFT or MUSCLE vs. POA or PROALIGN (see supplementary plots on our website [32]).

Figure 2 Performance of PRRN compared to ClustalW in dependence on sequence number per alignment. The plot shows the score difference between PRRN, as a representative of the iterative alignment approach, and CLUSTALW (standard options), as a representative of the progressive approach.

Relative performance of RNA sequence alignment programs

To find the sequence alignment program that performs best in all non-trivial situations (i.e. reference APSI ≤ 80 %), we compared all programs previously identified as top-ranking [22]. Where available, we used the newest program versions and optimized parameters. We also included the RNA version of PROBCONS [33] (PROBCONSRNA; see [34]), whose parameters were estimated by training on the BRAliBase II datasets. We applied Friedman rank sum tests to each alignment set with a fixed number of sequences. The results are summarized in Table 5. MAFFT version 5 [25] with the option "G-INS-i" ranks first across all test sets. This option is suitable for sequences of similar length, is recommended for up to 200 sequences, and uses an iterative (COFFEE-like [35]) refinement method incorporating global pairwise alignment information. It clearly outperforms the default option "FFT-NS-2", which uses only a progressive alignment method. MUSCLE and PROBCONSRNA rank second and third, respectively.

Table 5 Ranks determined by Friedman rank sum tests for all top-ranking programs.
  Program/Option         k2   k3   k5   k7   k10   k15
  CLUSTALW (default)     8    7    8    8    7     7
  CLUSTALW (optimized)   6    6    7    7    6     6
  MAFFT (FFT-NS-2)       2    4    4    4    5     5
  MAFFT (G-INS-i)        1    1    1    1    1     1
  MUSCLE                 3    3    3    2    2     2
  PCMA                   9    10   10   10   10    10
  POA                    7    8    9    9    9     9
  PROALIGN               5    5    6    6    8     8
  PROBCONSRNA            4    2    2    3    3     4
  PRRN                   10   9    5    5    4     3

Programs were ranked according to BRALISCORE averaged over all alignment sets with k ∈ {2, 3, 5, 7, 10, 15} sequences and APSI ≤ 80 %. MAFFT (G-INS-i) is the top-performing program on all test sets. For program versions and options see Methods.
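As a practical note, the G-INS-i strategy corresponds to a documented MAFFT command-line invocation. The sketch below merely assembles those flags (it does not run MAFFT), and the input filename is a placeholder.

```python
def ginsi_command(infile):
    """Build the MAFFT G-INS-i invocation: global pairwise alignment
    information (--globalpair) plus iterative refinement
    (--maxiterate 1000). MAFFT also ships a `ginsi` wrapper script
    equivalent to these flags."""
    return ["mafft", "--globalpair", "--maxiterate", "1000", infile]

# Placeholder input file; pass the list to e.g. subprocess.run().
print(" ".join(ginsi_command("trna.fa")))
```

Keeping the command as an argument list (rather than a single shell string) makes it safe to hand directly to subprocess.run() without shell quoting issues.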