Reference alignments
To construct the reference alignments we used "seed" alignments from the Rfam database, version 7.0 [24,23]. In most cases these alignments are hand-curated and thus of higher quality than Rfam's "full" alignments, which are generated automatically by the INFERNAL RNA profile package [40]. Alignments with fewer than 50 sequences were discarded to increase the chance that sub-alignments could be created (see below). The SCI (see below), which we use to score structural alignment quality, combines thermodynamic and covariation measures. Thermodynamic structure prediction becomes increasingly inaccurate with increasing sequence length, e.g. due to kinetic effects, but is widely regarded as sufficiently accurate for sequences not exceeding 300 nt [41,42]. We therefore excluded alignments with an average sequence length above 300 nt to ensure proper thermodynamic scoring.
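The two exclusion criteria above can be sketched as a small filter. This is only an illustration, not the pipeline actually used: the function names, the choice of `-` and `.` as gap symbols, and the reading of "average sequence length" as the mean ungapped length are assumptions.

```python
from statistics import mean

GAP_CHARS = "-."        # assumed gap symbols; adjust to the alignment format in use
MIN_SEQUENCES = 50      # alignments with fewer sequences are discarded
MAX_MEAN_LENGTH = 300   # nt; alignments with a longer average length are excluded


def ungapped_length(aligned_seq: str) -> int:
    """Number of residues in an aligned sequence, ignoring gap symbols."""
    return sum(1 for c in aligned_seq if c not in GAP_CHARS)


def passes_seed_filter(aligned_seqs: list[str]) -> bool:
    """Return True if a seed alignment meets both criteria (hypothetical helper)."""
    if len(aligned_seqs) < MIN_SEQUENCES:
        return False
    return mean(ungapped_length(s) for s in aligned_seqs) <= MAX_MEAN_LENGTH
```

An alignment with exactly 50 sequences or an average length of exactly 300 nt is kept, following the wording "fewer than 50" and "above 300 nt".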
To each remaining seed alignment we applied a "naive" combinatorial approach that extracts sub-alignments with k ∈ {2, 3, 5, 7, 10, 15} sequences for a given range of average pairwise sequence identity (APSI; a measure of sequence homology computed with ALISTAT from the squid package [43]). To this end, we computed identities for all sequence pairs of an alignment and selected those pairs whose identity lay within the desired APSI ± 10 %. From the sequences of the selected pairs we randomly picked k unique sequences. Additionally, we dropped all alignments with an SCI below 0.6 to ensure the structural quality of the alignments and to make sure that the SCI can be applied later to score the test alignments. In this way we generated a total of 18,990 reference alignments with an average SCI of 0.93; by comparison, data-set1 used in [22] consists of only 388 alignments with an average SCI of 0.89. For further details see Tables 1 and 6.
Table 6  Number of reference alignments for each RNA family (k2 to k15: sub-alignments with k = 2 to 15 sequences; ∑: totals)
RNA family  k2  k3  k5  k7  k10  k15  ∑
5S_rRNA  1162  568  288  150  90  50  2308
5_8S_rRNA  76  45  17  5  3  0  146
Cobalamin  188  61  15  4  0  0  268
Entero_5_CRE  48  32  19  10  8  5  122
Entero_CRE  65  38  20  13  8  4  148
Entero_OriR  49  31  17  11  8  4  120
gcvT  167  67  22  12  3  1  272
Hammerhead_1  53  32  9  1  0  0  95
Hammerhead_3  126  99  52  32  17  12  338
HCV_SLIV  98  63  36  26  16  10  249
HCV_SLVII  51  33  19  13  10  7  133
HepC_CRE  45  29  18  11  7  3  113
Histone3  84  59  27  11  7  6  194
HIV_FE  733  408  227  147  98  56  1669
HIV_GSL3  786  464  246  151  95  61  1803
HIV_PBS  188  124  76  55  38  25  506
Intron_gpII  181  82  35  22  11  4  335
IRES_HCV  764  403  205  146  83  47  1648
IRES_Picorna  181  117  75  53  35  25  486
K_chan_RES  124  40  2  0  0  0  166
Lysine  80  48  30  17  7  3  185
Retroviral_psi  89  57  34  24  17  11  232
SECIS  114  67  33  16  11  6  247
sno_14q I_II  44  14  1  0  0  0  59
SRP_bact  114  76  39  19  12  7  267
SRP_euk_arch  122  94  42  21  12  6  297
S_box  91  51  25  12  7  2  188
T-box  18  8  0  0  0  0  26
TAR  286  165  92  62  42  28  675
THI  321  144  69  32  17  5  588
tRNA  2039  1012  461  267  143  100  4022
U1  82  65  26  16  6  0  195
U2  112  83  38  22  14  7  276
U6  30  21  14  7  1  0  73
UnaL2  138  71  43  20  7  0  279
yybP-ykoY  127  64  33  18  12  8  262
∑  8976  4835  2405  1426  845  503  18990