2.1 Testing the classifier

Classification performance was evaluated using 30920 automatically generated ClustalW alignments of 313 of the 503 ncRNA families from RFAM (version 7.0). All sequences contained in the training alignments were excluded from the test set. For each family, at most 500 ClustalW alignments were randomly constructed for each number of sequences from 2 to 6, resulting in at most 2500 alignments per family. Since the alignments used to train the SVM are no longer than 400 nt, have a minimal pairwise sequence identity of 60% and contain at most six sequences, test alignments were created to meet the same criteria. For alignments which do not fall into those ranges, the probability estimates of the SVM need to be regarded with caution.

8 families had no alignments between 40 and 400 nt and were hence discarded from the test set. 67 families are not included because they consist of only one or two sequences. 2 families had no sampled alignments with a mean pairwise sequence identity larger than 60%. Lastly, the sampled alignments of 113 families were not recognized as ncRNA by RNAz in at least one reading direction and were also discarded from the test data set. Together these exclusions account for 190 families, which leaves the 313 families of the test set. A list of families excluded from the test data can be found in the supplementary material (see Additional file 1).

All alignments in the test set were used as positive test cases and their realigned reverse complements as negative test cases. Table 1 lists the classification rates for different threshold values c, i.e., classifying the RNA as "plus strand" for D > c and as "minus strand" for D < -c, while -c ≤ D ≤ c is interpreted as "undecided". We observe only a negligible loss of accuracy when c is increased from 0 to 0.9. The distribution of D (see Additional file 1) demonstrates that the majority of alignments are classified correctly with high probability. However, RNAstrand fails to predict the correct reading direction of 53 families (e.g. 7SK).
The predicted secondary structure of the reverse complementary alignment is much more stable for these examples than that of the ncRNA itself (see Additional file 1). On the other hand, RNAstrand is able to reliably capture the reading direction of most ncRNAs for which no representative was given in the training set, including RNase MRP, IRES, SECIS and 5.8S rRNA, which makes it suitable for predicting the reading direction of novel ncRNA families.

Table 1 Evaluation of RNAstrand.

Alignments classified as structured RNA by RNAz

| ncRNA type | Na | Nc | A (c = 0) | A+ (c = 0) | A- (c = 0) | A (c = 0.5) | 1-A-u (c = 0.5) | u (c = 0.5) | A (c = 0.9) | 1-A-u (c = 0.9) | u (c = 0.9) | A(RNAz) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5S rRNA | 413 | 1 | 0.990 | 0.993 | 0.988 | 0.978 | 0.006 | 0.016 | 0.958 | 0.000 | 0.042 | 0.973 |
| 5.8S rRNA | 146 | 1 | 0.932 | 0.932 | 0.932 | 0.894 | 0.055 | 0.051 | 0.733 | 0.024 | 0.243 | 0.904 |
| tRNA | 286 | 1 | 0.948 | 0.948 | 0.948 | 0.886 | 0.017 | 0.096 | 0.621 | 0.009 | 0.371 | 0.535 |
| miRNA | 1875 | 43 | 0.981 [0.241] | 0.979 [0.246] | 0.982 [0.238] | 0.965 [0.261] | 0.009 [0.171] | 0.026 [0.147] | 0.906 [0.373] | 0.001 [0.003] | 0.094 [0.372] | 0.187 [0.376] |
| snoRNA (C/D) | 946 | 71 | 0.780 [0.376] | 0.785 [0.374] | 0.775 [0.389] | 0.732 [0.411] | 0.190 [0.363] | 0.078 [0.256] | 0.618 [0.431] | 0.147 [0.286] | 0.235 [0.416] | 0.654 [0.446] |
| snoRNA (H/ACA) | 3066 | 53 | 0.909 [0.198] | 0.908 [0.198] | 0.909 [0.199] | 0.882 [0.255] | 0.062 [0.160] | 0.056 [0.184] | 0.823 [0.352] | 0.021 [0.039] | 0.156 [0.339] | 0.899 [0.283] |
| spliceos. RNA | 896 | 6 | 0.877 [0.252] | 0.885 [0.251] | 0.868 [0.254] | 0.831 [0.327] | 0.086 [0.212] | 0.083 [0.118] | 0.735 [0.322] | 0.042 [0.125] | 0.222 [0.202] | 0.835 [0.257] |
| euk. SRP RNA | 891 | 1 | 0.997 | 0.998 | 0.996 | 0.992 | 0.001 | 0.007 | 0.972 | 0.000 | 0.028 | 0.841 |
| nucl. RNaseP | 31 | 1 | 0.694 | 0.710 | 0.677 | 0.613 | 0.274 | 0.113 | 0.387 | 0.081 | 0.532 | 0.290 |
| RNase MRP | 140 | 1 | 0.989 | 0.986 | 0.993 | 0.982 | 0.000 | 0.018 | 0.961 | 0.000 | 0.039 | 0.500 |
| IRES | 170 | 8 | 0.715 [0.453] | 0.718 [0.455] | 0.712 [0.452] | 0.647 [0.469] | 0.200 [0.424] | 0.153 [0.339] | 0.597 [0.448] | 0.106 [0.433] | 0.297 [0.402] | 0.318 [0.424] |
| SECIS | 76 | 1 | 0.651 | 0.658 | 0.645 | 0.520 | 0.257 | 0.224 | 0.329 | 0.191 | 0.480 | 0.487 |
| 7SK | 184 | 1 | 0.041 | 0.043 | 0.038 | 0.024 | 0.916 | 0.060 | 0.011 | 0.802 | 0.188 | 0.038 |

Alignments not classified as structured RNA by RNAz

| ncRNA type | Na | Nc | A (c = 0) | A+ (c = 0) | A- (c = 0) | A (c = 0.5) | 1-A-u (c = 0.5) | u (c = 0.5) | A (c = 0.9) | 1-A-u (c = 0.9) | u (c = 0.9) | A(RNAz) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5S rRNA | 525 | 1 | 0.793 | 0.821 | 0.766 | 0.717 | 0.130 | 0.153 | 0.552 | 0.057 | 0.390 | - |
| 5.8S rRNA | 1000 | 1 | 0.853 | 0.892 | 0.814 | 0.771 | 0.092 | 0.137 | 0.602 | 0.032 | 0.366 | - |
| tRNA | 1 | 1 | 1/1 | 1/1 | 1/1 | 1/1 | 0/1 | 0/1 | 1/1 | 0/1 | 0/1 | - |
| miRNA | 0 | - | - | - | - | - | - | - | - | - | - | - |
| snoRNA (C/D) | 4228 | 105 | 0.563 [0.397] | 0.595 [0.399] | 0.532 [0.414] | 0.480 [0.420] | 0.353 [0.363] | 0.167 [0.236] | 0.340 [0.394] | 0.245 [0.316] | 0.415 [0.364] | - |
| snoRNA (H/ACA) | 1993 | 36 | 0.788 [0.251] | 0.812 [0.244] | 0.763 [0.291] | 0.735 [0.314] | 0.157 [0.203] | 0.108 [0.233] | 0.644 [0.370] | 0.081 [0.169] | 0.274 [0.339] | - |
| spliceos. RNA | 2944 | 4 | 0.632 [0.287] | 0.669 [0.287] | 0.595 [0.289] | 0.560 [0.314] | 0.301 [0.261] | 0.139 [0.071] | 0.422 [0.338] | 0.203 [0.200] | 0.375 [0.180] | - |
| euk. SRP RNA | 3 | 1 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 | 3/3 | 0/3 | 0/3 | - |
| nucl. RNaseP | 2 | 1 | 2/2 | 2/2 | 2/2 | 2/2 | 0/2 | 0/2 | 1/2 | 0/2 | 1/2 | - |
| RNase MRP | 0 | - | - | - | - | - | - | - | - | - | - | - |
| IRES | 265 | 13 | 0.506 [0.454] | 0.521 [0.454] | 0.491 [0.454] | 0.468 [0.411] | 0.457 [0.450] | 0.075 [0.276] | 0.436 [0.401] | 0.353 [0.411] | 0.211 [0.418] | - |
| SECIS | 43 | 1 | 0.686 | 0.698 | 0.674 | 0.593 | 0.174 | 0.233 | 0.302 | 0.070 | 0.628 | - |
| 7SK | 630 | 1 | 0.127 | 0.152 | 0.102 | 0.063 | 0.798 | 0.139 | 0.018 | 0.640 | 0.342 | - |

Na: number of alignments in the test set, Nc: number of different RNA classes, A: accuracy, defined as the fraction of correctly classified input alignments, A+: accuracy for alignments in the reading direction of the ncRNA, A-: accuracy for reverse complementary alignments, u: fraction of undecided alignments, 1 - A - u: fraction of misclassified alignments, A(RNAz): fraction of alignments correctly classified by taking the strand with the largest RNAz probability as the strand of the ncRNA. Standard deviations for RNA families with alignments from different classes are given in brackets. Note that for c = 0 no undecided alignments are observed.

To evaluate the performance of RNAstrand on alignments which have not been identified as structured RNA by RNAz, we constructed a second test set consisting only of alignments not classified as structured RNA by RNAz in either reading direction. This resulted in 207 families meeting the criteria described in the first paragraph of this section. The corresponding distributions are shown in the supplementary material (see Additional file 1). For those alignments a dramatic decrease of structure stability and conservation is observed, which leads to smaller descriptor values (see Additional file 1). Hence, the classification performance is worse than for RNAz-positive alignments (Table 1). However, for the majority of alignments the correct reading direction was inferred.

Performance measures depending on the number of sequences in the input alignment, the alignment length and the mean pairwise identity of the sequences are given in Table 2.
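The rates reported in Tables 1 and 2 follow directly from the decision value D and the threshold c. The following is a minimal sketch, not the authors' evaluation code: the function names, the record layout `(group, true_strand, D)` and the toy values are assumptions made for illustration only. It applies the three-way rule described above and then computes A, A+, A-, u and the misclassified fraction 1 - A - u for each group of alignments (for example, each number of sequences).

```python
from collections import defaultdict


def call_strand(d, c=0.0):
    """Three-way rule: "plus" for D > c, "minus" for D < -c, else "undecided"."""
    if d > c:
        return "plus"
    if d < -c:
        return "minus"
    return "undecided"


def rates_by_group(records, c=0.0):
    """Summarise (group, true_strand, D) records per group.

    true_strand is "plus" for an alignment given in the reading direction of
    the ncRNA and "minus" for its reverse complement (hypothetical encoding).
    """
    groups = defaultdict(list)
    for group, true_strand, d in records:
        groups[group].append((true_strand, call_strand(d, c)))

    def fraction(pairs, predicate):
        # None signals an empty subset instead of dividing by zero.
        return sum(1 for p in pairs if predicate(p)) / len(pairs) if pairs else None

    summary = {}
    for group, pairs in groups.items():
        plus = [p for p in pairs if p[0] == "plus"]
        minus = [p for p in pairs if p[0] == "minus"]
        summary[group] = {
            "N": len(pairs),
            "A": fraction(pairs, lambda p: p[1] == p[0]),
            "A_plus": fraction(plus, lambda p: p[1] == "plus"),
            "A_minus": fraction(minus, lambda p: p[1] == "minus"),
            "u": fraction(pairs, lambda p: p[1] == "undecided"),
            "misclassified": fraction(
                pairs, lambda p: p[1] not in (p[0], "undecided")
            ),
        }
    return summary


# Toy decision values (invented for illustration, not taken from the paper).
toy = [
    ("NS = 2", "plus", 0.8), ("NS = 2", "plus", -0.3),
    ("NS = 2", "minus", -0.9), ("NS = 2", "minus", 0.1),
    ("NS = 3", "plus", 0.2), ("NS = 3", "minus", -0.6),
]
print(rates_by_group(toy, c=0.5))
```

Counting the misclassified alignments directly, rather than computing 1 - A - u, avoids floating-point rounding in the subtraction; the two quantities are identical by definition.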
The number of sequences in an alignment does not influence prediction performance significantly. However, the more conserved the sequences are, the better the overall classification accuracy. The fraction of correctly classified alignments is also very high for long alignments (301 to 400 nt). For alignments of 100 to 200 nt length the accuracy is biased towards miRNAs, which are well classified by RNAstrand.

Table 2 Accuracies depending on different alignment features.

| alignment feature | N | A (c = 0) | A+ (c = 0) | A- (c = 0) |
|---|---|---|---|---|
| NS = 2 | 4487 | 0.824 | 0.829 | 0.819 |
| NS = 3 | 5311 | 0.833 | 0.830 | 0.837 |
| NS = 4 | 6388 | 0.828 | 0.830 | 0.827 |
| NS = 5 | 7234 | 0.797 | 0.805 | 0.789 |
| NS = 6 | 7500 | 0.832 | 0.835 | 0.829 |
| 50 ≤ sequence identity < 70 | 13187 | 0.799 | 0.799 | 0.799 |
| 70 ≤ sequence identity < 80 | 12152 | 0.827 | 0.832 | 0.823 |
| 80 ≤ sequence identity < 90 | 5550 | 0.865 | 0.871 | 0.859 |
| 90 ≤ sequence identity < 100 | 31 | 0.903 | 0.871 | 0.935 |
| 40 ≤ length ≤ 100 | 11191 | 0.768 | 0.773 | 0.763 |
| 101 ≤ length ≤ 200 | 14180 | 0.853 | 0.856 | 0.851 |
| 201 ≤ length ≤ 300 | 1697 | 0.637 | 0.641 | 0.634 |
| 301 ≤ length ≤ 400 | 3852 | 0.945 | 0.945 | 0.945 |
| all alignments | 30920 | 0.822 | 0.825 | 0.819 |

Performance of RNAstrand depending on various alignment features, i.e. number of sequences (NS), sequence identity and alignment length. N: number of alignments in the test sets, A: accuracy, defined as the fraction of correctly classified input alignments, A+: accuracy for alignments in the reading direction of the ncRNA, A-: accuracy for reverse complementary alignments.

The results highlight that our classification task has an intrinsic symmetry: the fraction of correctly classified alignments for the "plus strand" of an ncRNA should be similar to the accuracy for the "minus strand". However, we observe a small but noticeable bias towards predicting that the ncRNA lies in the same reading direction as the input alignment (Table 1). The SVM model was trained with different alignments in the positive and negative training sets, which results in an asymmetric model.
If the same alignments, but in different directions, were taken for training, the SVM model would be exactly symmetric. However, the training data should be independent between the classes, hence we refrained from enforcing this exact symmetry to avoid potential overtraining artifacts. Another possibility to avoid asymmetry would be to take the averaged SVM decision values of both reading directions as the final decision, but this has an unknown effect on the probability estimates.

The distribution of decision values of the SVM is shown in Fig. 3. The majority of alignments were classified correctly. Most of them have large absolute decision values, indicating that they belong to the corresponding class with high probability. If RNAstrand is applied to shuffled alignments, the decision values are more concentrated around 0, but most of them are still classified correctly. To explain this observation we checked which combination of descriptors performs best on shuffled alignments. We trained an SVM model for each possible descriptor combination and calculated the true and false positive rates at different decision levels using plotroc.py of the libsvm 2.8 package [18]. The corresponding ROC curves are given in Fig. 4 and indicate that, except for Δmeanmfe, all descriptors classify shuffled alignments randomly. Individual shuffled sequences, presumably by virtue of their base composition (see Additional file 1), still contain information on the reading direction of the structured RNA, which is captured by Δmeanmfe. This observation implies that RNAstrand must not be used for alignments that do not contain structured RNAs. In other words, RNAstrand cannot be used to infer an ncRNA on the grounds that it returned a preferred reading direction for a non-structured input alignment. Because of this bias, we could also have removed Δmeanmfe from the set of descriptors. However, due to its high sensitivity (Fig.
1) it seems preferable to keep it as a descriptor, in particular since RNAstrand is designed to operate on structured RNAs only.

The best cutoff c can be found by plotting false positive rates versus true positive rates at different c (Fig. 5). If Youden's index Y, i.e., true positive rate minus false positive rate, is maximal, then the classification accuracy cannot be further improved by taking a larger cutoff [19]. We observe Ymax ≈ 0.644 for c ≤ 0.15. Hence, a further increase of c leads to a worse ratio of correctly to falsely classified alignments. However, a large value of c ensures that the predicted reading direction is with high probability the correct reading direction, see Table 1 and the r.h.s. of Fig. 5.

Figure 3 Histogram of SVM decision values. Distribution of SVM decision values of RNAz-positive alignments. The upper histogram shows all alignments of the test set, whereas the lower one shows the distribution of the decision values for shuffled alignments. Columns of the test alignments were randomly permuted to create shuffled alignments. Red dotted bins denote alignments where the ncRNA has the same reading direction as the alignment. Black bins belong to alignments where the ncRNA is contained in the reverse complement. Note that the shuffling procedure does not completely destroy the direction information.

Figure 4 Receiver operating characteristic of all descriptor combinations for shuffled alignments. Columns of test alignments were randomly permuted to create shuffled alignments. The corresponding AUC is given in brackets. ROC curves were computed by training an SVM model for each descriptor combination and testing the model on shuffled alignments using plotroc.py of the libsvm 2.8 package [18]. Training was done with the original training set for RNAstrand. The SVM parameters and kernel were not changed, i.e.
a radial basis function kernel with parameters C = 128 and γ = 0.5 was used.

Figure 5 Receiver operating characteristic of test alignments. False positive rates of RNAz-positive test alignments versus true positive rates at different cutoff levels c. The left plot depicts the rates when undecided alignments are included in the calculation, i.e. the true positive rate is defined as $\frac{tp}{tp + fn + u}$, where tp denotes the number of alignments which have been correctly classified as containing the ncRNA in the same reading direction as the input alignment, fn is the number of alignments which have been falsely classified as containing the ncRNA on the reverse complement, and u is the number of alignments which contain the ncRNA in the same reading direction but for which RNAstrand was not able to predict a reading direction. The false positive rate is defined analogously. The right-hand plot discards undecided alignments. Hence, the true positive rate is defined as $\frac{tp}{tp + fn}$ and the false positive rate as $\frac{fp}{fp + tn}$.
The curves for both SVM decision classes are given. Red curves denote alignments containing the ncRNA in the reading direction of the input alignment. Black curves belong to alignments which contain the ncRNA on the reverse complementary strand. The values of c range from 0 to 0.95 in steps of 0.05.
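The rate definitions of Figure 5 and the selection of the cutoff by Youden's index can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the paper: the function names and the input layout `(true_strand, D)` are hypothetical, only the curve for the "plus" class is computed, and the undecided count entering the false positive rate is assumed to be the number of undecided reverse complementary alignments (the caption only says the rate is defined analogously).

```python
def roc_point(decisions, c, include_undecided=True):
    """Return (TPR, FPR) of the "plus" class at cutoff c.

    decisions: iterable of (true_strand, D), true_strand in {"plus", "minus"}.
    With include_undecided=True:  TPR = tp / (tp + fn + u)
    With include_undecided=False: TPR = tp / (tp + fn), FPR = fp / (fp + tn)
    A rate is None when its denominator is zero.
    """
    tp = fn = u_pos = fp = tn = u_neg = 0
    for strand, d in decisions:
        if strand == "plus":
            if d > c:
                tp += 1
            elif d < -c:
                fn += 1
            else:
                u_pos += 1
        else:
            if d > c:
                fp += 1
            elif d < -c:
                tn += 1
            else:
                u_neg += 1
    pos = tp + fn + (u_pos if include_undecided else 0)
    neg = fp + tn + (u_neg if include_undecided else 0)
    tpr = tp / pos if pos else None
    fpr = fp / neg if neg else None
    return tpr, fpr


def best_cutoff(decisions, include_undecided=True):
    """Sweep c = 0.00, 0.05, ..., 0.95 and return (Y_max, c) for Youden's index.

    Ties are resolved in favour of the smallest cutoff.
    """
    best = None
    for i in range(20):
        c = i / 20
        tpr, fpr = roc_point(decisions, c, include_undecided)
        if tpr is None or fpr is None:
            continue  # rate undefined at this cutoff
        y = tpr - fpr
        if best is None or y > best[0]:
            best = (y, c)
    return best


# Toy decision values (invented for illustration, not taken from the paper).
toy = [
    ("plus", 0.9), ("plus", 0.4), ("plus", -0.2),
    ("minus", -0.8), ("minus", -0.3), ("minus", 0.1),
]
print(roc_point(toy, 0.5, include_undecided=True))
print(roc_point(toy, 0.5, include_undecided=False))
print(best_cutoff(toy))
```

Comparing the two calls to `roc_point` at the same cutoff shows why the right-hand plot looks more favourable at large c: discarding undecided alignments shrinks the denominators, so the remaining confident calls dominate the rates.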