2.2 Comparison to naïve approaches
A naïve way to determine the likely reading direction is to score an alignment and its reverse complement using RNAz, EvoFold, or another tool for recognizing structured RNAs. This approach was taken, e.g., in [1,2,4,5]. A manual inspection of the data, however, showed that it is problematic, particularly when RNAz scores are high for both reading directions. This is often the case for microRNA precursors, but also for many other small house-keeping ncRNAs.
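The naïve decision rule can be sketched in a few lines. This is only an illustration: the function name, the probabilities passed in, and the optional `min_margin` parameter are assumptions made for this example and are not part of RNAz, EvoFold, or RNAstrand. The caller is assumed to have already obtained one score per reading direction from the structure-detection tool.

```python
def naive_strand(p_forward: float, p_reverse: float, min_margin: float = 0.0) -> str:
    """Pick the reading direction with the larger score.

    p_forward / p_reverse: scores (e.g. class probabilities) obtained for the
    alignment and for its reverse complement. Hypothetical inputs.
    min_margin: illustrative addition, not part of the naive rule in the text;
    score differences at or below this value are reported as 'undecided'.
    """
    diff = p_forward - p_reverse
    if abs(diff) <= min_margin:
        return "undecided"
    return "forward" if diff > 0 else "reverse"


# Both directions score high (made-up values); the call rests on a tiny gap.
print(naive_strand(0.99, 0.98))
print(naive_strand(0.99, 0.98, min_margin=0.05))
```

With two high scores the rule still returns a direction, but the decision hinges on a very small difference, which is the situation described above for microRNA precursors.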
Table 1 gives the accuracy of RNAstrand compared to this simple approach, i.e., taking the strand with the larger RNAz probability. RNAstrand yields an improvement for all ncRNA types. The largest increases in classification accuracy are observed for miRNAs, RNase MRP, tRNAs, nuclear RNase P and IRES. Table 3 shows that RNAstrand classifies the reading direction correctly for the majority of test alignments. The misclassification rate of the naïve approach is about twice that of RNAstrand.
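The factor of about two can be checked directly from the pooled ("all") counts in Table 3. The following minimal sketch only re-tabulates those published counts; the variable names are chosen for this example.

```python
# Pooled ("all") counts from Table 3, keyed by
# (RNAstrand outcome, naive RNAz outcome).
counts = {
    ("correct", "correct"): 35816,
    ("correct", "incorrect"): 15100,
    ("incorrect", "correct"): 3246,
    ("incorrect", "incorrect"): 7678,
}

total = sum(counts.values())
# Errors of each method: sum over the cells where that method is "incorrect".
rnastrand_errors = sum(n for (rs, _), n in counts.items() if rs == "incorrect")
naive_errors = sum(n for (_, nv), n in counts.items() if nv == "incorrect")

rnastrand_error_rate = rnastrand_errors / total
naive_error_rate = naive_errors / total
ratio = naive_error_rate / rnastrand_error_rate

print(f"alignments (both directions): {total}")
print(f"RNAstrand misclassification:  {rnastrand_error_rate:.1%}")
print(f"naive RNAz misclassification: {naive_error_rate:.1%}")
print(f"ratio naive / RNAstrand:      {ratio:.2f}")
```

On these counts RNAstrand misclassifies 10924 of 61840 alignments (about 17.7%) and the naïve rule 22778 (about 36.8%), a ratio of roughly 2.1.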
Finally, we compared the prediction accuracy of RNAstrand with the strand prediction of EvoFold. Applying EvoFold to automatically created RNA alignments extracted from Rfam families is not straightforward, since EvoFold requires a meaningful phylogenetic tree (ideally estimated from neutrally evolving sites) as input. Such data are not available and cannot be generated easily for most combinations of Rfam sequences. The heuristic suggested in [2], namely to rescale a neighbor-joining tree generated from the input alignment, produced very poor classification results in most cases.
Table 3  Comparison of classification accuracies versus RNAz.

                         Naïve RNAz-based classification
RNAstrand                correct      incorrect
fwd    correct           17961        7579
       incorrect         1570         3810
rev    correct           17855        7521
       incorrect         1676         3868
all    correct           35816        15100
       incorrect         3246         7678

Strand prediction of RNAstrand compared to naïve prediction of RNAz. The first row of the table refers to alignments of known ncRNA loci given in the direction of the ncRNA. The second row belongs to the corresponding reverse complementary alignments. The last row summarizes the first and second row.

Hence, we use instead the subset of known ncRNAs among the 48479 EvoFold predictions in human assembly hg17 [2].
A blast search with E < 1e-10 against NonCode [20], Rfam [21], miRBase [22] and snoRNA-LBME-db [23] identified only 248 unique known ncRNA loci in human. (Note that tRNAs and most snRNAs are multi-copy genes and hence were deliberately excluded from the data in [2].) To compare the strand predictions of EvoFold with those of RNAstrand, the multiz8way alignments of the 202 loci that are completely covered by a blast hit were reconstructed. The majority (177) were identified as miRNA precursors, since most of the EvoFold predictions in ref. [2] are short conserved hairpins. The direction of the blast hit indirectly determines the strand of the known ncRNA when it is compared to the strand prediction of EvoFold. For 14 loci (13 miRNAs and 1 U6atac) the multiple alignments could not be reconstructed. The remaining 188 alignments were realigned, and all that did not meet the prerequisites of RNAstrand were discarded: 15 alignments were shorter than the minimum length for which RNAstrand was trained, 5 alignments had a mean pairwise identity smaller than 50%, and one alignment contained too many gaps. This leaves 167 alignments for which the strand prediction of RNAstrand is compared to the strand prediction of EvoFold. Alignments containing more than 6 sequences were reduced to 6 sequences by rnazWindow.pl, which optimizes the final alignment with respect to its mean pairwise identity.
The numbers in Table 4 show that the strand prediction of EvoFold is comparable to the strand prediction of RNAstrand on this relatively small test set, which is, however, dominated by microRNAs. We remark that EvoFold and RNAz are sensitive for ncRNAs of different base compositions and sequence similarities [3,24], so that neither of these programs can be (ab)used as a universal strand classifier.
Table 4  Comparison of classification accuracies versus EvoFold.

                         Naïve EvoFold-based classification
RNAstrand                correct           incorrect
fwd    correct           123 [111;12]      16 [15;1]
       incorrect         17 [17;0]         11 [8;3]
rev    correct           121 [109;12]      12 [11;1]
       incorrect         19 [19;0]         15 [12;3]
all    correct           244 [220;24]      28 [26;2]
       incorrect         36 [36;0]         26 [20;6]

Strand prediction of RNAstrand compared to naïve prediction of EvoFold. The first row of the table refers to alignments of known ncRNA loci given in the direction of the ncRNA. The second row belongs to the corresponding reverse complementary alignments. The last row summarizes the first and second row. First numbers in brackets give classifications of alignments containing miRNAs and second numbers belong to alignments containing other ncRNAs.
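Because the bracketed values separate miRNA alignments from all other ncRNAs, the pooled ("all") row of Table 4 can be split by class. The sketch below only re-tabulates the published counts; the names `table4` and `accuracies` are chosen for this example.

```python
# Pooled ("all") counts from Table 4 as (miRNA, other ncRNA) pairs,
# keyed by (RNAstrand outcome, naive EvoFold outcome).
table4 = {
    ("correct", "correct"): (220, 24),
    ("correct", "incorrect"): (26, 2),
    ("incorrect", "correct"): (36, 0),
    ("incorrect", "incorrect"): (20, 6),
}


def accuracies(column: int) -> tuple[int, float, float]:
    """Return (total, RNAstrand accuracy, EvoFold accuracy) for one class.

    column 0 selects miRNA alignments, column 1 all other ncRNAs.
    """
    total = sum(cell[column] for cell in table4.values())
    rnastrand_ok = sum(cell[column] for (rs, _), cell in table4.items() if rs == "correct")
    evofold_ok = sum(cell[column] for (_, ev), cell in table4.items() if ev == "correct")
    return total, rnastrand_ok / total, evofold_ok / total


for label, column in (("miRNA", 0), ("other ncRNA", 1)):
    total, acc_rnastrand, acc_evofold = accuracies(column)
    print(f"{label:12s} n={total:3d}  RNAstrand={acc_rnastrand:.1%}  EvoFold={acc_evofold:.1%}")
```

Of the 334 classified alignments (167 loci in both directions), 302 are miRNA alignments and only 32 belong to other ncRNAs, so the per-class figures for the latter rest on very few cases.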