1.1 Selection of descriptors
Small differences in the measured folding energies between an RNA molecule and its reverse complement are captured by corresponding small asymmetries in the standard energy model used by thermodynamic folding algorithms [9,10]. These differences distinguish the two reading directions even in the absence of GU pairs. In addition, GU pairs have an asymmetric effect in multiple sequence alignments: suppose a particular pair of alignment columns exhibits a GC → GU substitution in one reading direction; this preserves base pairing and hence is consistent with a conserved structure. The reverse complement of the same alignment, however, displays a GC → AC substitution, which is inconsistent with a conserved base pair. The patterns of structure conservation, and hence the consensus structure and its associated average folding energy, as computed by the RNAalifold algorithm [11], thus differ between the reading directions. In contrast, compensatory mutations, such as GC → AU, do not provide strand-specific information.
The effects of both the asymmetries of the energy rules and of the GU base pairs are conveniently captured in terms of thermodynamic quantities, more precisely, in terms of the folding energy of the consensus structure and the individual folding energies of a set of aligned RNAs. These parameters can be computed much more reliably than quantities that have to be derived from predicted base pairs, because structure prediction algorithms have limited accuracy on individual sequences [12]. We avoid the use of sequence motifs (e.g. [13]), since this carries the risk that the SVM becomes biased towards the ncRNA families in the training set and fails to distinguish plus and minus strands of other structured ncRNAs.
Here we use:
1) Average of the folding energies of the individual sequences contained in the alignment, computed by the minimum energy folding program RNAfold of the Vienna RNA Package, version 1.6 [14] (meanmfe).
2) Mean of the energy z-scores of the individual sequences contained in the alignment (meanz). The z-score is defined as $z = (E - \bar{E})/\sigma$, where $\bar{E}$ and $\sigma$ are the mean and standard deviation of the folding energy distribution of shuffled (permuted) sequences. We use here the same SVM-regression procedure as RNAz [15] to estimate the z-scores from the sequence composition to avoid the time-consuming sampling of shuffled alignments.
3) Folding energy of the consensus secondary structure of the alignment computed by RNAalifold (consmfe). The parameter is defined as the optimal average of the folding energies that can be achieved when all aligned sequences simultaneously fold into the same structure.
4) Structure conservation index (sci), which is defined as the ratio of the consensus folding energy and the average of the folding energies of the individual sequences, i.e. sci = consmfe/meanmfe [15]. An sci close to 1 indicates perfect structure conservation, while alignments without structural conservation yield values close to 0. A more detailed discussion of the sci can be found in [16] in the context of RNA alignment.
The first two descriptors assess the thermodynamic stability of the folds, while the last two evaluate structural conservation.
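The definitions of the four descriptors can be illustrated with a minimal Python sketch. It assumes that the folding energies have already been computed elsewhere (for example by RNAfold and RNAalifold). The function names are hypothetical, and the z-score is computed here from an explicit list of shuffled-sequence energies using the population standard deviation, which is an assumption made for illustration; the procedure described above estimates z-scores by SVM regression instead.

```python
from statistics import mean, pstdev


def mean_mfe(individual_energies):
    """meanmfe: average folding energy of the aligned sequences."""
    return mean(individual_energies)


def z_score(energy, shuffled_energies):
    """z = (E - mean) / sd over the energies of shuffled sequences.

    Assumption: population standard deviation; the text does not
    specify which estimator is used.
    """
    sigma = pstdev(shuffled_energies)
    if sigma == 0:
        raise ValueError("shuffled energies have zero variance")
    return (energy - mean(shuffled_energies)) / sigma


def structure_conservation_index(consmfe, meanmfe):
    """sci = consmfe / meanmfe."""
    if meanmfe == 0:
        raise ValueError("meanmfe is zero; sci is undefined")
    return consmfe / meanmfe


# Illustrative, made-up energies in kcal/mol.
energies = [-32.0, -28.0, -30.0]
meanmfe = mean_mfe(energies)                        # -30.0
sci = structure_conservation_index(-24.0, meanmfe)  # 0.8
```

With these made-up numbers the consensus structure retains 80% of the average individual folding energy, which would indicate fairly high structural conservation.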
The reading direction of a structured ncRNA can be identified by evaluating the differences of the above descriptors between both strands. To be precise, the difference Δx of descriptor x is defined as Δx = x+ - x-, where x+ denotes the value of x in reading direction of the input alignment and x- the value of x in the reverse complementary alignment. Hence, Δmeanmfe and Δmeanz capture the energetic differences between both strands, while Δconsmfe and Δsci describe the differences in structure conservation.
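As a minimal sketch of the definition Δx = x+ - x-, the following Python snippet computes the four differences from the descriptor values of both reading directions. The container class, the function name and the numbers are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandDescriptors:
    """Descriptor values of one reading direction (hypothetical container)."""
    meanmfe: float
    meanz: float
    consmfe: float
    sci: float


def strand_differences(plus, minus):
    """Return delta_x = x(plus) - x(minus) for each of the four descriptors."""
    return {
        "delta_meanmfe": plus.meanmfe - minus.meanmfe,
        "delta_meanz": plus.meanz - minus.meanz,
        "delta_consmfe": plus.consmfe - minus.consmfe,
        "delta_sci": plus.sci - minus.sci,
    }


# Made-up values: the input direction folds more stably and is better conserved.
plus = StrandDescriptors(meanmfe=-30.0, meanz=-4.0, consmfe=-27.0, sci=0.9)
minus = StrandDescriptors(meanmfe=-25.0, meanz=-2.5, consmfe=-10.0, sci=0.4)
deltas = strand_differences(plus, minus)
```

In this example all energy differences are negative and Δsci is positive, i.e. the input reading direction is both more stable and structurally better conserved than its reverse complement.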
The true positive rate as a function of the false positive rate (ROC curve) for each combination of descriptors is shown in Fig. 1. It reveals which combination of descriptors achieves the best classification of the alignments. The ROC curves can be compared by the area under the curve (AUC), which measures how closely the ROC curve resembles a step function. The more steeply the true positive rate rises to its maximum value, and the wider the range of false positive rates over which it stays there, the better the input alignments can be separated. The best AUC of 99% is achieved when all four descriptors are used.
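How an ROC curve and its AUC follow from classifier decision values can be sketched as below. This is a generic illustration, not the plotroc.py script of libsvm: the function names are hypothetical, labels are assumed to be +1 and -1, and the area is computed with the trapezoidal rule.

```python
def roc_points(decision_values, labels):
    """Return (false positive rate, true positive rate) points.

    The threshold is swept from the highest to the lowest decision
    value; tied values are added to the curve in a single step.
    """
    pairs = sorted(zip(decision_values, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y > 0)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] > 0:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Perfectly separated toy data give a step function with AUC 1.0.
perfect = auc(roc_points([2.0, 1.0, -1.0, -2.0], [1, 1, -1, -1]))
```

A classifier that ranks one negative example above one positive example, as in `roc_points([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1])`, yields a curve that deviates from the step function and a correspondingly smaller area.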
Figure 1  Receiver operating characteristic (ROC) for all descriptor combinations. The corresponding AUC is given in brackets. ROC curves were computed by a 5-fold cross-validation on the training data set using plotroc.py of the libsvm 2.8 package [18] after an optimal SVM parameter set was chosen by grid.py. True positive and false positive rates are calculated by interpreting the SVM decision values. The prediction accuracies plotted here are higher than the accuracies in Table 1: although cross-validation ensures that training and testing are done on different alignments, some sequences may occur in both training and test alignments. In contrast, the accuracies in Table 1 are based on test alignments that do not contain any sequence that is part of a training alignment.
Note that although sci = consmfe/meanmfe, i.e., these three quantities are not independent, this is not the case for their differences: Δsci cannot be computed from Δconsmfe and Δmeanmfe. Furthermore, for alignments where the structural conservation is very high in both reading directions, the strand of the ncRNA cannot be inferred from Δsci alone, but the difference in consensus structure stability, which is measured by Δconsmfe, may still predict the strand correctly.
The same holds for Δmeanz and Δmeanmfe. Both measure the folding energy differences of the individual sequences, but they neither capture identical features of the input alignment nor can they be transformed into each other. The mean z-score compares the average stability of the individual sequences to a random control set, whereas the mean of the minimum free energies of the individual sequences reports the actually observed minimum free energies. The difference in z-scores describes the relative loss of stability compared to a random control set; it indicates that the input alignment changes from very stable to unstable between the two strands. The difference in minimum free energy, on the other hand, resolves small changes in energy, which is needed to find the correct reading direction of the ncRNA when both reading directions result in very stable structures. Examples are miRNAs, which are very stable on both strands but are nevertheless successfully classified by RNAstrand. Hence, all four descriptors carry different information.
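That Δsci is not determined by Δconsmfe and Δmeanmfe can be checked with a small numerical counterexample. The Python sketch below uses made-up energies for two hypothetical alignments that share the same Δconsmfe and Δmeanmfe but differ in Δsci.

```python
def sci(consmfe, meanmfe):
    """Structure conservation index, sci = consmfe / meanmfe."""
    return consmfe / meanmfe


def deltas(consmfe_plus, meanmfe_plus, consmfe_minus, meanmfe_minus):
    """Return (delta_consmfe, delta_meanmfe, delta_sci)."""
    return (
        consmfe_plus - consmfe_minus,
        meanmfe_plus - meanmfe_minus,
        sci(consmfe_plus, meanmfe_plus) - sci(consmfe_minus, meanmfe_minus),
    )


# Two made-up alignments (energies in kcal/mol).
case_a = deltas(-20.0, -25.0, -10.0, -20.0)
case_b = deltas(-40.0, -50.0, -30.0, -45.0)

# Same energy differences (-10 and -5) in both cases ...
same_energy_deltas = case_a[:2] == case_b[:2]
# ... but delta_sci is about 0.30 in case A and about 0.13 in case B.
different_sci_delta = abs(case_a[2] - case_b[2]) > 0.1
```

Because the sci is a ratio, its difference depends on the absolute energies in each reading direction and not only on the energy differences.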
The significance of differences in folding energies depends on the number of sequences in the input alignment, denoted by n, and on sequence variation. The latter is conveniently quantified as the average pairwise sequence identity H of both reading directions.
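One way to compute an average pairwise sequence identity for a single alignment is sketched below. The exact definition used for H is not spelled out here, so the treatment of gaps (columns in which both sequences have a gap are ignored, a gap opposite a nucleotide counts as a mismatch) and the function names are assumptions made for illustration; averaging over both reading directions is not shown.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical columns between two aligned sequences.

    Assumption: columns where both sequences have a gap are skipped;
    a gap opposite a nucleotide counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average identity over all pairs of sequences in the alignment."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment with n = 3 sequences.
h = mean_pairwise_identity(["GGCA", "GGCU", "GACU"])
```

For this toy alignment the three pairwise identities are 0.75, 0.5 and 0.75, so the mean is 2/3.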
The strongest strand information comes from GU base pairs which are unpaired in the reverse complementary alignment. Hence, the relevance of differences depends also on the overall number of GU base pairs in the consensus structure. Therefore, we introduce
\[
\lambda_{GU} = \left( \frac{n_{GU}^{+}}{n_{all}^{+}} + \frac{n_{GU}^{-}}{n_{all}^{-}} \right) \times 100,
\]
as the last descriptor. $n_{GU}^{+}$ ($n_{GU}^{-}$) denotes the number of GU base pairs in the consensus secondary structure of the input alignment in its given reading direction (of the reverse complement of the input alignment), and $n_{all}^{+}$ and $n_{all}^{-}$ are the numbers of all base pairs in the consensus structure of the corresponding reading direction. Fig. 2 shows that, when only Δmeanmfe, Δmeanz, Δconsmfe and Δsci are evaluated, alignments in the reading direction of a tRNA cannot be separated from their reverse complementary alignments as easily as alignments containing U70 snoRNAs. The majority of tRNAs have around 0–5% GU base pairs in their consensus secondary structure. (The percentage of GU pairs is roughly λGU/2.) In contrast, the majority of U70 snoRNAs have 10% to 20% GU base pairs in their consensus structure. λGU allows the SVM to find suitable classification values depending on the fraction of GU base pairs. Therefore, U70 snoRNAs as well as tRNAs are classified correctly with high accuracies (U70: 1.0, tRNA: 0.94).
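The GU descriptor can be evaluated with a few lines of Python. The sketch below takes the base pair counts of both consensus structures as given; the function name is hypothetical, and raising an error for a consensus structure without any base pairs is an assumption, since that case is not discussed here.

```python
def lambda_gu(n_gu_plus, n_all_plus, n_gu_minus, n_all_minus):
    """Sum of the GU base pair fractions of both reading directions, times 100."""
    if n_all_plus <= 0 or n_all_minus <= 0:
        # Assumption: treat an empty consensus structure as an error.
        raise ValueError("consensus structure without base pairs")
    return (n_gu_plus / n_all_plus + n_gu_minus / n_all_minus) * 100


# Made-up counts: 3 of 20 pairs are GU in the input direction,
# 2 of 20 pairs in the reverse complement.
value = lambda_gu(3, 20, 2, 20)  # about 25
gu_percentage = value / 2        # about 12.5% GU pairs on average
```

Dividing by two gives the average GU percentage over both reading directions, which is the sense in which the percentage of GU pairs is roughly λGU/2.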
Figure 2  GU base pair dependency. Scatter plots depicting separability between both strands depending on GU base pair content (histograms). Red data points denote alignments in the reading direction of the ncRNA, while black data points belong to their realigned reverse complements. Alignments of tRNAs and U70 snoRNAs differ significantly neither in their number of sequences nor in their mean pairwise identity (see Additional file 1). That alignments in the reading direction of U70 snoRNAs are better separated from their reverse complements than alignments containing tRNAs is due to the high content of GU base pairs in the secondary structure of U70 snoRNAs.
We consider the GU base pair fraction of the consensus structure rather than that of the predicted structures of the single sequences, because the structure prediction of RNAalifold is based on the evolutionary information of a set of sequences and hence produces a fold more similar to the real structure than RNAfold is able to predict from a single sequence. We did not introduce the difference of GU base pairs as a descriptor, because the error rate of such a descriptor depends largely on the correctness of the predicted secondary structure. Small errors in structure prediction have a large impact on the difference of GU base pairs. In contrast, the differences in structure stability and conservation take all base pairs into account and hence depend only very weakly on the correctness of individual base pairs.
In summary, the SVM classification is based on seven descriptors, of which four, Δmeanmfe, Δmeanz, Δconsmfe and Δsci, directly measure differences between the reading directions, while the remaining three, n, H, and λGU, provide information on the structure of the input alignment that allows the SVM to interpret the significance of strand differences.
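As an illustration of how the seven descriptors might be passed to an SVM, the following Python sketch writes one alignment as a line in the sparse text format read by libsvm (`<label> <index>:<value> ...`). The feature order, the function name and the example values are assumptions; any scaling of the features before training is not shown.

```python
# Assumed feature order; the order is arbitrary but must be the same
# for every training and test example.
FEATURE_NAMES = (
    "delta_meanmfe", "delta_meanz", "delta_consmfe", "delta_sci",
    "n", "H", "lambda_gu",
)


def libsvm_line(label, features):
    """Format one example as '<label> 1:<v1> 2:<v2> ...'.

    label is +1 or -1 (here: input direction vs. reverse complement).
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    if len(features) != len(FEATURE_NAMES):
        raise ValueError("expected %d features" % len(FEATURE_NAMES))
    fields = ["%d:%g" % (i, v) for i, v in enumerate(features, start=1)]
    return " ".join(["%+d" % label] + fields)


# Made-up descriptor values for one alignment.
line = libsvm_line(1, [-5.0, -1.5, -17.0, 0.5, 6, 0.75, 25.0])
# line == "+1 1:-5 2:-1.5 3:-17 4:0.5 5:6 6:0.75 7:25"
```

Lines of this form can be concatenated into a training file, with the reverse complementary alignment of each example written as a second line with the opposite label.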