A comprehensive comparison of comparative RNA structure prediction approaches Abstract Background An increasing number of researchers have released novel RNA structure analysis and prediction algorithms for comparative approaches to structure prediction. Yet, independent benchmarking of these algorithms is rarely performed as is now common practice for protein-folding, gene-finding and multiple-sequence-alignment algorithms. Results Here we evaluate a number of RNA folding algorithms using reliable RNA data-sets and compare their relative performance. Conclusions We conclude that comparative data can enhance structure prediction but structure-prediction-algorithms vary widely in terms of both sensitivity and selectivity across different lengths and homologies. Furthermore, we outline some directions for future research. Background Motivation RNA, once considered a passive carrier of genetic information, is now known to play a more active role in nature. Many recently discovered RNAs are catalytic, for example RNase P which is involved in tRNA maturation and the self-splicing introns involved in mRNA maturation [1]. In addition, there is evidence that RNA based organisms were an essential step in the evolution of modern DNA-protein based organisms [2,3]. The number of non-coding RNAs (ncRNA) in humans remains a mystery, but progress in this direction suggests the number of ncRNAs produced is comparable to the number of proteins [4-6]. Surprisingly, the number of protein coding genes does not correlate with our concept of "organism complexity", hence it has been hypothesised that control of gene expression via a combination of alternative splicing and non-coding RNAs are responsible for this, implying that the "Central Dogma" (RNA is transcribed from DNA and translated into protein) at least in higher eukaryotes is woefully inadequate [7,8]. A fundamental tenet of biology is that a stable tertiary structure is essential for biological function. In the case of RNA the secondary structure (the base-pair set for an RNA molecule) provides a scaffold for the tertiary structure [9,10]. Yet, the experimental determination of RNA structure remains difficult [11]; Researchers increasingly turn to computational methods. To date the most popular structure prediction algorithm is the Minimum Free Energy (MFE) method for folding a single sequence, this has been implemented by two packages: Mfold [12] and RNAfold [13]. However, there are several independent reasons why the accuracy of MFE structure prediction is limited in practise (see discussion below). Generally the best accuracy can be achieved by employing comparative methods [14]. This paper explores the extent to which this statement is true, given the current state of the art, for automated methods. There are currently three approaches to automated comparative RNA sequence analysis where the comparative study is supported by available algorithms (see plans A, B, and C, figure 1). A researcher following plan A may align sequences using standard multiple sequence alignment tools (i.e. ClustalW [15], t-coffee [16], prrn [17],...), then use signals provided by structure neutral mutations for the inference of a consensus structure. Frequently the mutual-information measure is used for this [18-20]. Recently tools have been developed that use a combination of MFE and a covariation-score [21,22] or probabilistic models compiled from large reference data-sets [23,24]. However, a multiple-sequence-alignment step assumes a well conserved sequence. This is often not so with swiftly evolving ncRNA sequences, in this case incorrect sequence alignments can destroy any covariation signal. This has motivated plan B, the use of the "Sankoff-Algorithm", an algorithm designed for the simultaneous alignment, folding and inference of a protosequence for a set of homologous structural RNA sequences [25]. The recurrences combine sequence alignment and Nussinov (maximal pairing) folding [26]. The algorithm requires extreme computational resources (O(n3m) in time, and O(n2m) in space, where n is the sequence length and m is the number of sequences). Current implementations, Foldalign [27,28], Dynalign [29] and PMcomp [26], are restricted implementations of the Sankoff-algorithm which impose pragmatic limits on the size or shape of substructures. The final approach (plan C) applies when no helpful level of sequence conservation is observed. We may exclude the sequence alignment step, predict secondary structures for each sequence (or sub-group of sequences) separately, and directly align the structures. Because of the nested branching nature of RNA structures, these are adequately represented as trees. The concept of a similarity measurement via edit operations, a standard procedure for string comparisons, has been generalised to trees [30-33]. Tree comparison and tree alignment models have been proposed [34,35] and implemented [13,36-39]. The crucial point in plan C is the question whether the initial independent folding produces at least some structures that align well and hence give clues as to the underlying consensus structure – when one exists. An increasing number of researchers have recently released novel RNA structure analysis and prediction algorithms [22,23,37,40-43]. Yet few algorithms are tested upon standardised example data-sets, and often they are not compared with algorithms of the same pedigree. Algorithm evaluation is a regular event for protein structure prediction groups [44-47], gene-prediction [48-50] and multiple sequence alignments [51-54]. Based on reliable data-sets, we evaluate: • the viability of plan A, B, or C given tools available today, and • the relative performance of the tools used within each plan. We shall explicitly not evaluate computational efficiency, which (by necessity) differs widely between the tools. We also do not evaluate user friendliness (such as ease of installation and convenience of input or output formats, etc.) except for some remarks in the discussion section. Data-sets, documentation and relevant scripts are freely available from . Structural alignments and consensus structures RNA secondary structure inference is the prediction of base-pairs which form the in vivo structure, given only the sequence of bases. Three general considerations apply: (1) The in vivo structure is not only predetermined by the primary structure, but also by cellular components such as chaperones, base modifications, and even by the transcriptional process itself. There are currently no computational tools available that assess these effects. (2) There are 'ribo-switches', whereby two or more functional structures exist for a given sequence [55-57]. Such cases will fool all the tools studied here, because asking for a single consensus structure is simply the wrong question. On the other hand, the potential of conformational switching can be reliably detected [58-60]. (3) Structures may contain pseudo-knots, which are ignored by most current tools due to reasons of computational complexity and scarcity of these motifs. We do not consider pseudoknots here. However, several comparative approaches that include pseudoknots are currently under development, and certainly merit a comparative study of their own. Note that in an application scenario, we often do not know whether the considerations (1–3) apply. The comparative approach to structure inference is initiated from a set of homologous RNA sequences. Attempts are made to infer the in-vivo structure for each of them, as well as a consensus structure that captures the common, relevant structural aspects. The consensus structure per se does not exist in vivo, and so some mathematical rigour should be applied when working with this notion. An RNA sequence is a string over the RNA alphabet {A, C, G, U}. An RNA sequence B = b1,...,bn contains n bases, but no structural information. For comparative analysis, we are given the RNA sequences B1,...,Bk. A secondary structure can be associated with each sequence B as a string S over the alphabet {"(",".",")"}, where parentheses in S must be properly nested, and B and S must be compatible: If (si, sj) are matching parentheses, then (bi, bj) must be a legal base-pair. A base-pair is also denoted as bi·bj, si·sj, or simply i·j when the sequence is clear from the context. Both sequences and structures may be padded with a gap symbol "-", in order to align sequences and structures of different lengths. For compatibility of padded sequences and structures, we require that bi = "-" iff si = "-". A multiple structural alignment is a multiple sequence alignment of the 2 * k sequences, B1, S1,..., Bk, Sk, such that Bi is compatible with Si, and the following consistency criterion is satisfied: For any Si and Sj and any base-pair , we have ≠ ")" and ≠ "(", and if = "(" or = ")", then . This means that if one partner of a base-pair in Sj is aligned to one partner in Si, their partners must also be aligned to each other (see figure 2 for an illustration). A consensus structure C for a multiple structural alignment can be determined by a majority rule approach using a threshold p with 0.5

97%) [61]. This data specifies a "consensus by example", as explained above, to which our consensus reconstruction was applied to obtain the "true" consensus. To avoid results bias, we constructed test alignments, with corresponding phylogenies that, wherever possible, were free of highly similar clades. In addition, we endeavoured to ensure that the reference sequence was central to the phylogeny, or more specifically, not an out group. To meet these requirements, sequences from large data-sets were sorted into high-similarity and medium-similarity groups (with respect to the model sequence), from which maximum-likelihood phylogenies [62] were constructed. These were pruned until the desired size and topology was achieved. For each data-set two sequence alignments were constructed, one of high sequence identity (approximately 90–99%) and the other more diverse data-set of medium sequence identity (approximately 70–90%). Our data-sets are quite diverse and must for the purposes of this study be considered difficult to analyse in structural terms. The shape of ribosomal RNA is believed to be influenced by interaction with ribosomal proteins. The shape of RNase P shows relatively little sequence and structure conservation, and furthermore, it contains pseudoknots which are generally excluded by prediction algorithms. Transfer RNAs are known to be a hard case for thermodynamic folding, primarily due to the propensity of modified bases which influence structure formation. All tools tested may perform better upon less complex data-sets, but the purpose of this study is not to show how good the algorithms are but to compare relative performance when prediction is difficult. Performance Measures Sensitivity (X) and selectivity (Y) are common measures for determining the accuracy of prediction methods [63]. Selectivity is also known as the "specificity" [28] and "positive predictive value" [64,65]. We use slightly modify versions of the standard definitions of X and Y for examining RNA secondary structure prediction: where TP is the number of "true positives" (correctly predicted base-pairs), FN is the number of "false negatives" (base-pairs in the reference structure that were not predicted) and FP is the number of "false positives" (in-correctly predicted base-pairs). However, not all FP base-pairs are equally false! We classify FP base-pairs as either inconsistent, contradicting or compatible. Predicted base-pairs which conflict with a base-pair in the reference structure are labelled inconsistent (i.e. i·j is predicted where either i·k and/or h·j are paired in the reference structure (h ≠ i and j ≠ k)). Predicted base-pairs (i·j) which are non-nested with respect to the reference structure are labelled contradicting (i.e. there exists base-pairs k·l in the reference satisfying k 1, where , , and pij are pair-probabilities computed using McCaskilPs partition function [75]. The new accuracy ranges were 29–71%, 92–100% and 0.53–0.84. A related approach for trimming of low probability was recently shown to improve the selectivity of MFE predictions [65]. MARNA is generally less dependant upon the accuracy of the input structures hence performs slightly better with the poorly predicted tRNA structures than RNAforester. Discussion We have evaluated three different strategies for comparative structure prediction, and altogether eight tools (not counting the single sequence methods). The results of which are summarised in figures 3 &4. A surprising discovery given that the test data-sets are so diverse is that algorithm specific clusters formed in sensitivity versus selectivity scatter plots, indicating algorithm-specific eccentricities. A number of algorithms which might have been evaluated here have been excluded, primarily due to the heavy computational costs of the various implementations on our longer data-sets. We favoured recent algorithms which could be compiled on modern computers and those with input and output which could be simply dealt with (for example returning dot-bracket [13,37,91] or tabular-connect type formats [12,29,41], rather than coordinates and lengths of stacks or graphic (gif/pdf) representations favoured by a minority of researchers). Practical recommendations For well aligned short sequences, both Pfold and RNAalifold generally perform well, PFold performed marginally better than RNAalifold. It is likely that some moderate refinements to RNAalifold would improve accuracy without altering the efficiency, for example, if gaps were not penalised in the free-energy evaluation and a more sophisticated model for scoring mutations was employed, perhaps ribosum matrices [92] could be used to weight base-pair bonuses and penalties. For well aligned, long sequences the performance and speed of RNAalifold was excellent. For data-sets consisting of short (< 200 bases) and diverse sequences Dynalign might do well, as it does not require sequence similarity – in fact the scoring function does not include sequence comparison. Otherwise, one might choose to use a mixture of RNAalifold and/or Pfold to fold similar clades and RNAforester and/or MARNA to align folded clades. Advocates of plan A should note that many multiple sequence alignment algorithms generally do not favour transitions over transversions or employ ad hoc 2-parameter methods to model these (ClustalW [15] for example). Structural RNA sequences however evolve rapidly via structure neutral mutations which are frequently transitions and rarely transversions [92,93]. Multiple sequence algorithms which employ more complex yet more accurate models of sequence evolution will undoubtedly produce "better" alignments for folding. Carnac produced highly selective structures for all the test data-sets, which if used to constrain a free energy fold produced sensitive predictions with a cost to selectivity. The consistency of Carnac performance is remarkable, for all the data-sets considered here this heuristic approach performed well. It is however unclear how Carnac will perform on highly diverse data-sets. For advocates of plan C, we have an encouraging message: Both MARNA and RNAforester perform better on the medium similarity data than on high similarity data. This seems paradoxical at first glance, but one must understand that for an approach purely based on predicted structures, high sequence similarity can be a curse rather than a blessing: If sequences are very similar, they may jointly fold into the wrong MFE structure. With more sequence variation, it becomes more likely that at least some family members have good predictions, which by their mutual similarity can be picked out from the rest. This means that especially in the case of low sequence similarity, where nothing else works, plan C, currently the least explored strategy of all, has a certain promise. Conclusions Finally, let us outline some directions for future research. An implementation of the single sequence pseudoknot algorithms [42,43,94] employing similar strategies to RNAalifold [21] for alignment folding would be most useful. Based upon the RNAalifold results this approach would dramatically increase the accuracy of these algorithms upon certain data-sets. Also, an extension of these allowing constrained foldings to incorporate prior knowledge would be of assistance, this has proved extremely useful for MFE predictions. Sampling structures from reference alignments is also likely to prove beneficial. The implementation of fast and accurate variants of the Sankoff algorithm remains an open problem. Again allowing constrained foldings and alignments would be useful. The further development of "BLAST-like" folding heuristics for this should be a priority, obviously Carnac is a good start. The MARNA approach for producing structurally enhanced multiple alignments produced rather selective results after trimming high-entropy base-pairs from MFE predictions. This suggests that weighting edit-distances with partition-function derived probabilities or entropies will produce reasonable RNA alignments. A consensus structure could then be derived from MFE-structures or from PFold or RNAalifold predictions on the resultant alignment. This approach would effectively decouple the Sankoff algorithm into manageable structure-enhanced-alignment and folding stages. Note added in proof Two further developments are likely to increase the power of plan C. Pure multiple structure alignment (as opposed to pairwise alignment used here) presented in [95] may leave out some misfolded structures from a progressively constructed profile aligment. A small but representative set of near-optimal structures can now be derived by abstract shape analysis [96]. Combining both approaches, one could consider a progressive multiple alignment approach where these representative, near-optimal structures are included for each sequence. More training data is essential for this field to progress, for this homology search tools are essential. Infernal [91,97] used to construct the Rfam database [98,99] is an excellent approach but sensitivity might be increased with a phylogenetic approach and RNA-specific sequence search tools. The implementation of methods combining energetics, covariation [21] and co-transcriptional folding [100] in a statistically reasonable manner is also a potentially fruitful direction for development. Authors' contributions PPG carried out the experiments, the analysis and drafted the manuscript. RG suggested comparing comparative structure prediction methods and assisted in the manuscript preparation. All authors read and approved the final manuscript.