> top > docs > PMC:514697 > spans > 901-1018

PMC:514697 / 901-1018 JSONTXT

Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics Abstract Background The general problem of RNA secondary structure prediction under the widely used thermodynamic model is known to be NP-complete when the structures considered include arbitrary pseudoknots. For restricted classes of pseudoknots, several polynomial time algorithms have been designed, where the O(n6)time and O(n4) space algorithm by Rivas and Eddy is currently the best available program. Results We introduce the class of canonical simple recursive pseudoknots and present an algorithm that requires O(n4) time and O(n2) space to predict the energetically optimal structure of an RNA sequence, possible containing such pseudoknots. Evaluation against a large collection of known pseudoknotted structures shows the adequacy of the canonization approach and our algorithm. Conclusions RNA pseudoknots of medium size can now be predicted reliably as well as efficiently by the new algorithm. Background Biological relevance Pseudoknots have been shown to be functionally relevant in many RNA mediated processes. Examples are the self-splicing group I introns [1], ribosomal RNAs, or RNaseP. Recently, pseudoknots were located in prion proteins of humans, and confirmed for many other species [2]. With the current increased interest in the universe of RNA functions [3], algorithmic support for analysing structures that include pseudoknots is much in demand. Previous algorithmic work Well established algorithms for the prediction of RNA secondary structures (MFOLD [4], RNAfold [5]) are commonly based on a thermodynamic model [6], returning a structure of minimal free energy, called MFE-structure for short. In spite of their importance, pseudoknots are excluded from consideration by these programs for reasons of computational complexity: While folding a sequence of length n into unknotted structures requires O(n3) time and O(n2) space, finding the best structure including arbitrary pseudoknots has been proved to be NP-complete [7,8]. In fact, the proof given in [8] uses a scoring scheme based on adjacent base pairs only, simpler than the MFE model because it neglects entropic energies from loops. These complexity results leave two routes to achieve practical algorithms. The first route is to consider pseudoknots in full generality, but resort to an even more simplistic energy model. An O(n4) time and O(n3) space algorithm for base pair maximization has been given in [7], and an O(n3) time algorithm based on maximum weight matching in [9] and [10]. The second route is the one followed here: We retain the established thermodynamic model, but restrict to a more tractable subclass of pseudoknots. For some quite general classes of pseudoknots, polynomial time algorithms have been designed: Rivas and Eddy achieve O(n6) time and O(n4) space [11]. This algorithm is available, and, in spite of the high computational cost, it is actually used in practice. We build upon this work and shall call it pknotsRE for later reference. Further improvements have been shown to be possible for yet more restricted classes, e.g. the non-recursive simple pseudoknots considered by Lyngsø and Pedersen [12] with O(n5) time and O(n4) space, but to our knowledge, no implementations are available. Recently, an O(n4) time and O(n3) space algorithm based on the technique of [7], that uses a thermodynamic model has been reported in [13]. While it handles simple pseudoknots consisting of more than two helices, it is restricted to non-recursive pseudoknots. Thus, this class of pseudoknots and the class presented here have a nonempty intersection, but neither of them contains the other. Our contributions The new contributions reported here are the following: • We present an algorithm pknotsRG for folding RNA secondary structures including pseudoknots under the MFE model which requires O(n4) time and O(n2) space. • The algorithm considers the class of simple recursive pseudoknots, further restricted by three rules of canonization. Each simple recursive pseudoknot has a canonical representative that is recognized by pknotsRG. • While this class is more restricted than the one of the Rivas/Eddy algorithm, practical evaluation shows that our algorithm finds the same pseudoknots, while the length range of tractable sequences is increased significantly. • We provide an evaluation of the class of pseudoknots introduced here against known examples from the literature. • We perform a rigorous evaluation of our algorithm on 212 sequences from PseudoBase [14] plus 7 other structures and compare our results with those obtained with RNAfold and, where feasible, with pknotsRE. Results It is not easy to relate the classes of pseudoknots recognized by the different algorithms mentioned above. We refer the reader to the review by Lyngsø and Pedersen [8], which compares these classes by means of examples. The starting point of our work is the algorithm pknotsRE by Rivas and Eddy. It recognizes pseudoknots that can be nested and can have unlimited chains of helices involved in crosswise interactions. The drawback of this powerful, but computationally expensive algorithm is the following paradox: Pseudoknots with complex helix interactions naturally require longer primary sequence than simpler ones. The high runtime complexity of O(n6), however, as well as the space consumption of O(n4) restricts the use of this algorithm to a maximal sequence length of around 150 nucleotides. Most of the pseudoknots predicted belong to a much simpler structural class and do not exhibit chains of crosswise interactions. The algorithm developed here achieves time complexity O(n4) and space complexity O(n2). The runtime improvement, compared to pknotsRE, results from an idea of canonization, while the space improvement results from disallowing chained pseudoknots. These improvements extend the range of tractable sequences to a length up to 800 nucleotides, and we can locate pseudoknots up to this size in even longer sequences. Simple recursive pseudoknots Following the terminology of [7], a simple pseudoknot is a crosswise interaction of two helices, as shown in Figure 1. In simple recursive pseudoknots, we allow the unpaired strands u, v, w in a simple pseudoknot to fold internally in an arbitrary way, including simple recursive pseudoknots. Let us call this class sr-PK. More complex knotted structures like triple crossing helices or kissing hairpins, as shown in Figure 4, are excluded from sr-PK. We will show later how they can be integrated in our approach and outline the increased computational cost of doing so. For the main part of this paper, we concentrate on the class sr-PK. Anticipating the complexity of a DP algorithm Thermodynamic RNA folding is implemented via dynamic programming (DP). We start with a semi-formal discussion of how to estimate the efficiency of a DP algorithm for folding (or any kind of motif search) before it is written in detail. We consider elements of RNA structure as sequence motifs of different types: hairpins, bulges, multiloops, etc. The following notation is taken from the algebraic dynamic programming approach [15]. By an equation m = f <<< a ~~~ b ~~~ c | | | g <<< c ~~~ a we specify that the sequence motif m can be composed in two alternative ways: The first case, labelled by f, requires adjacent occurrences of motifs a, b and c. The second case, labelled by g, requires adjacent occurrences of motifs c and a. When motif m is to be scored, f and g are seen as the scoring functions that combine the local score contribution of each case with the scores of sub-motifs a, b, and c. What is the computational effort of locating motif m in an input sequence x of length n, say at sequence positions i through j? First we assume that all motifs can have arbitrary size between 0 and n. The algorithm must consider all boundary positions (i, j) for motif m, which requires O(n2) steps at least. In case g, it must consider all boundary positions k where motifs c meets a, such that the runtime for case g is in O(n3). In case f, there are two such moving boundaries k and l between the three sub-motifs, so we obtain O(n4) overall for motif m. This can be improved if there is an upper bound on the size of some motif involved. If motif a is a single base, for example, the exponent of n decreases by 1 in both cases. Furthermore, if motif b is (say) a loop of maximal size 40, then one factor of n is reduced to a constant factor and overall asymptotic runtime is now O(n2). Sometimes a motif description can be restructured to improve efficiency by reducing the number of moving boundaries. Whether or not this is possible does not depend on the motif structure, but on the scoring scheme! This is a somewhat surprising observation from [15], where such optimizations are studied, and where also the line of reasoning exercised here is given a mathematical basis. In the sequel, we shall exploit another source of efficiency improvement. If the lengths of two sub-motifs are coupled somehow, say a and c have the same length, then the boundaries k and l in case f are related by k - i = j - l. When iterating over k, we can use l := j - k + i (rather than k ≤ l ≤ j)and save another factor of n. Canonization When the search space of a combinatorial problem seems to be too complex to be evaluated efficiently, heuristics are employed. Canonization restricts the search space in a well-defined way, arguing that all the relevant solutions in the full search space have a representative that is canonical, and hence, nothing relevant is overlooked. One such technique is the purging of structures that have isolated basepairs. Here the plausibility argument refers to the underlying energy model, where base pairings without stacking have little or no stabilizing effect. This canonization does not affect efficiency, but it achieves a significant reduction of the search space (figures in [16]), which renders the enumeration of near-optimal solutions [17] much more meaningful. We shall introduce three canonization rules that reduce class sr-PK to the class of canonized simple recursive pseudoknots, csr-PK. Using the notation introduced above, the motif definition of a simple recursive pseudoknot is given by knot = knt <<< a ~~~ u ~~~ b ~~~ v ~~~ a' ~~~ w ~~~ b' with boundaries at sequence positions i, e, k, g, f, l, h, j as shown in Figure 2. Segment a forms a helix with a', and b with b'. Segments u, v, and w can have arbitrary structures, including pseudoknots. Naively implemented, we can expect a DP algorithm of time complexity O(n8) according to our efficiency estimation technique introduced above. We now apply canonization. Note that it only applies to helices forming pseudoknots; other helices are unaffected. We first present the technical aspects; the discussion of these restrictions is deferred to the next section. Canonization rule 1 (a) Both strands in a helix must have the same length, i.e. |a| = |a'| and |b| = |b'|. (b) Both helices must not have bulges. Note that (b) is a stronger restriction and trivially implies (a). Under the regime of Rule 1 we may conclude: f = l - (e - i) h = j - (g - k) We are left with 6 out of 8 boundaries that vary independently, and runtime is down to O(n6). Canonization rule 2 The helices a, a' and b, b' facing each other must have maximal extent, or in other words, compartment v must be as short as possible under the rules of base pairing. We observe that the maximal length of a and a' is fixed once i and l are chosen. The maximal helix length stacklen(i, l) can be precomputed and stored in an O(n2) table. The same observation holds with respect to the other helix, and we fix e = i + stacklen(i, l) g = k + stacklen(k, j). Thus, we are left with only four independently moving boundaries – i, k, l, j –, and can hope to obtain an algorithm with runtime O(n4). Scores of pseudoknots found between i and j are stored in table knot(i, j), and hence the space requirements are O(n2), which is the same asymptotic space efficiency as in the folding of unknotted structures. A subtlety arises when both helices, chosen maximally, compete for the same bases of v, or in other words, the length of v would become negative. This case is addressed by Canonization rule 3 If two maximal helices would overlap, their boundary is fixed at an arbitrary point between them. Let m and m' be the helix lengths so determined. We finally obtain e = i + m g = k + m' The language of pseudoknots in class csr-PK can be defined by a simple context free grammar over an infinite terminal alphabet. Let ak denote a terminal symbol of k times the letter a. The grammar uses a single nonterminal symbol S and its productions are for arbitrary k, l ≥ 1. For example, the simple pseudoknot of Figure 1 is represented as the string .. [[[......{{..]]]]..........}}. This grammar is useful to judge how different an experimentally determined structure is from class csr-PK. It is not useful for programming, since it is ambiguous and does not distinguish the fine grained level of detail required in the energy model. Canonical representatives A careful discussion is required to show that each simple recursive pseudoknot, if not canonical by itself, has (a) a canonical representative of (b) similar free energy. Rule 1 (b) affects the length of helices that are considered in forming the pseudoknot. Let there be a pseudoknot between i' and j'. It is not canonical if one of the two helices contains bulges. However, there must be at least one pair of shorter helices without bulges at i, j with i' ≤ i and j ≤ j', which serves as a canonical representative, albeit with somewhat higher free energy. Rule 2 is justified by the fact that the energy model strongly favours helix extension. Clearly, for each family of pseudoknots delineated by i, k, l, j there is a canonical one with maximal helices, whose free energy is at least as low – except for the following case: The maximal helices compete with the internal structure of u, v and w. It may be possible to contrive a structure where shortening (say) helix a – a' by one base pair allows to create two pairs with new partner bases in u and v, resulting in a structure which has slightly lower energy. Still, the free energy of the canonical pseudoknot must be very similar. Finally, Rule 3 requires a decision where to draw the border between two helices facing each other and competing for the same bases. An arbitrary decision here can only slightly affect free energy, as the same base pairs are stacked either on the a – a' or the b – b' helix. Let E(s) denote the free energy computed for structure s. Summing up, we have shown that for each simple recursive pseudoknot K, there is a canonical one C in the search space. While we cannot prove that E(C) ≤ E(K), we have argued that this is likely, and if not, the energies will at least be close. Still, there might be another, energetically optimal canonical structure S (knotted or not) such that E(K)

Document structure show

projects that have annotations to this span

There is no project