> top > docs > PMC:1459172 > spans > 22125-22770

PMC:1459172 / 22125-22770 JSONTXT

Partition function and base pairing probabilities of RNA heterodimers Abstract Background RNA has been recognized as a key player in cellular regulation in recent years. In many cases, non-coding RNAs exert their function by binding to other nucleic acids, as in the case of microRNAs and snoRNAs. The specificity of these interactions derives from the stability of inter-molecular base pairing. The accurate computational treatment of RNA-RNA binding therefore lies at the heart of target prediction algorithms. Methods The standard dynamic programming algorithms for computing secondary structures of linear single-stranded RNA molecules are extended to the co-folding of two interacting RNAs. Results We present a program, RNAcofold, that computes the hybridization energy and base pairing pattern of a pair of interacting RNA molecules. In contrast to earlier approaches, complex internal structures in both RNAs are fully taken into account. RNAcofold supports the calculation of the minimum energy structure and of a complete set of suboptimal structures in an energy band above the ground state. Furthermore, it provides an extension of McCaskill's partition function algorithm to compute base pairing probabilities, realistic interaction energies, and equilibrium concentrations of duplex structures. Availability RNAcofold is distributed as part of the Vienna RNA Package, . Contact Stephan H. Bernhart – berni@tbi.univie.ac.at Background Over the last decade, our picture of RNA as a mere information carrier has changed dramatically. Since the discovery of microRNAs and siRNAs (see e.g. [1,2] for a recent reviews), small noncoding RNAs have been recognized as key regulators in gene expression. Both computational surveys, e.g. [3-7] and experimental data [8-11] now provide compelling evidence that non-protein-coding transcripts are a common phenomenon. Indeed, at least in higher eukaryotes, the complexity of the non-coding RNome appears to be comparable with the complexity of the proteome. This extensive inventory of non-coding RNAs has been implicated in diverse mechanisms of gene regulation, see e.g. [12-16] for reviews. Regulatory RNAs more often than not function by means of direct RNA-RNA binding. The specificity of these interactions is a direct consequence of complementary base pairing, allowing the same basic mechanisms to be used with very high specificity in large collections of target and effector RNAs. This mechanism underlies the post-transcriptional gene silencing pathways of microRNAs and siRNAs (reviewed e.g. in [17]), it is crucial for snoRNA-directed RNA editing [18], and it is used in the gRNA directed mRNA editing in kinetoplastids [19]. Furthermore, RNA-RNA interactions determine the specificity of important experimental techniques for changing the gene expression patterns including RNAi [20] and modifier RNAs [21-24]. RNA-RNA binding occurs by formation of stacked intermolecular base pairs, which of course compete with the propensity of both interacting partners to form intramolecular base pairs. These base pairing patterns, usually referred to as secondary structures, not only comprise the dominating part of the energetics of structure formation, they also appear as intermediates in the formation of the tertiary structure of RNAs [25], and they are in many cases well conserved in evolution. Consequently, secondary structures provide a convenient, and computationally tractable, approximation not only to RNA structure but also to the thermodynamics of RNA-RNA interaction. From the computational point of view, this requires the extension of RNA folding algorithms to include intermolecular as well as intramolecular base pairs. Several approximations have been described in the literature: Rehmsmeier et al. [26] as well as Dimitrov and Zuker [27] introduced algorithms that consider exclusively intermolecular base pairs, leading to a drastic algorithmic simplification of the folding algorithms since multi-branch loops are by construction excluded in this case. Andronescu et al. [28], like the present contribution, consider all base pairs that can be formed in secondary structures in a concatenation of the two hybridizing molecules. This set in particular contains the complete structural ensemble of both partners in isolation. Mückstein et al. [29] recently considered an asymmetric model in which base pairing is unrestricted in a large target RNA, while the (short) interaction partner is restricted to intermolecular base pairs. A consistent treatment of the thermodynamic aspects of RNA-RNA interactions requires that one takes into account the entire ensemble of suboptimal structures. This can be approximated by explicitly computing all structures in an energy band above the ground state. Corresponding algorithms are discussed in [30] for single RNAs and in [28] for two interacting RNAs. A more direct approach, that becomes much more efficient for larger molecules, is to directly compute the partition function of the entire ensemble along the lines of McCaskill's algorithm [31]. This is the main topic of the present contribution. As pointed out by Dimitrov and Zuker [27], the concentration of the two interacting RNAs as well as the possibility to form homo-dimers plays an important role and cannot be neglected when quantitative predictions on RNA-RNA binding are required. In our implementation of RNAcofold we therefore follow their approach and explicitly compute the concentration dependencies of the equilibrium ensemble in a mixture of two partially hybridizing RNA species. This contribution is organized as follows: We first review the energy model for RNA secondary structures and recall the minimum energy folding algorithm for simple linear RNA molecules. Then we discuss the modifications that are necessary to treat intermolecular base pairs in the partition function setting and describe the computation of base pairing probabilities. Then the equations for concentration dependencies are derived. Short sections summarize implementation, performance, as well as an application to real-world data. RNA secondary structures A secondary structure S on a sequence x of length n is a set of base pairs (i, j), i The presentation in this section closely follows a recent paper by Dimitrov [27], albeit we use here slightly different definitions of the partitions functions. The partition functions of the secondary structures of the monomeric states are ZA and ZB, respectively, as introduced in the previous section. In contrast to [27], we include the unfolded states in these partition functions. The partition functions ZAA, ZBB, and ZAB, which are the output of the RNAcofold algorithm (denoted Z in the previous section), include those states in which each monomer forms base-pairs only within itself as well as the unfolded monomers. We can now define as the partition functions restricted to the true dimer states, but neglecting the initiation energies ΘI. An additional symmetry correction is needed in the case of the homo-dimers: A structure of a homo-dimer is symmetric if for any base pair (i, j) there exists a pair (i', j'), where i' (j') denotes the equivalent of position i in the other copy of the molecule. Such symmetric structures have a two-fold rotational symmetry that reduces their conformation space by a factor of 2, resulting in an entropic penalty of ΔGsym = RT ln 2. On the other hand, since the recursion for the partition functions eq. 6 assumes two distinguishable molecules A and B, any asymmetric structures of a homo-dimer are in fact counted twice by the recursion. Leading to the same correction as for symmetric structures. Since both the initiation energy ΘI and the symmetry correction ΔGsym are independent of the sequence length and composition, the thermodynamically correct partition functions for the three dimer species are given by From the partition functions we get the free energies of the dimer species, such as FAB = -RT ln , and the free energy of binding ΔF = FAB - FA - FB. We assume that pressure and volume are constant and that the solution is sufficiently dilute so that excluded volume effects can be neglected. The many particle partition function for this system is therefore [27] where a = nA + 2nAA+ nAB is the total number of molecules of type A put into the solution (equivalently for b); nA, nB, nAA, nBB, nAB are the particle numbers for the five different monomer and dimer species, V is the volume and n is the sum of the particle numbers. The system now minimizes the free energy -kT ln , i.e., it maximizes , by choosing the particle numbers optimally. As in [27], the dimer concentrations are therefore determined by the mass action equilibria: [AA] = KAA[A]2 [BB] = KBB[B]2 [AB] = KAB[A][B]     (14) with Concentrations in eq.(14) are in mol/l. Note, however, that the equilibrium constants in eq.(15) are computed from a different microscopic model than in [27], which in particular also includes internal base pairs within the dimers. Together with the constraints on particle numbers, eq.(14) forms a complete set of equations to determine x = [A] and y = [B] from a and b by solving the resulting quadratic equation in two variables: 0 = f(x, y) : = x + KABxy + 2KABx2 - a 0 = g(x, y) : = y + KABxy + 2KBBy2 - b     (16) The Jacobian of this system is strictly positive and diagonally dominated, and hence invertible on ℝ+ × ℝ+. Furthermore f and g are thrice continuously differentiable on = [0, a] × [0, b] and we know (because of mass conservation and the finiteness of the equilibrium constants) that the solution (, ) is contained in the interior of the rectangle . Newton's iteration method thus converges(at least) quadratically [48, 5.4.2]. We use (a, b) as initial values for the iteration. Implementation and performance The algorithm is implemented in ANSI C, and is distributed as part of the of the Vienna RNA package. The resource requirements of RNAcofold and RNAfold are theoretically the same: both require (n3) CPU time and (n2) memory. In practice, however, keeping track of the cut makes the evaluation of the loop energies much more expensive and increases the CPU time requirements by an order of magnitude: RNAcofold takes about 22 minutes to cofold an about 3000 nt mRNA with a 20 nt miRNA on an Intel Pentium 4 (3.2 GHz), while RNAfold takes about 3 minutes to fold the concatenated molecule. The base pairing probabilities are represented as a dot plot in which squares with an area proportional to Pij represent the the raw pairing probabilities, see Fig. 2. The dot plot is provided as Postscript file which is structured in such a way that the raw data can be easily recovered explicitly. RNAcofold also computes a table of monomer and dimer concentrations dependent on a set of user supplied initial conditions. This feature can readily be used to investigate the concentration dependence of RNA-RNA hybridization, see Fig. 3 for an example. Figure 2 Dot plot (left) and mfe structure representation (right) of the cofolding structure of the two RNA molecules AUGAAGAUGA (red) and CUGUCUGUCUUGAGACA (blue). Dot Plot: Upper right: Partition function. The area of the squares is proportional to the corresponding pair probabilities. Lower left: Minimum free energy structure. The two lines forming a cross indicate the cut point, intermolecular base pairs are depicted in the green upper right (partition function) and lower left (mfe) rectangle. Figure 3 Example for the concentration dependency for two mRNA-siRNA binding experiments. In [54], Schubert et al. designed several mRNAs with identical target sites for an siRNA si, which are located in different secondary structures. In variant A, the VR1 straight mRNA, the binding site is unpaired, while in the mutant mRNA VR1 HP5-11, A', only 11 bases remain unpaired. We assume an mRNA concentration of a = 10 nmol/1 for both experiments. Despite the similar binding pattern, the binding energies (ΔF = FAB - FA - FB) differ dramatically. In [54], the authors observed 10% expression for VR1 straight, and 30% expression for the HP5-11 mutant. Our calculation shows that even if siRNA is added in excess, a large fraction of the VR1 HP5-11 mRNA remains unbound. Like RNAfold, RNAcofold can be used to compute DNA dimers by replacing the RNA parameter set by a suitable set of DNA parameters. At present, the computation of DNA-RNA heterodimers is not supported. This would not only require a complete set of DNA-RNA parameters (stacking energies are available [49], but we are not aware of a complete set of loop energies) but also further complicate the evaluation of the loop energy contributions since pure RNA and pure DNA loops will have to be distinguished from mixed RNA-DNA loops. Applications Intermolecular binding of RNA molecules is important in a broad spectrum of cases, ranging from mRNA accessibility to siRNA or miRNA binding, RNA probe design, or designing RNA openers [50]. An important question that arises repeatedly is to explain differences in RNA-RNA binding between seemingly very similar or even identical binding sites. As demonstrated e.g. in [22,29,51,52], different RNA secondary structure of the target molecule can have dramatic effects on binding affinities even if the sequence of the binding site is identical. Since the comparison of base pairing patterns is a crucial step in such investigations we provide a tool for graphically comparing two dot plots, see Fig. 4. It is written in Perl-Tk and takes two dot plot files and, optionally, an alignment file as input. The differences between the two dot plots are displayed in color-code, the dot plot is zoomable and the identity and probability(-difference) of a base pair is displayed when a box is clicked. Figure 4 Difference dot Plot of native and mutated secondary structure of a 3 GU mutation of the CXCR4 siRNA gene. The red part on the right hand side shows the base pairing probability of the 5' part of the micro RNA, which is 80% higher in the native structure. This is an alternative explanation for the missing function of the mutant. Because of the mutations, the stack a little to the left gets more stable, and the probability of binding of the 5' end of the siRNA is reduced significantly.The color of the dots encodes the difference of the pair probabilities in the two molecules such positive (red) squares denote pairs more more probable in the second molecule (see color bar). The area of the dots is proportional to the larger of the two pair probabilities. As a simple example for the applicability of RNAcofold, we re-evaluate here parts of a recent study by Doench and Sharp [53]. In this work, the influence of GU base pairs on the effectivity of translation attenuation by miRNAs is assayed by mutating binding sites and comparing attenuation effectivity to wild type binding sites Introducing three GU base pairs into the mRNA/miRNA duplex did, with only minor changes to the binding energy, almost completely destroy the functionality of the binding site. While Doench and Sharp concluded that miRNA binding sites are not functional because of the GU base pairs, testing the dimer with RNAcofold shows that there is also a significant difference in the cofolding structure that might account for the activity difference without invoking sequence specificities: Because of the secondary structure of the target, the binding at the 5' end of the miRNA is much weaker than in the wild type, Fig. 4. Limitations and future extensions We have described here an algorithm to compute the partition function of the secondary structure of RNA dimers and to model in detail the thermodynamics of a mixture of two RNA species. At present, RNAcofold implements the most sophisticated method for modeling the interactions of two (large) RNAs. Because the no-pseudoknot condition is enforced to limit computational costs, our approach disregards certain interaction structures that are known to be important, including kissing hairpin complexes. The second limitation, which is of potential importance in particular in histochemical applications, is the restriction to dimeric complexes. More complex oligomers are likely to form in reality. The generalization of the present approach to trimers or tetramers is complicated by the fact that for more than two molecules the results of the calculation are not independent of the order of the concatenation any more, so that for M-mers (M - 1)! permutations have to be considered separately. This also leads to bookkeeping problems since every secondary structure still has to be counted exactly once. Acknowledgements This work has been funded, in part, by the Austrian GEN-AU projects bioinformatics integration network & bioinformatics integration network II, and the German DFG Bioinformatics Initiative BIZ-6/1-2.

Document structure show

projects that have annotations to this span

Unselected / annnotation Selected / annnotation
2_test (2)