PMC:1479348 / 38723-38739 JSONTXT

Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics Abstract The technique of Finite Markov Chain Imbedding (FMCI) is a classical approach to complex combinatorial problems related to sequences. In order to get efficient algorithms, it is known that such approaches need to be first rewritten using recursive relations. We propose here to give here a general recursive algorithms allowing to compute in a numerically stable manner exact Cumulative Distribution Function (CDF) or complementary CDF (CCDF). These algorithms are then applied in two particular cases: the local score of one sequence and pattern statistics. In both cases, asymptotic developments are derived. For the local score, our new approach allows for the very first time to compute exact p-values for a practical study (finding hydrophobic segments in a protein database) where only approximations were available before. In this study, the asymptotic approximations appear to be completely unreliable for 99.5% of the considered sequences. Concerning the pattern statistics, the new FMCI algorithms dramatically outperform the previous ones as they are more reliable, easier to implement, faster and with lower memory requirements. 1 Introduction The use of Markov chains is a classical approach to deal with complex combinatorial computations related to sequences. In the particular case of pattern count on random sequences, [5] named this method Finite Markov Chain Imbedding (FMCI, see [11] or [7] for a review). Using this technique it is possible to compute exact distributions otherwise delicate to obtain with classical combinatorial methods. More recently, [12] proposed a similar approach to consider local score on i.i.d. or Markovian ([13]) random sequences. Although these methods are very elegant, they could require a lot of time and memory if they are implemented with a naive approach. The authors of [6] first stated that recursive relation could be established for any particular case in order to provide an efficient way to perform the computations. We propose here to explore in detail this idea with the aim to provide fast algorithms able to compute with high numerical accuracy both CDF (cumulative distribution function) and CCDF (complementary CDF) of any general problem which can be written as a FMCI. We apply then these results to the particular cases of local score and pattern statistics. In each case, asymptotic developments are derived and numerical results are presented. 2 Methods In this part, we first introduce in section 2.1 the FMCI and see the limits of naive approaches to their corresponding numerical computations. The main results are given in section 2.3 where we propose two effective algorithms able to to compute general FMCI p-values (algorithm 1) or complementary p-value (algorithm 2). The theoretical background for these algorithms is given in the section 2.2. 2.1 Finite Markov Chain Imbedding Let us consider X = X1,...,Xn a sequence of Bernoulli or Markov observations and En an event depending on the sequence X. We suppose that it is possible to build from X an order one Markov chain Z = Z1,...,Zn on the finite state space of size L. This space contains (in the order): k starting states denoted s1,...,sk, some intermediate states, and one final absorbing state f. The Markov chain is designed such as ℙ(En|Z1 = si) = ℙ(Zn = f|Z1 = si) = ∏n-1(si, f)     (1) where is the transition matrix of Z. If μ is the starting distribution of Z1, we hence get Using this approach (and a binary decomposition of n - 1), it is possible to compute the p-value with O(log2(n) × L2) memory complexity and O(log2(n) × L3) time complexity. As L usually grows very fast when we consider more complex events En, these complexities are a huge drawback of the method. Moreover, numerical precision considerations prevent this approach to give accurate results when using the relation ℙ() = 1 - ℙ(En) to compute the p-value of the complementary event (as the absolute error is then equal to the relative precision of the computations). 2.2 Effective computations Proposition 1. For all n ≥ 1 we have Proof. This trivial to establish by recurrence using matrix block multiplications.     □ We hence get the Corollary 2 (direct p-value). For all n ≥ 1 we have for all 1 ≤ i ≤ k     ℙ(En|X1 = si) =     and         (5) with yn-2 computable through the following recurrence relations: x0 = y0 = v     and, for all j ≥ 0     xj+1 = Rxj     and     yj+1 = yj+xj     (6) Proof. Simply use proposition 1 to rewrite equations (1) and (3). Recurrence relations are then obvious to establish.     □ And we also get the Corollary 3 (complementary p-value). For all n ≥ 1 we have for all 1 ≤ i ≤ k     ℙ(|X1 = si) =     and         (7) with x0 is a size L - 1 column vector filled with ones and with xn-1 = Rn-1x0 which is computable through the following recurrence relation: for all j ≥ 0     xj+1 = Rxj     (8) Proof. ∏ being a stochastic matrix, ∏n-1 is also stochastic, it is therefore clear that the sum of Rn-1 over the columns gives 1 - yn-2 and the corollary is proved.     □ Using these two corollaries, it is therefore possible to accurately compute the p-value of the event or of its complementary with a complexity O(L + ζ) in memory and O(n × ζ) in time where ζ is the number of non zero terms in the matrix R. In the worst case, ζ = (L - 1)2 but the technique of FMCI usually leads to a very sparse structure for R. One should note that these dramatic improvements from the naive approach could even get better by considering the structure of R itself, but this have to be done specifically for each considered problem. We will give detailed examples of this in both our application parts but, for the moment, we focus on the general case for which we give algorithms. 2.3 Algorithms Using with the corollary 2 we get a simple algorithm to compute p = ℙ(En) algorithm 1: direct p-value x is a real column vector of size L - 1 and y a real column vector of size k initialization x = (v1,...,vL-1)' and y = (v1,...,vk)' main loop for i = 1...n - 2 do     • x = R × x (sparse product)     • y = y + (x1,...,xk)' end return and using the corollary 3 we get an even simpler algorithm to compute the q = 1 - p = ℙ() algorithm 2: complementary p-value x is a real column vector of size L - 1 initialization x = (1,...,1)' main loop for i = 1...n - 1 do     • x = R × x (sparse product) end return The more critical stage of both these algorithms is the sparse product of the matrix R by a column vector which can be efficiently done with ζ operations. It is interesting to point out the fact that these algorithms do not require the stationarity of the underlying Markov chain. More surprisingly, it is also possible to relax the random sequence homogeneity assumption. Indeed, if our transition matrix ∏ depends on the position i in the sequence, we simply have to replace R in the algorithms with the corresponding Ri (which may use a significant amount of additional memory depending on its expression as a function of i). For complementary p-value, we require to compute R1R2...Rn-1Rnx which is easily done recursively starting from the right. In the direct p-value case however, it seems more difficult since we need to compute x + R1x + R1R2x + ... + R1R2...Rn-1Rnx. Fortunately this sum can be rewritten as x + R1(x + R2{... [x + Rn-1(x + Rnx)]...}) which is again easy to compute recursively starting from the right. The resulting complexities in the heterogeneous case are hence the same than in the homogeneous one (assuming that the number of non zero terms in Ri remains approximately constant). This remarkable property of the FMCI should be remembered especially in the biological field where most sequences are known to have complex heterogeneous structures which are often difficult to take into account. 3 Application 1: local score We propose in this part to apply our results to the computation of exact p-values for local score. We first recall the definition of the local score of one sequence (section 3.1) and design a FMCI allowing to compute p-value in the particular case of an integer and i.i.d. score (section 3.2). We explain in sections 3.5 and 3.6 how to relax these two restrictive assumptions to consider rational or Markovian scores. The main result of this part is given in section 3.4 where we propose an algorithm improving the simple application of the general ones by using a specific asymptotic behaviour presented in section 3.3. As numerical application, we propose finally in section 3.7 to find significant hydrophobic segments in the Swissprot database using the Kyte-Doolittle hydrophobic scale. Our exact results are compared to the classical Gumble asymptotic approximations and discussed both in terms of numerical performance and reliability. 3.1 Definition We consider S = S1,...,Sn a sequence of real scores and we define the local score Hn of this sequence by which is exactly the highest partial sum score of a subsequence of S. This local score can be computed in O(n) using the auxiliary process U0 = 0     and for 1 ≤ j ≤ n     = max{0, Uj-1 + Sj}     (10) because we then have Hn = maxj Uj. Assuming the sequence S is random (Bernoulli or Markov model), we want to compute p-values relative to the event En = {Hn ≥ a} where a > 0. 3.2 Integer score In order to simplify, we will first consider the case of integer scores (and hence a ∈ ≁) then we will extend the result to the case of rational scores. In the Bernoulli case, [12] introduced the FMCI Z defined by (resulting with a sequence of length n + 1) with 0 as the only starting state and a as the final absorbing state. The transition matrix ∏ is given by where p(i) = ℙ(S1 = i)     f(i) = ℙ(S1 ≤ i)     g(i) = ℙ(S1 ≥ i)     ∀i ∈ ℤ     (13) It is possible to apply to this case the general algorithm 1 with L = a + 1 and k = 1 (please note that we have added Z0 to the sequence and n must then be replaced by n + 1 in the algorithm to get correct computations) to compute the p-value we are looking for. In the worst case, R has ζ = a2 non zero terms and the resulting complexity is O(a2) in memory and O(n × a2) in times. But in most cases, S1 support is reduced to a small number of values and the complexities decrease accordingly. 3.3 Asymptotic development Is it possible to compute this p-value faster ? In the case where R admits a diagonal form, simple linear algebra could help to cut off the computations and answer yes to this question. Proposition 4. If R admits a diagonal form we have where []1 denotes the first component of a vector, with R∞ = limi→∞ Ri/λi, where 0 <λ < 1 is the largest eigenvalue of R and ν is the magnitude of the second largest eigenvalue. We also have v = [g(a),...,g(1)]'. Proof. By using the corollary 15 (appendix A) we know that Ri - λiR∞ = O(νi)     (15) uniformly in i so we finally get for all α uniformly for all n ≥ α and the proposition is then proved by considering the first component of equation (16).     □ Corollary 5. We have and Proof. Simply replace the terms in (17) and (18) with equation (14) to get the results.     □ 3.4 Algorithm The simplest way to compute ℙ(Hn ≥ a) is to use the algorithm 2 in our particular case. As the number of non zero terms in R is then a2, the resulting complexity is O(n × a2). Using the proposition 4, it possible to get the same result a bit faster on very long sequence by computing the first two largest eigenvalues magnitudes λ and ν (complexity in O(a2) with Arnoldi algorithms) and to use them to compute a p-value. As the absolute error is in O(να) we obtain a require ε error level using a α proportional to log(ε)/log(ν) which results in a final complexity in O(log(ε)/log(ν) × a2). Unfortunately, this last method requires to use delicate linear algebra techniques and is therefore more difficult to implement. Another better possibility is to use the corollary 5 to get the following fast and easy to implement algorithm: algorithm 3: local score p-value x a real column vector of size a, (pi)i≥1 and (λi)i≥3 to sequences of real and i an integer initialization x = [g(a),...,g(1)]', p1 = g(a), and i = 0 main loop while (i 1 Markov chain by modifying accordingly our vertex set. For example, if we consider an order m = 2 Markov model our vertex set becomes = {aa, ab, ba, bb, aba, abaa, abab} In all cases, if we denote by L the cardinal of . In order to count overlapping occurrences of a non degenerate pattern of length h on a size k alphabet we get L = k + h - 2 when an order 1 Markov model is considered and L = km + h - m - 1 for an order m > 1 Markov model. For a degenerate pattern of length h, L is more difficult to know as it depends on the degeneracy of the patterns, in the worst case L = kh-1, but L should be far smaller in most cases. One should note that L increases by the number of different words present in the pattern if we consider renewal occurrences instead of overlapping ones. Although construction and properties of DFA are well known in the theory of language and automata ([8]), their connexions to pattern statistics have surprisingly not been extensively studied in the literature. In particular, the strong relation presented here between the FMCI technique for pattern and DFA appears to have never been highlighted before. If this interesting subject obviously need to (and will soon) be investigated more deeply, it is not really the purpose of this article which focus more on the algorithmic treatment of a built FMCI. 4.3 FMCI Once a DFA and the corresponding matrices P and Q have been built, it is easy to get a FMCI allowing to compute the p-values we are looking for. Let us consider where Yj is the sequence of vertexes, Nj is the number of pattern occurrences in the sequence Y1...Yj (or X = X1...Xj as it is the same), where f is the final (absorbing state) and where a ∈ ℕ is the observed number of occurrences Nobs if the pattern is over-represented and Nobs + 1 if it is under-represented. The transition matrix of the Markov chain Z is then given by: where for all size L blocks i, j we have with ΣQ, the column vector resulting from the sum of Q. By plugin the structure of R and v in the corollaries 2 and 3 we get the following recurrences: Proposition 6. For all n ≥ 1 and 1 ≤ i ≤ k we have where for x = u or v we have ∀j ≥ 0 the following size L block decomposition: and we have the recurrence relations: with u0 = (1...1)' and v0 = v. 4.4 Algorithms Using the proposition 6 it is possible to get an algorithm computing our pattern statistic for an under-represented pattern observed Nobs times: algorithm 4under: exact statistics for under-represented pattern x0,..., and y0,..., are 2 × (Nobs + 1) real column vectors of size L initialization for j = 0...Nobs do xj = (1,...,1)' main loop for i = 1...(n - 1) do     • for j = 0...Nobs do yj = xj     • x0 = P × y0     • for j = 1...Nobs do xj = P × yj + Q × yj-1 end •     • return log10(q) If we consider now an over-represented pattern we get algorithm 4over: exact statistics for over-represented pattern x1,...,, y1,..., and z are 2Nobs + 1 real column vectors of size L initialization z = (0,...,0)', x1 = ΣQ and for j = 2...Nobs do xj = (0,...,0)' main loop for i = 1...(n - 2) do     • for j = 1...Nobs do yj = xj     • x1 = P × y1     • for j = 2...Nobs do xj = P × yj + Q × yj-1     • z = z + end •     • return -log10(p) As we have O(k × L) non zero terms in P + Q, the complexity of both of these algorithms is O(k × L + Nobs × L) in memory and O(k × L × n × Nobs) in time. To compute p-values out of floating point range (ex: smaller than 10-300 with C double), it is necessary to use log computations in the algorithms (not detailed here). The resulting complexity stays the same but the empirical running time is obviously slower. That is why we advise to use log-computation only when it is necessary (for example by considering first a rough approximation). 4.5 Asymptotic developments In this part we propose to derive asymptotic developments for pattern p-values from their recursive expressions. For under- (resp. over-) represented patterns, the main result is given in theorem 9 (resp. 12). In both cases, theses results are also presented in a simpler form (where only main terms are taken into account) in the following corollaries. Proposition 7. For any x = (x(a-1),...,x0)' and all β ≥ 0 xβ = Rβx is given by = Pβ and Proof. As = for all j ≤ 0 it is trivial to get the expression of . If we suppose now that the relation (28) is true for some i and β then, thanks to the relation (27) we have and so the proposition is proved through the principle of recurrence.     □ Lemma 8. For all i ≥ 0 and a ≤ b ∈ ℕ and r > 0 we define If r ≠ 1 we have for all i ≥ 0 we have and (case r = 1) for all i ≥ 0 we have Proof. Easily derived from the following relation Theorem 9. If P is primitive and admits a diagonal form we denote by λ > ν the largest two eigenvalues magnitude of P by P∞ = limi→+∞ Pi/λi (a positive matrix) and we get for all α ≥ 1 and i ≥ 0 uniformly in β and where is a polynomial of degree i which is defined by and for all i ≥ 1 by the following recurrence relation: Proof. See appendix B.     □ Corollary 10. With the same assumptions than in the theorem 9, for all α ≥ 1 and β ≥ (i+1)α we have Proof. Equation (37) and the lemma 8 gives and the result is then proved by a simple recurrence.     □ Proposition 11. For any x = (x(a-1),...,x0)' and all β ≥ 0, is given by and Proof. Using equation (28) we get which gives the proposition.     □ From this result (very similar to proposition 7) it is possible to get a new theorem Theorem 12. If P is primitive and admits a diagonal form we denote by λ > ν the largest two eigenvalues magnitude of P by P∞ = limi→+∞ Pi/λi (a positive matrix) and we get for all α ≥ 1 and i ≥ 0 uniformly in β and where is a constant term defined by and for all i ≥ 0 by the following recurrence relation and is a polynomial of degree i which is defined by and for all i ≥ 1 by the following recurrence relation: Proof. Easy to derive from the proof of theorem 9.     □ Corollary 13. We have the same assumptions than in the the theorem 12, for all α ≥ 1 and β ≥ (i + 1)α we have Proof. Easy to derive from the proof of corollary 10.     □ 4.6 Numerical results We propose here to consider numerical applications of these new FMCI pattern statistics algorithms and to compare their results and performance to exact computations using Simple Recurrences ([17] and [18]) denoted SR from now. All computations are performed using SPatt-1.2.0 package (see [14] or [15]) on a 2.7 Gz P4 with 512 Mo running the linux 2.6.8 system. As we can see on table 4, the computational time to obtain the pattern statistics for all simple words of a given length are quite similar with both approaches. One exception: the Bernoulli case (m = 0) where SR are roughly 10 times faster than FMCI computations. This is due to the fact that the Bernoulli case uses simpler recurrences than the ones used for the true Markovian cases (m ≥ 1). Similar simplifications in the DFA structures can reduce the computational time of FMCI approach in the independent case but they have not been implemented here (as their use is often marginal). Table 4 Computational time (in seconds) to get the statistics of all DNA words of length h in the HIV complete genome sequence (n = 9719) using either simple recurrences of finite Markov chain imbedding and in respect with an order m Markov model estimated on the genome. Markov order word length m = 0 m = 1 m = 2 h = 3 h = 4 h = 5 h = 3 h = 4 h = 5 h = 3 h = 4 h = 5 SR 3 4 5 61 59 62 104 102 106 FMCI 39 45 52 39 44 52 96 102 113 If we consider now degenerate patterns instead of simple words (see table 5), FMCI approach clearly outperforms the SR one. Nevertheless, as considering degenerated patterns roughly multiply their observed number of occurrences by the alphabet size for each inde-termination, the corresponding computational time grows in the same manner which usually limits the use of high degenerated patterns in practical cases. Table 5 Computational time (in seconds) to get the statistics of degenerate patterns (the dot means "any letter") occurring 100 times in an order m = 1 Markovian sequence of length n = 9719 which parameters are estimated on the HIV complete genome sequence using either simple recurrences of finite Markov chain imbedding. pattern atgca at.ca at..a a...a SR 0.60 8.20 2438.43 105 209.12 FMCI 0.51 1.15 3.01 91.80 Another interesting point is the memory requirements of the two approaches. Exact computations using SR have a O(n + α × k2m) memory complexity where n is the sequence length, k the alphabet size, m the Markov model order and α which depends on the convergence rate of the model towards its stationary distribution. As a consequence, SR is difficult to use in practice with m > 3 for DNA words or m > 1 for protein ones. For FMCI computations, the memory requirements remain very cheap and in practice, any Markov model that fit in memory can be considered. What about the reliability of the two methods. Once the pattern DFA has been computed, the FMCI algorithms are very simple to implement and have a high numerical stability. On the other hand, SR algorithms are quite more complicated (especially for degenerated patterns) to implement and require to approximate the iterate power of the Markov transition by the stationary distribution for large iterates. Classical convergence issues could result then to some numerical instability when high Markov orders are considered. As a consequence, FMCI results are taken as references from this point. In table 6 we can see that for p-values larger than 10-300 the results given by both methods are exactly the same when we consider order 0 Markov models. As smaller p-values are not well managed by C double precision computation (the exact limit depends on the system), we get wrong results unless log computations are used. Such computations have been implemented for FMCI algorithms (they are quite simple) but not for SR ones (where it is quite more complicated) which explain the differences for patterns at and tcgatc. Table 6 Reliability of pattern statistics. They are computed in respect with an order m Markovian sequence of length n = 9719 which parameters are estimated on the HIV complete genome. Relative error uses FMCI statistics as reference. pattern order observed expected FMCI SR relative error acta 0 106 48.63 +12.208 +12.208 0.0 acta 1 106 47.01 +13.090 +13.079 8.7 × 10-4 acta 0 26 48.63 -3.567 -3.567 0.0 acta 1 26 47.01 -3.231 -3.230 4.8 × 10-4 acta 0 6 48.63 -13.856 -13.850 3.7 × 10-4 acta 1 6 47.01 -13.237 -13.237 3.0 × 10-5 at 0 50 759.48 -291.610 -291.610 0.0 at 0 25 759.48 -327.214 -318.192 2.8 × 10-2 at 0 0 759.48 -377.009 -319.607 1.5 × 10-1 tcgatc 0 185 1.37 +294.997 +294.997 0.0 tcgatc 0 195 1.37 +314.388 +314.388 5.7 × 10-8 tcgatc 0 205 1.37 +333.931 na na acacaa 2 10 6.66 +0.865 +0.855 1.1 × 10-2 acacaa 2 20 6.66 +4.669 +4.520 3.2 × 10-2 acacaa 2 60 6.66 +35.751 +33.532 6.2 × 10-2 acacaa 2 100 6.66 +79.736 +73.451 7.8 × 10-2 Table 7 tag\vertex a b ab aba abaa abab a a a aba abaa a* aba* b ab b b abab ab b When we consider order m > 0 Markov models, the numerical approximations done on the iterate power of the transition matrix lead to some errors. For order 1 Markov model, these errors remain quite small, but when order 2 model are considered it is more sensitive. In both cases, the larger the statistic to compute is, the greater the errors made are. 5 Conclusion We proposed in this paper two general algorithms allowing to compute quickly and in a stable numerical way any p-value that can be imbedded in a finite Markov chain. We used these algorithms in two applications: local score on one sequence and pattern statistics. For local score, the resulting algorithms reduce dramatically the complexity of previously proposed naive ones allowing for the very first time to produce exact computations for practical biological studies (Kyte-Doolittle hydrophobic scale on the Swissprot database). Comparing the results to the classical and very popular Karlin's approximations, it appears that these approximations require long sequences (length greater than 2 000) which can dramatically reduce their range of applicability (only 0.5% of the data in our example). Of course, the exact computations require more time than the approximations, but are nevertheless fast enough (20 p-value per second in our example) to be used in most practicable cases. As a consequence we strongly advise to replace asymptotic approximations by these new exact ones whenever it is possible. Concerning pattern statistics, the new FMCI algorithms appear to outperform the existing ones ([17]) in all possible ways: far easier to implement, more numerical stability, less memory requirements, as fast as SR for simple words (except in the M0 case, but this is due to a poor implementation of this particular case in FMCI approach) and dramatically faster (up to 1 000 times and more) for degenerated patterns. Even if the SR algorithms remain available in the SPatt package, FMCI ones are now used by default for exact computations. Appendix A: power of a sub-stochastic matrix Proposition 14. If P is an order L irreducible sub-stochastic matrix admitting a row-eigenvector basis (e1,...,eL) where each ej, is associated to the eigenvalue λj and |λ1| ≥ |λ2| ≥...≥ |λL| then we have where λ = |λ1| = λ1, P∞ = and ∀j Proof. For any vector we have As P is an irreducible, Perron-Frobénius theorem (see [3] for example) assure that λ is real and the sub-stochastic property, that λ ≤ 1 (in the particular case where P is also primitive i.e. ∃m, Pm > 0 then |λ2| <λ) Replacing x by Iℓ equation (51) gives the expression of the row ℓ of Pi and the proposition is then proved.     □ Corollary 15. If P is an order ≥ 2 irreducible sub-stochastic matrix admitting a diagonal form then there exists a matrix P∞ such as Pi - λiP∞ = O(νi)     (52) uniformly in i and where 1 ≥ λ ≥ ν are the two largest eigenvalues magnitudes. In the special case where P is primitive, then λ > ν and P∞ = limi→+∞ Pi/λi is a positive matrix. Proof. Using proposition 14 we get and the corollary is proved.     □ Appendix B: proof of theorem 9 Using propositions 7 and 14 we get the result for i = 0. Let us prove the theorem by the principle of recurrence. We assume now that the result is true until rank i - 1. Computing for all β ≥ (i + 1)α we get For all 1 ≤ j ≤ iα we have β - j ≥ β - iα ≥ α so we get For all iα + 1 ≤ j ≤ β - α we have j - 1 ≥ iα and β - j ≥ β - iα ≥ α and so, with the help of lemma 8 we get thanks to the recurrence assumption, it is easy to see than B contribute with polynomial terms of degree i (as we have a sum on O(β) terms). For all β - α + 1 ≤ j ≤ β we have j - 1 ≥ iα so we get C contributing with polynomial terms of degree i - 1 (as we have a sum on α terms) Summing up all terms we get the result at rank i and the theorem is proved. Acknowledgements I would like to warmly thank Sabine Mercier for her kind help and her enthusiastic feedback during the writing of this article.

Document structure show

Annnotations

blinded