4 Application 2: pattern statistics

In this part we consider the application of FMCI to pattern statistics. After a short introduction of the notations (section 4.1), we explain with an example in section 4.2 how to build, through the tool of DFA, a particular FMCI related to a given pattern. The block structure of this FMCI (section 4.3) is then used in section 4.4 to obtain two efficient algorithms, one for under-represented and one for over-represented patterns. We derive in section 4.5 some asymptotic developments but, unlike in the local score application, these results are not used to improve our algorithms. In the last section (4.6) we finally compare this new method to existing ones.

4.1 Definition

Let us consider a random order m homogeneous Markov sequence X = X1,...,Xn on a finite alphabet 𝒜 of cardinal k. Let Ni be the random variable counting the number of occurrences (overlapping or renewal) of a given pattern in X1...Xi. We define the pattern statistic associated to any number Nobs ∈ ℕ of observations by

S(Nobs) = −log10 ℙ(Nn ≥ Nobs)  if 𝔼(Nn) ≤ Nobs (over-represented pattern),
S(Nobs) = +log10 ℙ(Nn ≤ Nobs)  otherwise (under-represented pattern).

This way, a pattern has a positive statistic if it is seen more often than expected, a negative statistic if it is seen less often than expected and, in both cases, the corresponding p-value is given (in log scale) by the magnitude of the statistic. The problem is then: how to compute this statistic?

4.2 DFA

We first need to construct a Deterministic Finite state Automaton (DFA) able to count the occurrences of our pattern. It is a finite oriented graph such that every vertex has exactly k outgoing arcs, each one tagged with a different letter of 𝒜. One or more arcs are marked as counting arcs. By processing a sequence X through the DFA, we get a sequence Y (of vertexes) in which the words of length 2 corresponding to the counting transitions occur exactly where a pattern occurrence ends in X.

Example: let us consider the pattern aba.a (the dot means "any letter") on the binary alphabet 𝒜 = {a, b}. We define the vertex set 𝒱 = {a, b, ab, aba, abaa, abab}, and the structure of the DFA counting the overlapping occurrences of the pattern (the vertex set and the structure would be slightly different in the renewal case) is then given in table 7, where the counting arcs are denoted by a star.

Consider a sequence X of length n = 20 in which the pattern occurrences end in positions 9, 11 and 18. Processing this sequence through the DFA gives a vertex sequence Y of the same length as X, in which the occurrences of the pattern end exactly in the same positions.

If X is a homogeneous order one Markov chain, so is Y, and its transition matrix is given by P + Q, where P contains the non-counting transitions and Q the counting ones. Denoting by π(y|x) the transition probability from letter x to letter y in X and ordering the vertexes as in 𝒱, table 7 gives

P = ( π(a|a)  0       π(b|a)  0       0       0
      π(a|b)  π(b|b)  0       0       0       0
      0       π(b|b)  0       π(a|b)  0       0
      0       0       0       0       π(a|a)  π(b|a)
      0       0       π(b|a)  0       0       0
      0       π(b|b)  0       0       0       0      )

and Q is the matrix whose only non-zero entries are the two counting transitions Q(abaa, a) = π(a|a) and Q(abab, aba) = π(a|b).

It is therefore possible to work on Y rather than on X to compute the pattern statistics. In order to do that, it is very natural to use large deviations (in this case, the computations are closely related to the largest eigenvalue of the matrix Tθ = P + Q e^θ), but other methods can be used as well (binomial or compound Poisson approximations for example).

This method easily extends to the case where X is an order m > 1 Markov chain by modifying our vertex set accordingly. For example, if we consider an order m = 2 Markov model, our vertex set becomes 𝒱 = {aa, ab, ba, bb, aba, abaa, abab}. A small sketch of the whole construction is given below.
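As an illustration of this construction, here is a minimal Python sketch (not taken from the paper) that encodes the DFA of table 7, builds P and Q from it for illustrative order 1 transition probabilities π, and counts the overlapping occurrences of aba.a by processing a sequence through the automaton:

# Minimal sketch (not from the paper): the DFA of table 7 for the pattern
# aba.a on {a, b}, the matrices P and Q it induces for an order 1 Markov
# chain X, and occurrence counting by processing a sequence through the DFA.
import numpy as np

vertexes = ["a", "b", "ab", "aba", "abaa", "abab"]
V = {v: i for i, v in enumerate(vertexes)}

# delta[(vertex, letter)] = (next vertex, is_counting), copied from table 7
# (a star in the table marks a counting arc).
delta = {
    ("a", "a"): ("a", False),      ("a", "b"): ("ab", False),
    ("b", "a"): ("a", False),      ("b", "b"): ("b", False),
    ("ab", "a"): ("aba", False),   ("ab", "b"): ("b", False),
    ("aba", "a"): ("abaa", False), ("aba", "b"): ("abab", False),
    ("abaa", "a"): ("a", True),    ("abaa", "b"): ("ab", False),
    ("abab", "a"): ("aba", True),  ("abab", "b"): ("b", False),
}

# Illustrative transition probabilities pi(x -> y), keyed by (x, y); these
# are made-up values, not estimated from any real sequence.
pi = {("a", "a"): 0.3, ("a", "b"): 0.7,
      ("b", "a"): 0.6, ("b", "b"): 0.4}

L = len(vertexes)
P, Q = np.zeros((L, L)), np.zeros((L, L))
for (v, y), (w, counting) in delta.items():
    x = v[-1]  # the current letter of X is the last letter of the vertex
    (Q if counting else P)[V[v], V[w]] = pi[(x, y)]

assert np.allclose((P + Q).sum(axis=1), 1.0)  # P + Q is stochastic

def count_occurrences(seq):
    """Overlapping occurrences of aba.a, counted by the DFA."""
    state, count = seq[0], 0  # the first letter is itself a vertex here
    for letter in seq[1:]:
        state, counting = delta[(state, letter)]
        count += counting
    return count

print(count_occurrences("ababaabaaa"))  # 2 (occurrences end at positions 5 and 10)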
In all cases, we denote by L the cardinal of 𝒱. In order to count the overlapping occurrences of a non-degenerate pattern of length h on a size k alphabet, we get L = k + h − 2 when an order 1 Markov model is considered and L = k^m + h − m − 1 for an order m > 1 Markov model. For a degenerate pattern of length h, L is more difficult to know as it depends on the degeneracy of the pattern; in the worst case L = k^(h−1), but L should be far smaller in most cases. One should note that L increases by the number of different words present in the pattern if we consider renewal occurrences instead of overlapping ones.

Although the construction and properties of DFA are well known in the theory of languages and automata ([8]), their connection to pattern statistics has surprisingly not been studied extensively in the literature. In particular, the strong relation presented here between the FMCI technique for patterns and DFA appears never to have been highlighted before. Although this interesting subject obviously needs to (and soon will) be investigated more deeply, it is not really the purpose of this article, which focuses on the algorithmic treatment of a given FMCI.

4.3 FMCI

Once a DFA and the corresponding matrices P and Q have been built, it is easy to get a FMCI allowing to compute the p-values we are looking for. Let us consider, for all 1 ≤ j ≤ n,

Z_j = (Y_j, N_j) if N_j < a,  and  Z_j = f otherwise,

where Yj is the sequence of vertexes, Nj is the number of pattern occurrences in the sequence Y1...Yj (or X1...Xj, as it is the same), f is the final (absorbing) state and a ∈ ℕ is the observed number of occurrences Nobs if the pattern is over-represented and Nobs + 1 if it is under-represented. The transition matrix of the Markov chain Z is then given by

Π = ( R  v
      0  1 )

where, for all size L blocks 0 ≤ i, j ≤ a − 1, we have R_{i,j} = P if j = i, R_{i,j} = Q if j = i + 1 and R_{i,j} = 0 otherwise, and where v = (0, ..., 0, ΣQ)' with ΣQ the column vector resulting from the sum of Q (ΣQ = Q(1,...,1)').

By plugging the structure of R and v into corollaries 2 and 3 we get the following recurrences:

Proposition 6. For all n ≥ 1 we have

ℙ(Nn ≤ Nobs) = μ u_{n−1}^{(a−1)}  and  ℙ(Nn ≥ Nobs) = μ Σ_{β=0}^{n−2} v_β^{(a−1)}

(with a = Nobs + 1 in the first case, a = Nobs in the second one, and μ the distribution of Y1), where for x = u or v we have, for all β ≥ 0, the following size L block decomposition:

x_β = (x_β^{(a−1)}, ..., x_β^{(0)})'

and we have the recurrence relations:

x_β^{(0)} = P x_{β−1}^{(0)}  and  x_β^{(j)} = P x_{β−1}^{(j)} + Q x_{β−1}^{(j−1)} for all j ≥ 1,   (27)

with u_0 = (1...1)' and v_0 = v.

4.4 Algorithms

Using proposition 6 it is possible to derive an algorithm computing our pattern statistic for an under-represented pattern observed Nobs times:

algorithm 4under: exact statistic for an under-represented pattern
x_0,...,x_Nobs and y_0,...,y_Nobs are 2 × (Nobs + 1) real column vectors of size L
initialization: for j = 0...Nobs do x_j = (1,...,1)'
main loop: for i = 1...(n − 1) do
    • for j = 0...Nobs do y_j = x_j
    • x_0 = P × y_0
    • for j = 1...Nobs do x_j = P × y_j + Q × y_{j−1}
end
    • q = μ × x_Nobs
    • return log10(q)

If we consider now an over-represented pattern we get

algorithm 4over: exact statistic for an over-represented pattern
x_1,...,x_Nobs, y_1,...,y_Nobs and z are 2Nobs + 1 real column vectors of size L
initialization: z = (0,...,0)', x_1 = ΣQ and for j = 2...Nobs do x_j = (0,...,0)'
main loop: for i = 1...(n − 2) do
    • for j = 1...Nobs do y_j = x_j
    • x_1 = P × y_1
    • for j = 2...Nobs do x_j = P × y_j + Q × y_{j−1}
    • z = z + x_Nobs
end
    • p = μ × z
    • return −log10(p)

As we have O(k × L) non-zero terms in P + Q, the complexity of both of these algorithms is O(k × L + Nobs × L) in memory and O(k × L × n × Nobs) in time. To compute p-values out of the floating point range (e.g. smaller than 10^−300 with C doubles), it is necessary to use log computations in the algorithms (not detailed here). The resulting complexity stays the same but the empirical running time is obviously longer. That is why we advise using log computations only when necessary (for example by considering first a rough approximation). A sketch of both algorithms is given below.
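To make these recurrences concrete, here is a minimal numpy sketch of both algorithms. It is not the SPatt implementation: in particular the final products q = μ·x_Nobs and p = μ·z reflect the reconstruction of proposition 6 given above, and no log-scale safeguard is included. P, Q and the initial distribution mu of Y1 can be taken from the previous sketch.

import numpy as np

def stat_under(P, Q, mu, n, n_obs):
    """Algorithm 4under (sketch): log10 P(N_n <= n_obs)."""
    L = P.shape[0]
    x = [np.ones(L) for _ in range(n_obs + 1)]          # u_0 = (1,...,1)'
    for _ in range(n - 1):
        y = [xj.copy() for xj in x]
        x[0] = P @ y[0]
        for j in range(1, n_obs + 1):
            x[j] = P @ y[j] + Q @ y[j - 1]
    return np.log10(mu @ x[n_obs])                      # q = mu . x_Nobs

def stat_over(P, Q, mu, n, n_obs):
    """Algorithm 4over (sketch): -log10 P(N_n >= n_obs), n_obs >= 1."""
    L = P.shape[0]
    x = [Q.sum(axis=1)] + [np.zeros(L) for _ in range(n_obs - 1)]
    z = x[-1].copy()       # beta = 0 term (non-zero only when n_obs = 1)
    for _ in range(n - 2):
        y = [xj.copy() for xj in x]
        x[0] = P @ y[0]
        for j in range(1, n_obs):
            x[j] = P @ y[j] + Q @ y[j - 1]
        z += x[-1]
    return -np.log10(mu @ z)                            # p = mu . z

As a quick sanity check, 10**stat_under(P, Q, mu, n, c) + 10**(-stat_over(P, Q, mu, n, c + 1)) should equal 1 for any c, since ℙ(Nn ≤ c) + ℙ(Nn ≥ c + 1) = 1; here mu is the distribution of Y1 (for the aba.a example, supported on the one-letter vertexes a and b).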
4.5 Asymptotic developments

In this part we propose to derive asymptotic developments for the pattern p-values from their recursive expressions. For under- (resp. over-) represented patterns, the main result is given in theorem 9 (resp. 12). In both cases, these results are also presented in a simpler form (where only the main terms are taken into account) in the corresponding corollaries.

Proposition 7. For any x = (x^{(a−1)},...,x^{(0)})' and all β ≥ 0, x_β = R^β x is given by x_β^{(0)} = P^β x^{(0)} and, for all i ≥ 1,

x_β^{(i)} = P^β x^{(i)} + Σ_{δ=0}^{β−1} P^δ Q x_{β−1−δ}^{(i−1)}.   (28)

Proof. As x_0^{(j)} = x^{(j)} for all j ≥ 0, it is trivial to get the expression of x_β^{(0)}. If we suppose now that the relation (28) is true for some i and all β then, thanks to the recurrence relation (27), unrolling x_β^{(i+1)} = P x_{β−1}^{(i+1)} + Q x_{β−1}^{(i)} over β gives the relation at rank i + 1, and so the proposition is proved through the principle of recurrence.     □

Lemma 8. For all i ≥ 0, all a ≤ b ∈ ℕ and all r > 0, we define a family of geometric-type partial sums for which the lemma provides, for all i ≥ 0, a closed form when r ≠ 1 and another one in the case r = 1.

Proof. Easily derived from a simple relation between the sums of consecutive orders.     □

Theorem 9. If P is primitive and admits a diagonal form, we denote by λ > ν the two largest eigenvalue magnitudes of P and by P∞ = lim_{i→+∞} P^i/λ^i (a positive matrix); then, for all α ≥ 1 and i ≥ 0 and uniformly in β, u_β^{(i)} admits an asymptotic development whose leading term is driven by the powers of λ and the matrix P∞, and whose coefficient is a polynomial of degree i in β, defined explicitly for i = 0 and, for all i ≥ 1, by a recurrence relation.

Proof. See appendix B.     □

Corollary 10. With the same assumptions as in theorem 9, for all α ≥ 1 and β ≥ (i+1)α, the development of theorem 9 simplifies when only the main term is kept.

Proof. The development of theorem 9 combined with lemma 8 gives the main term, and the result is then proved by a simple recurrence.     □

Proposition 11. For any x = (x^{(a−1)},...,x^{(0)})' and all β ≥ 0, the blocks of the cumulated sum Σ_{δ=0}^{β} R^δ x admit a closed-form expression analogous to the one of proposition 7, the block of order 0 being Σ_{δ=0}^{β} P^δ x^{(0)}.

Proof. Summing equation (28) over β directly gives the proposition.     □

From this result (very similar to proposition 7) it is possible to get a new theorem.

Theorem 12. If P is primitive and admits a diagonal form, we denote by λ > ν the two largest eigenvalue magnitudes of P and by P∞ = lim_{i→+∞} P^i/λ^i (a positive matrix); then, for all α ≥ 1 and i ≥ 0 and uniformly in β, the cumulated blocks admit an asymptotic development whose coefficients involve a constant term, defined explicitly for i = 0 and for all i ≥ 0 by a recurrence relation, and a polynomial of degree i, defined as in theorem 9 by an explicit base case and a recurrence relation for i ≥ 1.

Proof. Easy to derive from the proof of theorem 9.     □

Corollary 13. Under the same assumptions as in theorem 12, for all α ≥ 1 and β ≥ (i + 1)α, the development of theorem 12 simplifies when only the main term is kept.

Proof. Easy to derive from the proof of corollary 10.     □
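Although the precise developments are given in the theorems above, their ingredients λ, ν and P∞ = lim_{i→+∞} P^i/λ^i are easy to compute numerically. Here is a small numpy sketch (an illustration, not part of the paper) that extracts them from a primitive, diagonalizable P and checks that P^i/λ^i converges to P∞ at rate (ν/λ)^i:

import numpy as np

def perron_data(P):
    """lambda, nu (two largest eigenvalue magnitudes of P) and
    P_inf = lim P^i / lambda^i, from the right and left Perron eigenvectors."""
    vals, right = np.linalg.eig(P)
    i = int(np.argmax(np.abs(vals)))
    lam = vals[i].real                        # Perron eigenvalue (real > 0)
    nu = np.sort(np.abs(vals))[-2]            # second largest magnitude
    r = right[:, i].real
    vals_t, left = np.linalg.eig(P.T)
    j = int(np.argmax(np.abs(vals_t)))
    l = left[:, j].real
    return lam, nu, np.outer(r, l) / (l @ r)  # rank-one limit matrix

# Convergence check, with P taken from the DFA sketch above:
# lam, nu, P_inf = perron_data(P)
# for i in (5, 10, 20):
#     print(i, np.abs(np.linalg.matrix_power(P, i) / lam**i - P_inf).max())
# the printed error decays roughly like (nu/lam)**i.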
4.6 Numerical results

We propose here to consider numerical applications of these new FMCI pattern statistics algorithms and to compare their results and performance to exact computations using Simple Recurrences ([17] and [18]), denoted SR from now on. All computations are performed with the SPatt-1.2.0 package (see [14] or [15]) on a 2.7 GHz P4 with 512 MB of RAM running Linux 2.6.8.

As we can see in table 4, the computational times needed to obtain the pattern statistics of all simple words of a given length are quite similar with both approaches, with one exception: the Bernoulli case (m = 0), where SR is roughly 10 times faster than the FMCI computations. This is due to the fact that the Bernoulli case uses simpler recurrences than the ones used for the true Markovian cases (m ≥ 1). Similar simplifications in the DFA structures could reduce the computational time of the FMCI approach in the independent case, but they have not been implemented here (as their use is often marginal).

Table 4 Computational time (in seconds) to get the statistics of all DNA words of length h in the HIV complete genome sequence (n = 9719), using either simple recurrences (SR) or finite Markov chain imbedding (FMCI), with respect to an order m Markov model estimated on the genome.

                m = 0                 m = 1                 m = 2
        h = 3  h = 4  h = 5   h = 3  h = 4  h = 5   h = 3  h = 4  h = 5
SR          3      4      5      61     59     62     104    102    106
FMCI       39     45     52      39     44     52      96    102    113

If we now consider degenerate patterns instead of simple words (see table 5), the FMCI approach clearly outperforms the SR one. Nevertheless, as degenerating a pattern roughly multiplies its observed number of occurrences by the alphabet size for each indetermination, the corresponding computational time grows in the same manner, which usually limits the use of highly degenerate patterns in practical cases.

Table 5 Computational time (in seconds) to get the statistics of degenerate patterns (the dot means "any letter") occurring 100 times in an order m = 1 Markovian sequence of length n = 9719 whose parameters are estimated on the HIV complete genome sequence, using either simple recurrences (SR) or finite Markov chain imbedding (FMCI).

pattern    atgca   at.ca    at..a      a...a
SR          0.60    8.20  2438.43  105209.12
FMCI        0.51    1.15     3.01      91.80

Another interesting point is the memory requirement of the two approaches. Exact computations using SR have a O(n + α × k^{2m}) memory complexity, where n is the sequence length, k the alphabet size, m the Markov model order and α a factor depending on the convergence rate of the model towards its stationary distribution. As a consequence, SR is difficult to use in practice with m > 3 for DNA words or m > 1 for protein ones. For the FMCI computations, the memory requirements remain very low and, in practice, any Markov model that fits in memory can be considered.

What about the reliability of the two methods? Once the pattern DFA has been computed, the FMCI algorithms are very simple to implement and have a high numerical stability. On the other hand, the SR algorithms are quite more complicated to implement (especially for degenerate patterns) and require approximating the iterated powers of the Markov transition matrix by the stationary distribution for large iterates. Classical convergence issues can then result in numerical instability when high Markov orders are considered. As a consequence, the FMCI results are taken as references from this point on. In table 6 we can see that, for p-values larger than 10^−300, the results given by both methods are exactly the same when order 0 Markov models are considered. As smaller p-values are not well managed by C double precision computations (the exact limit depends on the system), we get wrong results unless log computations are used. Such computations have been implemented for the FMCI algorithms (where they are quite simple) but not for the SR ones (where they are quite more complicated), which explains the differences for the patterns at and tcgatc.
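The log computations mentioned above can be obtained with a classical scaling trick: renormalize the block vectors at each iteration and accumulate the logarithm of the normalization factors. The sketch below (one reasonable implementation under that assumption, not the actual SPatt code) applies it to algorithm 4under:

import numpy as np

def stat_under_log(P, Q, mu, n, n_obs):
    """log10 P(N_n <= n_obs), underflow-safe version of algorithm 4under:
    all blocks share a common scale whose log10 is accumulated."""
    L = P.shape[0]
    x = [np.ones(L) for _ in range(n_obs + 1)]
    log10_q = 0.0
    for _ in range(n - 1):
        y = [xj.copy() for xj in x]
        x[0] = P @ y[0]
        for j in range(1, n_obs + 1):
            x[j] = P @ y[j] + Q @ y[j - 1]
        s = max(xj.max() for xj in x)   # common normalization factor
        for j in range(n_obs + 1):
            x[j] /= s
        log10_q += np.log10(s)
    return log10_q + np.log10(mu @ x[n_obs])

The over-represented case needs slightly more care, since the accumulator z mixes iterates computed under different scales; one way to handle it is to keep z in the current scale and rescale it together with the blocks at each iteration.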
Table 6 Reliability of pattern statistics, computed with respect to an order m Markovian sequence of length n = 9719 whose parameters are estimated on the HIV complete genome. The relative error uses the FMCI statistics as reference.

pattern   order  observed  expected      FMCI        SR   relative error
acta          0       106     48.63   +12.208   +12.208              0.0
acta          1       106     47.01   +13.090   +13.079       8.7 × 10^-4
acta          0        26     48.63    -3.567    -3.567              0.0
acta          1        26     47.01    -3.231    -3.230       4.8 × 10^-4
acta          0         6     48.63   -13.856   -13.850       3.7 × 10^-4
acta          1         6     47.01   -13.237   -13.237       3.0 × 10^-5
at            0        50    759.48  -291.610  -291.610              0.0
at            0        25    759.48  -327.214  -318.192       2.8 × 10^-2
at            0         0    759.48  -377.009  -319.607       1.5 × 10^-1
tcgatc        0       185      1.37  +294.997  +294.997              0.0
tcgatc        0       195      1.37  +314.388  +314.388       5.7 × 10^-8
tcgatc        0       205      1.37  +333.931        na               na
acacaa        2        10      6.66    +0.865    +0.855       1.1 × 10^-2
acacaa        2        20      6.66    +4.669    +4.520       3.2 × 10^-2
acacaa        2        60      6.66   +35.751   +33.532       6.2 × 10^-2
acacaa        2       100      6.66   +79.736   +73.451       7.8 × 10^-2

Table 7 Structure of the DFA counting the overlapping occurrences of the pattern aba.a on the alphabet {a, b} (see section 4.2); the counting arcs are denoted by a star.

tag\vertex      a     b     ab     aba    abaa   abab
a               a     a     aba    abaa   a*     aba*
b               ab    b     b      abab   ab     b

When we consider order m > 0 Markov models, the numerical approximations made on the iterated powers of the transition matrix lead to some errors in the SR results. For order 1 Markov models these errors remain quite small, but when order 2 models are considered the effect is more noticeable. In both cases, the larger the statistic to compute, the greater the errors made.