4.6 Numerical results
We propose here to consider numerical applications of these new FMCI pattern statistics algorithms and to compare their results and performance to exact computations using Simple Recurrences ([17] and [18]) denoted SR from now.
All computations are performed using SPatt-1.2.0 package (see [14] or [15]) on a 2.7 Gz P4 with 512 Mo running the linux 2.6.8 system.
As we can see on table 4, the computational time to obtain the pattern statistics for all simple words of a given length are quite similar with both approaches. One exception: the Bernoulli case (m = 0) where SR are roughly 10 times faster than FMCI computations. This is due to the fact that the Bernoulli case uses simpler recurrences than the ones used for the true Markovian cases (m ≥ 1). Similar simplifications in the DFA structures can reduce the computational time of FMCI approach in the independent case but they have not been implemented here (as their use is often marginal).
Table 4  Computational time (in seconds) to get the statistics of all DNA words of length h in the HIV complete genome sequence (n = 9719) using either simple recurrences of finite Markov chain imbedding and in respect with an order m Markov model estimated on the genome.
Markov order word length  m = 0  m = 1  m = 2
h = 3  h = 4  h = 5  h = 3  h = 4  h = 5  h = 3  h = 4  h = 5
SR  3  4  5  61  59  62  104  102  106
FMCI  39  45  52  39  44  52  96  102  113 If we consider now degenerate patterns instead of simple words (see table 5), FMCI approach clearly outperforms the SR one. Nevertheless, as considering degenerated patterns roughly multiply their observed number of occurrences by the alphabet size for each inde-termination, the corresponding computational time grows in the same manner which usually limits the use of high degenerated patterns in practical cases.
Table 5  Computational time (in seconds) to get the statistics of degenerate patterns (the dot means "any letter") occurring 100 times in an order m = 1 Markovian sequence of length n = 9719 which parameters are estimated on the HIV complete genome sequence using either simple recurrences of finite Markov chain imbedding.
pattern  atgca  at.ca  at..a  a...a
SR  0.60  8.20  2438.43  105 209.12
FMCI  0.51  1.15  3.01  91.80 Another interesting point is the memory requirements of the two approaches. Exact computations using SR have a O(n + α × k2m) memory complexity where n is the sequence length, k the alphabet size, m the Markov model order and α which depends on the convergence rate of the model towards its stationary distribution. As a consequence, SR is difficult to use in practice with m > 3 for DNA words or m > 1 for protein ones. For FMCI computations, the memory requirements remain very cheap and in practice, any Markov model that fit in memory can be considered.
What about the reliability of the two methods. Once the pattern DFA has been computed, the FMCI algorithms are very simple to implement and have a high numerical stability. On the other hand, SR algorithms are quite more complicated (especially for degenerated patterns) to implement and require to approximate the iterate power of the Markov transition by the stationary distribution for large iterates. Classical convergence issues could result then to some numerical instability when high Markov orders are considered. As a consequence, FMCI results are taken as references from this point.
In table 6 we can see that for p-values larger than 10-300 the results given by both methods are exactly the same when we consider order 0 Markov models. As smaller p-values are not well managed by C double precision computation (the exact limit depends on the system), we get wrong results unless log computations are used. Such computations have been implemented for FMCI algorithms (they are quite simple) but not for SR ones (where it is quite more complicated) which explain the differences for patterns at and tcgatc.
Table 6  Reliability of pattern statistics. They are computed in respect with an order m Markovian sequence of length n = 9719 which parameters are estimated on the HIV complete genome. Relative error uses FMCI statistics as reference.
pattern  order  observed  expected  FMCI  SR  relative error
acta  0  106  48.63  +12.208  +12.208  0.0
acta  1  106  47.01  +13.090  +13.079  8.7 × 10-4
acta  0  26  48.63  -3.567  -3.567  0.0
acta  1  26  47.01  -3.231  -3.230  4.8 × 10-4
acta  0  6  48.63  -13.856  -13.850  3.7 × 10-4
acta  1  6  47.01  -13.237  -13.237  3.0 × 10-5
at  0  50  759.48  -291.610  -291.610  0.0
at  0  25  759.48  -327.214  -318.192  2.8 × 10-2
at  0  0  759.48  -377.009  -319.607  1.5 × 10-1
tcgatc  0  185  1.37  +294.997  +294.997  0.0
tcgatc  0  195  1.37  +314.388  +314.388  5.7 × 10-8
tcgatc  0  205  1.37  +333.931  na  na
acacaa  2  10  6.66  +0.865  +0.855  1.1 × 10-2
acacaa  2  20  6.66  +4.669  +4.520  3.2 × 10-2
acacaa  2  60  6.66  +35.751  +33.532  6.2 × 10-2
acacaa  2  100  6.66  +79.736  +73.451  7.8 × 10-2
Table 7  tag\vertex  a  b  ab  aba  abaa  abab
a  a  a  aba  abaa  a*  aba*
b  ab  b  b  abab  ab  b When we consider order m > 0 Markov models, the numerical approximations done on the iterate power of the transition matrix lead to some errors. For order 1 Markov model, these errors remain quite small, but when order 2 model are considered it is more sensitive. In both cases, the larger the statistic to compute is, the greater the errors made are.