Choice of a Markov model order
Through the computation of σ we can measure the sensitivity of pattern statistics to parameter estimations. A very natural question is then, how this variability could affect a pattern statistic study, and, as this variability grows with the Markov model order, how to choose this parameter.
We propose here to consider the case of a very simple pattern study: we want to find the 100 most over-represented octamers (DNA words of size 8) in a given genome. Assuming the true parameter π (and hence μ) is known, we can compute REF = {W1,..., W100}, the list of these words (ordered by decreasing statistics, so that the most over-represented one is the first one).
For each estimates μ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF8oqBgaqcaaaa@2E79@ and π^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcaaaa@2E80@, we can compute REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ the 100 most over-represented octamers in the genome using the statistic S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ and compare it to the truth. In order to do so, we first compute the true positive rate (TP rate) defined by the rate of common words in REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ and REF, and the rank accordance rate (RA rate) defined by the Kendall's tau [[15], Chapter 13] between S and S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ ranks of {REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ ∪ REF}. Such statistic is in the range [-1,1] and has the value 1 for the complete rank accordance and the value -1 for the complete rank discordance.
As in the section "practical case", we consider two genomes: Escherichia coli K12 (ℓ = n = 4639675) and Mycoplasma genitalium (ℓ = n = 580076). For each Markov model order m from 1 to 6, we estimate π on the sequence (by maximum of likelihood), compute the REF list and then draw a sample of REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ from which we get estimates for the expectation of TP and RA rates.
Results are given in tables 4 and 5. We can see that, surprisingly, the TP rate could be very low even for long genome such as E. coli when high order Markov model (m = 6) are used. Of course, these rates are even worse on M. genitalium whose genome is ten times smaller than the first one. It is also clear that the RA rate is more affected by the variability induced by parameter estimation than the TP rate.
Table 4  Mean true positive rate and rank accordance rate in Escherichia coli K12.
Markov order  1  2  3  4  5  6
TP rate  99.0%  98.0%  97.9%  94.4%  82.1%  47.6%
RA rate  99.0%  95.5%  91.5%  83.9%  68.0%  36.5%
× 103  383.33  95.83  23.96  5.99  1.50  0.37
Both quantities are estimated with 1 000 simulations. We consider the 1 00 most over-represented octamers, the sequence length is ℓ = 4639675. The last row gives the sample size per free parameter (length n of the sequence divided by the number km(k - 1) of parameters).
Table 5  Mean true positive rate and rank accordance rate in Mycoplasma genitalium.
Markov order  1  2  3  4  5  6
TP rate  95.5%  93.6%  90.4%  81.8%  66.0%  25.0%
RA rate  92.6%  85.4%  79.8%  66.5%  45.1%  11.0%
× 103  48.33  12.08  3.02  0.76  0.19  0.05
Both quantities are estimated with 1 000 simulations. We consider the 1 00 most over-represented octamers, the sequence length is ℓ = 580076. The last row gives the sample size per free parameter (length n of the sequence divided by the number km(k - 1) of parameters). Based on these results, we conclude that our pattern study requires a sample size per free parameter of at least a few thousands if we want reliable results. In our examples this has for consequence that the Markov order should not be greater than 4 (or 5 at the very most) for E. coli and 3 (or 4 at the very most) on M. genitalium without resulting in important errors.