Results and discussion

Validation

Simple case
Let us start with a simple case: a binary alphabet A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ = {a, b} (k = 2) with an order m = 1 Markov model
π = ( 0.3   0.7    0.6   0.4      )        ( 33  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFapaCcqGH9aqpdaqadaqaauaabeqaciaaaeaacqaIWaamcqGGUaGlcqaIZaWmaeaacqaIWaamcqGGUaGlcqaI3aWnaeaacqaIWaamcqGGUaGlcqaI2aGnaeaacqaIWaamcqGGUaGlcqaI0aanaaaacaGLOaGaayzkaaGaaCzcaiaaxMaadaqadaqaaiabiodaZiabiodaZaGaayjkaiaawMcaaaaa@40EC@
which stationary distribution is μ = (6/13,7/13) and we work on a sequence of length n = 10 000.
The first thing to do is to compute E and C (see appendix A for details).
Now, we consider the pattern W = ababa occurring Nobs = 1221 times in a sequence of length ℓ = n = 10 000. We have
p = μ(a) Π (a,b)2 Π (b,a)2 = 8.142 × 10-2     (34)
so E MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab=ri8fbaa@388C@[N(ababa)] = (ℓ - 4)p = 813.8 ≃ 0.66 × Nobs and hence the pattern is over-represented. Its statistic (using binomial approximation) is
S ≃ − log ⁡  10   ℙ ( ℬ ( ℓ − 5 + 1 , p ) ≥ N o b s   ) = 43.74285       ( 35 )   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaacqWGtbWucqWIdjYocqGHsislcyGGSbaBcqGGVbWBcqGGNbWzdaWgaaWcbaGaeGymaeJaeGimaadabeaatuuDJXwAK1uy0HMmaeXbfv3ySLgzG0uy0HgiuD3BaGqbaOGae8xgHaLaeiikaGccdaGae4hlHiKaeiikaGIaeS4eHWMaeyOeI0IaeGynauJaey4kaSIaeGymaeJaeiilaWIaemiCaaNaeiykaKIaeyyzImRaemOta40aaSbaaSqaaGqaaiab99gaVjab9jgaIjab9nhaZbqabaGccqGGPaqkcqGH9aqpcqaI0aancqaIZaWmcqGGUaGlcqaI3aWncqaI0aancqaIYaGmcqaI4aaocqaI1aqncaWLjaGaaCzcaiabcIcaOiabiodaZiabiwda1iabcMcaPaaa@6B0F@
We have
Q +  = p N obs   − 1   ( 1 − p )  ℓ − 4 − N obs      ln ⁡ ( 10 ) β ( p , N obs   , ℓ − 3 − N obs   )   = 193.3258       ( 36  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaahaaWcbeqaaiabgUcaRaaakiabg2da9maalaaabaGaemiCaa3aaWbaaSqabeaacqWGobGtdaWgaaadbaGaee4Ba8MaeeOyaiMaee4CamhabeaaliabgkHiTiabigdaXaaakiabcIcaOiabigdaXiabgkHiTiabdchaWjabcMcaPmaaCaaaleqabaGaeS4eHWMaeyOeI0IaeGinaqJaeyOeI0IaemOta40aaSbaaWqaaiabb+gaVjabbkgaIjabbohaZbqabaaaaaGcbaGagiiBaWMaeiOBa4MaeiikaGIaeGymaeJaeGimaaJaeiykaKccciGae8NSdiMaeiikaGIaemiCaaNaeiilaWIaemOta40aaSbaaSqaaiabb+gaVjabbkgaIjabbohaZbqabaGccqGGSaalcqWItecBcqGHsislcqaIZaWmcqGHsislcqWGobGtdaWgaaWcbaGaee4Ba8MaeeOyaiMaee4CamhabeaakiabcMcaPaaacqGH9aqpcqaIXaqmcqaI5aqocqaIZaWmcqGGUaGlcqaIZaWmcqaIYaGmcqaI1aqncqaI4aaocaWLjaGaaCzcamaabmaabaGaeG4mamJaeGOnaydacaGLOaGaayzkaaaaaa@70C9@
and
  t  G 0  = [ − 1  E 0  ( a )     − 2  E 0  ( b )        ]  = [ − 2.17 × 10  − 5     − 3.71 × 10  − 5        ]        ( 37  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGGaaidaahaaWcbeqaaiabdsha0baaieqakiab=DeahnaaBaaaleaacqaIWaamaeqaaOGaeyypa0ZaamWaaeaafaqabeqacaaabaWaaSaaaeaacqGHsislcqaIXaqmaeaacqWFfbqrdaWgaaWcbaGaeGimaadabeaakiabcIcaOiabbggaHjabcMcaPaaaaeaadaWcaaqaaiabgkHiTiabikdaYaqaaiab=veafnaaBaaaleaacqaIWaamaeqaaOGaeiikaGIaeeOyaiMaeiykaKcaaaaaaiaawUfacaGLDbaacqGH9aqpdaWadaqaauaabeqabiaaaeaacqGHsislcqaIYaGmcqGGUaGlcqaIXaqmcqaI3aWncqGHxdaTcqaIXaqmcqaIWaamdaahaaWcbeqaaiabgkHiTiabiwda1aaaaOqaaiabgkHiTiabiodaZiabc6caUiabiEda3iabigdaXiabgEna0kabigdaXiabicdaWmaaCaaaleqabaGaeyOeI0IaeGynaudaaaaaaOGaay5waiaaw2faaiaaxMaacaWLjaWaaeWaaeaacqaIZaWmcqaI3aWnaiaawIcacaGLPaaaaaa@5FDF@
and
  t  G 1  = [ 0  2 E 1  ( ab )     2 E 1  ( ba )     0     ]  = [ 0  6.19 × 10  − 5     6.19 × 10  − 5     0     ]        ( 38  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGGaaidaahaaWcbeqaaiabdsha0baaieqakiab=DeahnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0ZaamWaaeaafaqabeqaeaaaaeaacqaIWaamaeaadaWcaaqaaiabikdaYaqaaiab=veafnaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaeeyyaeMaeeOyaiMaeiykaKcaaaqaamaalaaabaGaeGOmaidabaGae8xrau0aaSbaaSqaaiabigdaXaqabaGccqGGOaakcqqGIbGycqqGHbqycqGGPaqkaaaabaGaeGimaadaaaGaay5waiaaw2faaiabg2da9maadmaabaqbaeqabeabaaaabaGaeGimaadabaGaeGOnayJaeiOla4IaeGymaeJaeGyoaKJaey41aqRaeGymaeJaeGimaaZaaWbaaSqabeaacqGHsislcqaI1aqnaaaakeaacqaI2aGncqGGUaGlcqaIXaqmcqaI5aqocqGHxdaTcqaIXaqmcqaIWaamdaahaaWcbeqaaiabgkHiTiabiwda1aaaaOqaaiabicdaWaaaaiaawUfacaGLDbaacaWLjaGaaCzcamaabmaabaGaeG4mamJaeGioaGdacaGLOaGaayzkaaaaaa@629F@
Finally, we get
σ = Q +    t  G × C × G   = 6.1020774       ( 39  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCcqGH9aqpcqWGrbqudaahaaWcbeqaaiabgUcaRaaakmaakaaabaGaeeiiaaYaaWbaaSqabeaaieGacqGF0baDaaacbeGccqqFhbWrcqGHxdaTcqqFdbWqcqGHxdaTcqqFhbWraSqabaGccqGH9aqpcqaI2aGncqGGUaGlcqaIXaqmcqaIWaamcqaIYaGmcqaIWaamcqaI3aWncqaI3aWncqaI0aancaWLjaGaaCzcamaabmaabaGaeG4mamJaeGyoaKdacaGLOaGaayzkaaaaaa@4A0E@
As our pattern statistics is the decimal logarithm of the p-value, σ = 6 means that the ratio of the estimated p-value over the true one could easily range from 10-12 (10-2 × σ) to 1012 (102 × σ) which is huge.
We can see on fig. 1 the empirical distribution of SN compared to the theoretical distribution. Even if the two distributions are closely related, an adjustment test (Kolmogorov-Smirnov) shows that they are different.
Figure 1  Empirical and theoretical distributions of S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@. A sample of size 10 000 have been used to get the empirical distribution. The solid line represents the density of N MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFneVtaaa@383B@(S, σ2). The adjustment test of Kolmogorov-Smirnov give D = 0.023 which corresponds to a p-value of p = 5.3 × 10-5. Nobs = 1221 and n = ℓ = 10 000. In the fig. 2 we compare σ to its estimator σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ for several values of Nobs. We can see that our theoretical values of σ fits very well to the empirical ones.
Figure 2  Comparison of σ and σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@. σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ is estimated with a sample of size 1 000 and Nobs takes its values from 900 to 1 900. The solid line represents the theoretical values and the circles the empirical ones. The statistic S is used on the x-axis. n = ℓ = 10 000. The equation (39) gives an explicit expression of σ as a product of two terms. Once the pattern and the true parameter π are fixed, the first term (Q) depends only on ℓ and Nobs while the second one only depends on the length n of the sequence used for the parameter estimation (see appendix C for an explicit expression of σ in the particular case of an order 0 Markov model).
To study the variations of σ(n) as a function of n we therefore need to study G(n) and C(n). Using equations (6) and (22) we get that
E ( n ) = O ( n )   and   G ( n ) = O ( 1 n   )            ( 40  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqadaaabaacbeGae8xrauKaeiikaGIaemOBa4MaeiykaKIaeyypa0Jaem4ta8KaeiikaGIaemOBa4MaeiykaKcabaGaeeyyaeMaeeOBa4MaeeizaqgabaGae83raCKaeiikaGIaemOBa4MaeiykaKIaeyypa0Jaem4ta80aaeWaaeaadaWcaaqaaiabigdaXaqaaiabd6gaUbaaaiaawIcacaGLPaaaaaGaaCzcaiaaxMaadaqadaqaaiabisda0iabicdaWaGaayjkaiaawMcaaaaa@4920@
Using equations (57) and (58) in appendix A we also get that C = M + O + t EE with
M(n) = O(n2) and O(n) = O(n)     (41)
so finally
σ ( n ) ≃ σ ˜  ( n ) = Q +  × A + B n          ( 42  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCcqGGOaakcqWGUbGBcqGGPaqkcqWIdjYocuWFdpWCgaacaiabcIcaOiabd6gaUjabcMcaPiabg2da9iabdgfarnaaCaaaleqabaGaey4kaScaaOGaey41aq7aaOaaaeaacqWGbbqqcqGHRaWkdaWcaaqaaiabdkeacbqaaiabd6gaUbaaaSqabaGccaWLjaGaaCzcamaabmaabaGaeGinaqJaeGOmaidacaGLOaGaayzkaaaaaa@464C@
for large n, with
A = lim ⁡  n → + ∞     t  G ( C − O ) G       ( 43  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGbbqqcqGH9aqpdaWfqaqaaiGbcYgaSjabcMgaPjabc2gaTbWcbaGaemOBa4MaeyOKH4Qaey4kaSIaeyOhIukabeaakiabbccaGmaaCaaaleqabaGaemiDaqhaaGqabOGae83raCKaeiikaGIae83qamKaeyOeI0Iae83ta8KaeiykaKIae83raCKaaCzcaiaaxMaadaqadaqaaiabisda0iabiodaZaGaayjkaiaawMcaaaaa@46E6@
and
B = lim ⁡  n → + ∞   n ×   t  G O G       ( 44  )    MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGcbGqcqGH9aqpdaWfqaqaaiGbcYgaSjabcMgaPjabc2gaTbWcbaGaemOBa4MaeyOKH4Qaey4kaSIaeyOhIukabeaakiabd6gaUjabgEna0kabbccaGmaaCaaaleqabaGaemiDaqhaaGqabOGae83raCKae83ta8Kae83raCKaaCzcaiaaxMaadaqadaqaaiabisda0iabisda0aGaayjkaiaawMcaaaaa@46BC@
We can see on fig. 3 that σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@ is not a very good approximation of σ for small n, but, as the approximation is far easier to compute (and trivial to invert) than the true value, this can be useful when we need to compute a minimum length n to obtain a given σ.
Figure 3  Comparison of σ(n) and σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@(n). The circles reprensent σ(n) and the solid line σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@(n). n∞ = 106 have been used to compute the value of A and B. Nobs = 1221 and ℓ = 10 000. We also see on the same figure that σ grows rapidly when n decreases. For example, we get σ ≃ 20 for n = 5000 (while equation (35) gives S ≃ 264.4).
As we consider here a binary alphabet (k = 2) and a first order Markov model (m = 1) we have only km(k - 1) = 2 parameters to estimate with a sample of size n = 5000 (so we have 2500 sample per parameter). Although this situation seems quite comfortable, the sensitivity to parameter estimation appears in fact to be so large that we could have a factor 1040 between the true p-value and its estimate.

Practical case
We have seen with our first example that our approximation works very well in a simple case. Will this hold with more practical cases?
To answer this question, let us consider the following experimental design:
• one pattern: W = acgtacgt;
• two genomes: Escherichia coli K12 (ℓ = n = 4639675) and Mycoplasma genitalium (ℓ = n = 580076);
• five Markov orders: m = 1 to m = 5 (larger m are not considered since the computation of C becomes then intractable).
As the sequence lengths and compositions of the two considered genomes differ a lot, we have to take a different value of Nobs for each organism: Nobs = 30 for M. genitalium and Nobs = 150 for E. coli. Proceeding as indicated in section "simulations", we use the algorithm 1 for each experiment.
Table 1  Comparison of theoretical and empirical pattern statistic mean and standard deviation on Escherichia coli K12.
m   S   σ  S ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@  σ ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@
1  35.57  0.28  35.57  0.27
2  31.61  0.49  31.60  0.50
3  46.75  1.04  46.77  1.03
4  45.33  1.74  45.32  1.81
5  62.27  3.45  62.36  3.34
We consider the pattern W = acgtacgt with Nobs = 150. The sequence length is ℓ = 4639675, we use an order m Markov model and a sample of size M = 1 000.
Table 2  Comparison of theoretical and empirical pattern statistic mean and standard deviation on Mycoplasma genitalium.
m   S   σ  S ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@  σ ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@
1  42.48  0.38  42.47  0.40
2  44.62  0.78  44.62  0.81
3  55.96  1.49  56.02  1.52
4  55.06  3.39  55.48  3.48
5  56.49  10.35  57.21*  9.09*
We consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580 076, we use an order m Markov model and a sample of size M = 1 000. (*) for 123 terms in the sample we got P (N) = 0 and hence, SN was not computed. Algorithm 1 simulations for one experiment in the practical case
1: estimate the order m parameter π (and μ) from the original sequence. Although these parameters are estimated, they are considered as the true parameters;
2: compute S = -log10 ℙ(N ≥ Nobs);
3: compute σ using approximation (23)
4: for j = 1 ... 1 000 do
5: draw a random sequence Y = Y1 ... Yn according to and order m stationary Markov model of parameter π;
6: compute N the frequency vector of all size m and size m + 1 words in Y;
7: compute Sj = SN = -log10 ℙ(N ≥ Nobs);
8: end for
9: compute S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ (resp. σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@) the mean (resp. standard deviation) of the sample S1,..., Sj.
We can see on table 1 the results for E. coli. For each Markov model considered, our approximation of σ is very close to the empiric ones and, as with figure 1, the Gaussian distribution fit well to the empiric one (data not shown). Table 2 shows the same behaviour with M. genitalium except for m = 5 where σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ differs slightly more than in the other cases from its theoretical value. To understand this phenomenon, let us first recall the expression of P(N) for m = 5 using equation (15):
P ( N ) = N 1  ( agctac ) × N 1  ( gctacg ) × N 1  ( ctacgt )  ( ℓ − m + 1 ) × N 0  ( gctac ) × N 0  ( ctacg )     MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakieqacqWFobGtcqGGPaqkcqGH9aqpdaWcaaqaaiab=5eaonaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaeeyyaeMaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabigdaXaqabaGccqGGOaakcqqGNbWzcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkcqGHxdaTcqWFobGtdaWgaaWcbaGaeGymaedabeaakiabcIcaOiabbogaJjabbsha0jabbggaHjabbogaJjabbEgaNjabbsha0jabcMcaPaqaaiabcIcaOiabloriSjabgkHiTiabd2gaTjabgUcaRiabigdaXiabcMcaPiabgEna0kab=5eaonaaBaaaleaacqaIWaamaeqaaOGaeiikaGIaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabicdaWaqabaGccqGGOaakcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkaaaaaa@7A52@
and as ℙ(N1 (agctac) = 0) ≃ 2.26 × 10-6, ℙ(N1 (gctacg) = 0) ≃ 1.35 × 10-1 and ℙ(N1 (ctacgt) = 0) ≃ 1.24 × 10-4 we will have P(N) = 0 roughly 14% of the time. This happened 123 times in our sample of size 1 000, each time preventing to compute SN. The sample is hence biased and S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ and σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ are therefore not accurate.
What happen now if we use another statistical method to compute the pattern statistics. As the binomial approximation is supposed to be close to the exact solution, we expect the standard deviation obtained with other statistical methods to remain close to σ. In table 3, we compare the empirical results using binomial approximations (like above) but also compound Poisson or large deviations approximations. Both empirical means and standard deviations are close to the theoretical ones thus validating the method.
Table 3  Comparison of theoretical and empirical pattern statistics mean and deviation on Mycoplasma genitalium.
theoretical  binomial  compound Poisson  large deviations
S   σ  S ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@  σ ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@  S ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@  σ ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@  S ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@  σ ^   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@
55.96  1.49  56.05  1.47  55.42  1.45  54.27  1.43
We consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580076, we use an order m = 3 Markov model and a sample of size M = 1 000. The pattern statistics are computed (from left to right) through binomial, compound Poisson or large deviations approximations.

Choice of a Markov model order
Through the computation of σ we can measure the sensitivity of pattern statistics to parameter estimations. A very natural question is then, how this variability could affect a pattern statistic study, and, as this variability grows with the Markov model order, how to choose this parameter.
We propose here to consider the case of a very simple pattern study: we want to find the 100 most over-represented octamers (DNA words of size 8) in a given genome. Assuming the true parameter π (and hence μ) is known, we can compute REF = {W1,..., W100}, the list of these words (ordered by decreasing statistics, so that the most over-represented one is the first one).
For each estimates μ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF8oqBgaqcaaaa@2E79@ and π^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcaaaa@2E80@, we can compute REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ the 100 most over-represented octamers in the genome using the statistic S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ and compare it to the truth. In order to do so, we first compute the true positive rate (TP rate) defined by the rate of common words in REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ and REF, and the rank accordance rate (RA rate) defined by the Kendall's tau [[15], Chapter 13] between S and S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ ranks of {REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ ∪ REF}. Such statistic is in the range [-1,1] and has the value 1 for the complete rank accordance and the value -1 for the complete rank discordance.
As in the section "practical case", we consider two genomes: Escherichia coli K12 (ℓ = n = 4639675) and Mycoplasma genitalium (ℓ = n = 580076). For each Markov model order m from 1 to 6, we estimate π on the sequence (by maximum of likelihood), compute the REF list and then draw a sample of REF_ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqiaaqaaiabbkfasjabbweafjabbAeagbGaayPadaaaaa@30BD@ from which we get estimates for the expectation of TP and RA rates.
Results are given in tables 4 and 5. We can see that, surprisingly, the TP rate could be very low even for long genome such as E. coli when high order Markov model (m = 6) are used. Of course, these rates are even worse on M. genitalium whose genome is ten times smaller than the first one. It is also clear that the RA rate is more affected by the variability induced by parameter estimation than the TP rate.
Table 4  Mean true positive rate and rank accordance rate in Escherichia coli K12.
Markov order  1  2  3  4  5  6
TP rate  99.0%  98.0%  97.9%  94.4%  82.1%  47.6%
RA rate  99.0%  95.5%  91.5%  83.9%  68.0%  36.5%
× 103  383.33  95.83  23.96  5.99  1.50  0.37
Both quantities are estimated with 1 000 simulations. We consider the 1 00 most over-represented octamers, the sequence length is ℓ = 4639675. The last row gives the sample size per free parameter (length n of the sequence divided by the number km(k - 1) of parameters).
Table 5  Mean true positive rate and rank accordance rate in Mycoplasma genitalium.
Markov order  1  2  3  4  5  6
TP rate  95.5%  93.6%  90.4%  81.8%  66.0%  25.0%
RA rate  92.6%  85.4%  79.8%  66.5%  45.1%  11.0%
× 103  48.33  12.08  3.02  0.76  0.19  0.05
Both quantities are estimated with 1 000 simulations. We consider the 1 00 most over-represented octamers, the sequence length is ℓ = 580076. The last row gives the sample size per free parameter (length n of the sequence divided by the number km(k - 1) of parameters). Based on these results, we conclude that our pattern study requires a sample size per free parameter of at least a few thousands if we want reliable results. In our examples this has for consequence that the Markov order should not be greater than 4 (or 5 at the very most) for E. coli and 3 (or 4 at the very most) on M. genitalium without resulting in important errors.