Simple case

Let us start with a simple case: a binary alphabet $\mathcal{A} = \{a, b\}$ (k = 2) with an order m = 1 Markov model

$$\pi = \begin{pmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \end{pmatrix} \tag{33}$$

whose stationary distribution is μ = (6/13, 7/13), and we work on a sequence of length n = 10 000. The first thing to do is to compute E and C (see appendix A for details). Now, we consider the pattern W = ababa occurring $N_{\text{obs}} = 1221$ times in a sequence of length ℓ = n = 10 000. We have

$$p = \mu(a)\,\pi(a,b)^2\,\pi(b,a)^2 = 8.142 \times 10^{-2} \tag{34}$$

so $\mathbb{E}[N(ababa)] = (\ell - 4)\,p = 813.8 \simeq 0.66 \times N_{\text{obs}}$ and hence the pattern is over-represented.
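The quantities above are easy to check numerically. The following is a minimal sketch, assuming the letters a and b are encoded as indices 0 and 1; the function names are illustrative and do not come from the paper.

```python
# Transition matrix of equation (33); index 0 stands for letter a, index 1 for letter b.
PI = [[0.3, 0.7],
      [0.6, 0.4]]


def stationary_two_state(pi):
    """Stationary distribution of a two-state chain, using the closed form."""
    total = pi[0][1] + pi[1][0]
    return [pi[1][0] / total, pi[0][1] / total]


def word_probability(word, pi, mu):
    """Probability that `word` occurs at a given position in the stationary regime."""
    prob = mu[word[0]]
    for prev, nxt in zip(word, word[1:]):
        prob *= pi[prev][nxt]
    return prob


mu = stationary_two_state(PI)
W = [0, 1, 0, 1, 0]                  # the pattern ababa
ell = 10_000
p = word_probability(W, PI, mu)      # equation (34)
expected = (ell - len(W) + 1) * p    # (ell - 4) * p

print(mu)        # roughly [0.4615, 0.5385], i.e. (6/13, 7/13)
print(p)         # roughly 0.08142
print(expected)  # roughly 813.8
```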
Its statistic (using binomial approximation) is

$$S \simeq -\log_{10} \mathbb{P}\left(\mathcal{B}(\ell - 5 + 1, p) \geq N_{\text{obs}}\right) = 43.74285 \tag{35}$$

We have

$$Q^{+} = \frac{p^{N_{\text{obs}} - 1}\,(1 - p)^{\ell - 4 - N_{\text{obs}}}}{\ln(10)\,\beta(p, N_{\text{obs}}, \ell - 3 - N_{\text{obs}})} = 193.3258 \tag{36}$$

and

$${}^t\mathbf{G}_0 = \left[\, \frac{-1}{\mathbf{E}_0(a)} \quad \frac{-2}{\mathbf{E}_0(b)} \,\right] = \left[\, -2.17 \times 10^{-5} \quad -3.71 \times 10^{-5} \,\right] \tag{37}$$

and

$${}^t\mathbf{G}_1 = \left[\, 0 \quad \frac{2}{\mathbf{E}_1(ab)} \quad \frac{2}{\mathbf{E}_1(ba)} \quad 0 \,\right] = \left[\, 0 \quad 6.19 \times 10^{-5} \quad 6.19 \times 10^{-5} \quad 0 \,\right] \tag{38}$$

Finally, we get

$$\sigma = Q^{+} \sqrt{{}^t\mathbf{G} \times \mathbf{C} \times \mathbf{G}} = 6.1020774 \tag{39}$$
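The binomial statistic of equation (35) involves a tail probability of about $10^{-44}$, so it is safer to evaluate it in log space. The sketch below is one possible way to do so with the Python standard library only; the function name and the log-sum-exp summation are choices made for this illustration, not the method used in the paper.

```python
import math


def pattern_statistic(p, n_obs, ell, pattern_len=5):
    """Return -log10 P(B(ell - pattern_len + 1, p) >= n_obs), computed in log space."""
    trials = ell - pattern_len + 1
    log_p, log_q = math.log(p), math.log1p(-p)
    # Log of each binomial probability mass from n_obs up to the number of trials.
    log_terms = [
        math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
        + k * log_p + (trials - k) * log_q
        for k in range(n_obs, trials + 1)
    ]
    # Log-sum-exp: factor out the largest term so the sum does not underflow.
    top = max(log_terms)
    log_tail = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_tail / math.log(10)


p = (6 / 13) * 0.7 ** 2 * 0.6 ** 2          # equation (34)
S = pattern_statistic(p, n_obs=1221, ell=10_000)
print(round(S, 2))                           # should land close to the 43.74 of equation (35)
```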
As our pattern statistic is the decimal logarithm of the p-value, σ = 6 means that the ratio of the estimated p-value over the true one could easily range from $10^{-12}$ ($10^{-2\sigma}$) to $10^{12}$ ($10^{2\sigma}$), that is, a deviation of two standard deviations on the statistic, which is huge. We can see in fig. 1 the empirical distribution of $\hat{S}$ compared to the theoretical distribution. Even if the two distributions are close, a goodness-of-fit test (Kolmogorov-Smirnov) shows that they are different.

Figure 1 Empirical and theoretical distributions of $\hat{S}$. A sample of size 10 000 has been used to get the empirical distribution. The solid line represents the density of $\mathcal{N}(S, \sigma^2)$. The Kolmogorov-Smirnov goodness-of-fit test gives D = 0.023, which corresponds to a p-value of $p = 5.3 \times 10^{-5}$. $N_{\text{obs}} = 1221$ and n = ℓ = 10 000.

In fig. 2 we compare σ to its estimator $\hat{\sigma}$ for several values of $N_{\text{obs}}$. We can see that our theoretical values of σ fit the empirical ones very well.

Figure 2 Comparison of σ and $\hat{\sigma}$. $\hat{\sigma}$ is estimated with a sample of size 1 000 and $N_{\text{obs}}$ takes its values from 900 to 1 900. The solid line represents the theoretical values and the circles the empirical ones. The statistic S is used on the x-axis. n = ℓ = 10 000.

Equation (39) gives an explicit expression of σ as a product of two terms. Once the pattern and the true parameter π are fixed, the first term ($Q^{+}$) depends only on ℓ and $N_{\text{obs}}$ while the second one depends only on the length n of the sequence used for the parameter estimation (see appendix C for an explicit expression of σ in the particular case of an order 0 Markov model). To study the variations of σ(n) as a function of n we therefore need to study $\mathbf{G}(n)$ and $\mathbf{C}(n)$.
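The empirical distributions of figures 1 and 2 rest on repeating the parameter estimation many times. The sketch below shows one way such a sample of $\hat{S}$ could be generated. It assumes a plug-in estimator in which the transition probabilities are estimated by count ratios and the stationary distribution is derived from them; this may differ from the estimator used in the paper. It also uses only 20 replicates to keep the run short, and all function names are illustrative.

```python
import math
import random

PI = [[0.3, 0.7], [0.6, 0.4]]   # equation (33); state 0 is a, state 1 is b


def transition_counts(n, pi, rng):
    """Simulate a stationary two-state chain of length n and count its transitions."""
    mu_a = pi[1][0] / (pi[0][1] + pi[1][0])
    state = 0 if rng.random() < mu_a else 1
    counts = [[0, 0], [0, 0]]
    for _ in range(n - 1):
        nxt = 0 if rng.random() < pi[state][0] else 1
        counts[state][nxt] += 1
        state = nxt
    return counts


def plug_in_p(counts):
    """Estimate p = mu(a) pi(a,b)^2 pi(b,a)^2 from transition counts (assumed estimator)."""
    p_ab = counts[0][1] / (counts[0][0] + counts[0][1])
    p_ba = counts[1][0] / (counts[1][0] + counts[1][1])
    mu_a = p_ba / (p_ab + p_ba)
    return mu_a * p_ab ** 2 * p_ba ** 2


def statistic(p, n_obs, ell, pattern_len=5):
    """-log10 of the binomial upper tail of equation (35), summed in log space."""
    trials = ell - pattern_len + 1
    log_p, log_q = math.log(p), math.log1p(-p)
    logs = [
        math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
        + k * log_p + (trials - k) * log_q
        for k in range(n_obs, trials + 1)
    ]
    top = max(logs)
    return -(top + math.log(sum(math.exp(v - top) for v in logs))) / math.log(10)


rng = random.Random(0)
n, ell, n_obs = 10_000, 10_000, 1221
s_hat = [statistic(plug_in_p(transition_counts(n, PI, rng)), n_obs, ell)
         for _ in range(20)]

mean = sum(s_hat) / len(s_hat)
sd = math.sqrt(sum((s - mean) ** 2 for s in s_hat) / (len(s_hat) - 1))
print(mean, sd)   # sd is a rough empirical counterpart of sigma in equation (39)
```

With so few replicates the empirical standard deviation is only a coarse estimate; a larger sample, as in the figures, would be needed for a meaningful comparison with the theoretical value.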
Using equations (6) and (22) we get that

$$\mathbf{E}(n) = O(n) \quad \text{and} \quad \mathbf{G}(n) = O\left(\frac{1}{n}\right) \tag{40}$$

Using equations (57) and (58) in appendix A we also get that

$$\mathbf{C} = \mathbf{M} + \mathbf{O} + {}^t\mathbf{E}\mathbf{E} \quad \text{with} \quad \mathbf{M}(n) = O(n^2) \ \text{and} \ \mathbf{O}(n) = O(n) \tag{41}$$

so finally

$$\sigma(n) \simeq \tilde{\sigma}(n) = Q^{+} \times \sqrt{A + \frac{B}{n}} \tag{42}$$

for large n, with

$$A = \lim_{n \to +\infty} {}^t\mathbf{G}\,(\mathbf{C} - \mathbf{O})\,\mathbf{G} \tag{43}$$

and

$$B = \lim_{n \to +\infty} n \times {}^t\mathbf{G}\,\mathbf{O}\,\mathbf{G} \tag{44}$$
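One way to see where (42) comes from, sketched here under the assumption that the limits in (43) and (44) exist, is to split the quadratic form of (39) by separating the matrix O from the rest of C:

```latex
\begin{align*}
\sigma(n)^2
  &= (Q^{+})^2 \; {}^t\mathbf{G}\,\mathbf{C}\,\mathbf{G} \\
  &= (Q^{+})^2 \left[ \, {}^t\mathbf{G}\,(\mathbf{C} - \mathbf{O})\,\mathbf{G}
     \; + \; {}^t\mathbf{G}\,\mathbf{O}\,\mathbf{G} \, \right] \\
  &\simeq (Q^{+})^2 \left[ A + \frac{B}{n} \right]
  \qquad \text{for large } n .
\end{align*}
```

By (40) and (41), the first bracketed term is of order $n^2 \times n^{-2}$ and therefore tends to the constant A, while the second is of order $n \times n^{-2} = n^{-1}$, so that n times this term tends to B. Taking the square root gives (42).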
We can see in fig. 3 that $\tilde{\sigma}$ is not a very good approximation of σ for small n but, as the approximation is far easier to compute (and trivial to invert) than the true value, it can be useful when we need to compute the minimum length n required to obtain a given σ.

Figure 3 Comparison of σ(n) and $\tilde{\sigma}(n)$. The circles represent σ(n) and the solid line $\tilde{\sigma}(n)$. $n_{\infty} = 10^{6}$ has been used to compute the values of A and B. $N_{\text{obs}} = 1221$ and ℓ = 10 000.

We also see in the same figure that σ grows rapidly when n decreases. For example, we get σ ≃ 20 for n = 5000 (while equation (35) gives S ≃ 264.4).
As we consider here a binary alphabet (k = 2) and a first-order Markov model (m = 1), we have only $k^m(k - 1) = 2$ parameters to estimate with a sample of size n = 5000 (so we have 2500 observations per parameter). Although this situation seems quite comfortable, the sensitivity to parameter estimation turns out to be so large that we could have a factor $10^{40}$ ($10^{2\sigma}$ with σ ≃ 20) between the true p-value and its estimate.
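The inversion of approximation (42) mentioned earlier can be sketched as follows. Solving $\tilde{\sigma}(n) = \sigma$ for n gives $n = B / ((\sigma / Q^{+})^2 - A)$, which is meaningful only when the target σ exceeds the asymptotic floor $Q^{+}\sqrt{A}$. The function names are illustrative, B is assumed to be positive, and the numerical values in the usage lines are arbitrary placeholders, not values from the paper.

```python
import math


def sigma_tilde(n, q_plus, a, b):
    """Approximation (42): Q+ * sqrt(A + B / n)."""
    return q_plus * math.sqrt(a + b / n)


def min_length_for_sigma(sigma_target, q_plus, a, b):
    """Smallest n such that sigma_tilde(n) <= sigma_target (assuming b > 0).

    Returns math.inf when the target lies at or below the floor Q+ * sqrt(A),
    which no finite n can reach under this approximation.
    """
    margin = (sigma_target / q_plus) ** 2 - a
    if margin <= 0:
        return math.inf
    return b / margin


# Placeholder values chosen only to exercise the functions (hypothetical, not from the paper).
q_plus, a, b = 2.0, 1.0, 100.0
target = sigma_tilde(5000, q_plus, a, b)
print(min_length_for_sigma(target, q_plus, a, b))   # recovers a value close to 5000
print(min_length_for_sigma(1.0, q_plus, a, b))      # inf: target below the floor Q+ * sqrt(A)
```

Since (42) is only accurate for large n, a length obtained this way is best treated as a first guess to be checked against the exact σ(n).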