PMC:1647278 / 42959-65578
Annotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1647278","sourcedb":"PMC","sourceid":"1647278","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1647278","text":"Validation\n\nSimple case\nLet us start with a simple case: a binary alphabet A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ = {a, b} (k = 2) with an order m = 1 Markov model\nπ = ( 0.3 0.7 0.6 0.4 ) ( 33 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFapaCcqGH9aqpdaqadaqaauaabeqaciaaaeaacqaIWaamcqGGUaGlcqaIZaWmaeaacqaIWaamcqGGUaGlcqaI3aWnaeaacqaIWaamcqGGUaGlcqaI2aGnaeaacqaIWaamcqGGUaGlcqaI0aanaaaacaGLOaGaayzkaaGaaCzcaiaaxMaadaqadaqaaiabiodaZiabiodaZaGaayjkaiaawMcaaaaa@40EC@\nwhich stationary distribution is μ = (6/13,7/13) and we work on a sequence of length n = 10 000.\nThe first thing to do is to compute E and C (see appendix A for details).\nNow, we consider the pattern W = ababa occurring Nobs = 1221 times in a sequence of length ℓ = n = 10 000. We have\np = μ(a) Π (a,b)2 Π (b,a)2 = 8.142 × 10-2 (34)\nso E MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab=ri8fbaa@388C@[N(ababa)] = (ℓ - 4)p = 813.8 ≃ 0.66 × Nobs and hence the pattern is over-represented. 
Its statistic (using binomial approximation) is\nS ≃ − log 10 ℙ ( ℬ ( ℓ − 5 + 1 , p ) ≥ N o b s ) = 43.74285 ( 35 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaacqWGtbWucqWIdjYocqGHsislcyGGSbaBcqGGVbWBcqGGNbWzdaWgaaWcbaGaeGymaeJaeGimaadabeaatuuDJXwAK1uy0HMmaeXbfv3ySLgzG0uy0HgiuD3BaGqbaOGae8xgHaLaeiikaGccdaGae4hlHiKaeiikaGIaeS4eHWMaeyOeI0IaeGynauJaey4kaSIaeGymaeJaeiilaWIaemiCaaNaeiykaKIaeyyzImRaemOta40aaSbaaSqaaGqaaiab99gaVjab9jgaIjab9nhaZbqabaGccqGGPaqkcqGH9aqpcqaI0aancqaIZaWmcqGGUaGlcqaI3aWncqaI0aancqaIYaGmcqaI4aaocqaI1aqncaWLjaGaaCzcaiabcIcaOiabiodaZiabiwda1iabcMcaPaaa@6B0F@\nWe have\nQ + = p N obs − 1 ( 1 − p ) ℓ − 4 − N obs ln ( 10 ) β ( p , N obs , ℓ − 3 − N obs ) = 193.3258 ( 36 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaahaaWcbeqaaiabgUcaRaaakiabg2da9maalaaabaGaemiCaa3aaWbaaSqabeaacqWGobGtdaWgaaadbaGaee4Ba8MaeeOyaiMaee4CamhabeaaliabgkHiTiabigdaXaaakiabcIcaOiabigdaXiabgkHiTiabdchaWjabcMcaPmaaCaaaleqabaGaeS4eHWMaeyOeI0IaeGinaqJaeyOeI0IaemOta40aaSbaaWqaaiabb+gaVjabbkgaIjabbohaZbqabaaaaaGcbaGagiiBaWMaeiOBa4MaeiikaGIaeGymaeJaeGimaaJaeiykaKccciGae8NSdiMaeiikaGIaemiCaaNaeiilaWIaemOta40aaSbaaSqaaiabb+gaVjabbkgaIjabbohaZbqabaGccqGGSaalcqWItecBcqGHsislcqaIZaWmcqGHsislcqWGobGtdaWgaaWcbaGaee4Ba8MaeeOyaiMaee4CamhabeaakiabcMcaPaaacqGH9aqpcqaIXaqmcqaI5aqocqaIZaWmcqGGUaGlcqaIZaWmcqaIYaGmcqaI1aqncqaI4aaocaWLjaGaaCzcamaabmaabaGaeG4mamJaeGOnaydacaGLOaGaayzkaaaaaa@70C9@\nand\n t G 0 = [ − 1 E 0 ( a ) − 2 E 0 ( b ) ] = [ − 2.17 × 10 − 5 − 3.71 × 10 − 5 ] ( 37 ) 
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGGaaidaahaaWcbeqaaiabdsha0baaieqakiab=DeahnaaBaaaleaacqaIWaamaeqaaOGaeyypa0ZaamWaaeaafaqabeqacaaabaWaaSaaaeaacqGHsislcqaIXaqmaeaacqWFfbqrdaWgaaWcbaGaeGimaadabeaakiabcIcaOiabbggaHjabcMcaPaaaaeaadaWcaaqaaiabgkHiTiabikdaYaqaaiab=veafnaaBaaaleaacqaIWaamaeqaaOGaeiikaGIaeeOyaiMaeiykaKcaaaaaaiaawUfacaGLDbaacqGH9aqpdaWadaqaauaabeqabiaaaeaacqGHsislcqaIYaGmcqGGUaGlcqaIXaqmcqaI3aWncqGHxdaTcqaIXaqmcqaIWaamdaahaaWcbeqaaiabgkHiTiabiwda1aaaaOqaaiabgkHiTiabiodaZiabc6caUiabiEda3iabigdaXiabgEna0kabigdaXiabicdaWmaaCaaaleqabaGaeyOeI0IaeGynaudaaaaaaOGaay5waiaaw2faaiaaxMaacaWLjaWaaeWaaeaacqaIZaWmcqaI3aWnaiaawIcacaGLPaaaaaa@5FDF@\nand\n t G 1 = [ 0 2 E 1 ( ab ) 2 E 1 ( ba ) 0 ] = [ 0 6.19 × 10 − 5 6.19 × 10 − 5 0 ] ( 38 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGGaaidaahaaWcbeqaaiabdsha0baaieqakiab=DeahnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0ZaamWaaeaafaqabeqaeaaaaeaacqaIWaamaeaadaWcaaqaaiabikdaYaqaaiab=veafnaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaeeyyaeMaeeOyaiMaeiykaKcaaaqaamaalaaabaGaeGOmaidabaGae8xrau0aaSbaaSqaaiabigdaXaqabaGccqGGOaakcqqGIbGycqqGHbqycqGGPaqkaaaabaGaeGimaadaaaGaay5waiaaw2faaiabg2da9maadmaabaqbaeqabeabaaaabaGaeGimaadabaGaeGOnayJaeiOla4IaeGymaeJaeGyoaKJaey41aqRaeGymaeJaeGimaaZaaWbaaSqabeaacqGHsislcqaI1aqnaaaakeaacqaI2aGncqGGUaGlcqaIXaqmcqaI5aqocqGHxdaTcqaIXaqmcqaIWaamdaahaaWcbeqaaiabgkHiTiabiwda1aaaaOqaaiabicdaWaaaaiaawUfacaGLDbaacaWLjaGaaCzcamaabmaabaGaeG4mamJaeGioaGdacaGLOaGaayzkaaaaaa@629F@\nFinally, we get\nσ = Q + t G × C × G = 6.1020774 ( 39 ) 
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCcqGH9aqpcqWGrbqudaahaaWcbeqaaiabgUcaRaaakmaakaaabaGaeeiiaaYaaWbaaSqabeaaieGacqGF0baDaaacbeGccqqFhbWrcqGHxdaTcqqFdbWqcqGHxdaTcqqFhbWraSqabaGccqGH9aqpcqaI2aGncqGGUaGlcqaIXaqmcqaIWaamcqaIYaGmcqaIWaamcqaI3aWncqaI3aWncqaI0aancaWLjaGaaCzcamaabmaabaGaeG4mamJaeGyoaKdacaGLOaGaayzkaaaaaa@4A0E@\nAs our pattern statistics is the decimal logarithm of the p-value, σ = 6 means that the ratio of the estimated p-value over the true one could easily range from 10-12 (10-2 × σ) to 1012 (102 × σ) which is huge.\nWe can see on fig. 1 the empirical distribution of SN compared to the theoretical distribution. Even if the two distributions are closely related, an adjustment test (Kolmogorov-Smirnov) shows that they are different.\nFigure 1 Empirical and theoretical distributions of S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@. A sample of size 10 000 have been used to get the empirical distribution. The solid line represents the density of N MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFneVtaaa@383B@(S, σ2). The adjustment test of Kolmogorov-Smirnov give D = 0.023 which corresponds to a p-value of p = 5.3 × 10-5. Nobs = 1221 and n = ℓ = 10 000. In the fig. 
2 we compare σ to its estimator σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ for several values of Nobs. We can see that our theoretical values of σ fits very well to the empirical ones.\nFigure 2 Comparison of σ and σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@. σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ is estimated with a sample of size 1 000 and Nobs takes its values from 900 to 1 900. The solid line represents the theoretical values and the circles the empirical ones. The statistic S is used on the x-axis. n = ℓ = 10 000. The equation (39) gives an explicit expression of σ as a product of two terms. Once the pattern and the true parameter π are fixed, the first term (Q) depends only on ℓ and Nobs while the second one only depends on the length n of the sequence used for the parameter estimation (see appendix C for an explicit expression of σ in the particular case of an order 0 Markov model).\nTo study the variations of σ(n) as a function of n we therefore need to study G(n) and C(n). 
Using equations (6) and (22) we get that\nE ( n ) = O ( n ) and G ( n ) = O ( 1 n ) ( 40 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqadaaabaacbeGae8xrauKaeiikaGIaemOBa4MaeiykaKIaeyypa0Jaem4ta8KaeiikaGIaemOBa4MaeiykaKcabaGaeeyyaeMaeeOBa4MaeeizaqgabaGae83raCKaeiikaGIaemOBa4MaeiykaKIaeyypa0Jaem4ta80aaeWaaeaadaWcaaqaaiabigdaXaqaaiabd6gaUbaaaiaawIcacaGLPaaaaaGaaCzcaiaaxMaadaqadaqaaiabisda0iabicdaWaGaayjkaiaawMcaaaaa@4920@\nUsing equations (57) and (58) in appendix A we also get that C = M + O + t EE with\nM(n) = O(n2) and O(n) = O(n) (41)\nso finally\nσ ( n ) ≃ σ ˜ ( n ) = Q + × A + B n ( 42 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCcqGGOaakcqWGUbGBcqGGPaqkcqWIdjYocuWFdpWCgaacaiabcIcaOiabd6gaUjabcMcaPiabg2da9iabdgfarnaaCaaaleqabaGaey4kaScaaOGaey41aq7aaOaaaeaacqWGbbqqcqGHRaWkdaWcaaqaaiabdkeacbqaaiabd6gaUbaaaSqabaGccaWLjaGaaCzcamaabmaabaGaeGinaqJaeGOmaidacaGLOaGaayzkaaaaaa@464C@\nfor large n, with\nA = lim n → + ∞ t G ( C − O ) G ( 43 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGbbqqcqGH9aqpdaWfqaqaaiGbcYgaSjabcMgaPjabc2gaTbWcbaGaemOBa4MaeyOKH4Qaey4kaSIaeyOhIukabeaakiabbccaGmaaCaaaleqabaGaemiDaqhaaGqabOGae83raCKaeiikaGIae83qamKaeyOeI0Iae83ta8KaeiykaKIae83raCKaaCzcaiaaxMaadaqadaqaaiabisda0iabiodaZaGaayjkaiaawMcaaaaa@46E6@\nand\nB = lim n → + ∞ n × t G O G ( 44 ) 
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGcbGqcqGH9aqpdaWfqaqaaiGbcYgaSjabcMgaPjabc2gaTbWcbaGaemOBa4MaeyOKH4Qaey4kaSIaeyOhIukabeaakiabd6gaUjabgEna0kabbccaGmaaCaaaleqabaGaemiDaqhaaGqabOGae83raCKae83ta8Kae83raCKaaCzcaiaaxMaadaqadaqaaiabisda0iabisda0aGaayjkaiaawMcaaaaa@46BC@\nWe can see on fig. 3 that σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@ is not a very good approximation of σ for small n, but, as the approximation is far easier to compute (and trivial to invert) than the true value, this can be useful when we need to compute a minimum length n to obtain a given σ.\nFigure 3 Comparison of σ(n) and σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@(n). The circles reprensent σ(n) and the solid line σ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaacaaaa@2E85@(n). n∞ = 106 have been used to compute the value of A and B. Nobs = 1221 and ℓ = 10 000. We also see on the same figure that σ grows rapidly when n decreases. 
For example, we get σ ≃ 20 for n = 5000 (while equation (35) gives S ≃ 264.4).\nAs we consider here a binary alphabet (k = 2) and a first order Markov model (m = 1) we have only km(k - 1) = 2 parameters to estimate with a sample of size n = 5000 (so we have 2500 sample per parameter). Although this situation seems quite comfortable, the sensitivity to parameter estimation appears in fact to be so large that we could have a factor 1040 between the true p-value and its estimate.\n\nPractical case\nWe have seen with our first example that our approximation works very well in a simple case. Will this hold with more practical cases?\nTo answer this question, let us consider the following experimental design:\n• one pattern: W = acgtacgt;\n• two genomes: Escherichia coli K12 (ℓ = n = 4639675) and Mycoplasma genitalium (ℓ = n = 580076);\n• five Markov orders: m = 1 to m = 5 (larger m are not considered since the computation of C becomes then intractable).\nAs the sequence lengths and compositions of the two considered genomes differ a lot, we have to take a different value of Nobs for each organism: Nobs = 30 for M. genitalium and Nobs = 150 for E. coli. 
Proceeding as indicated in section \"simulations\", we use the algorithm 1 for each experiment.\nTable 1 Comparison of theoretical and empirical pattern statistic mean and standard deviation on Escherichia coli K12.\nm S σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n1 35.57 0.28 35.57 0.27\n2 31.61 0.49 31.60 0.50\n3 46.75 1.04 46.77 1.03\n4 45.33 1.74 45.32 1.81\n5 62.27 3.45 62.36 3.34\nWe consider the pattern W = acgtacgt with Nobs = 150. The sequence length is ℓ = 4639675, we use an order m Markov model and a sample of size M = 1 000.\nTable 2 Comparison of theoretical and empirical pattern statistic mean and standard deviation on Mycoplasma genitalium.\nm S σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n1 42.48 0.38 42.47 0.40\n2 44.62 0.78 44.62 0.81\n3 55.96 1.49 56.02 1.52\n4 55.06 3.39 55.48 3.48\n5 56.49 10.35 57.21* 9.09*\nWe consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580 076, we use an order m Markov model and a sample of size M = 1 000. 
(*) for 123 terms in the sample we got P (N) = 0 and hence, SN was not computed. Algorithm 1 simulations for one experiment in the practical case\n1: estimate the order m parameter π (and μ) from the original sequence. Although these parameters are estimated, they are considered as the true parameters;\n2: compute S = -log10 ℙ(N ≥ Nobs);\n3: compute σ using approximation (23)\n4: for j = 1 ... 1 000 do\n5: draw a random sequence Y = Y1 ... Yn according to and order m stationary Markov model of parameter π;\n6: compute N the frequency vector of all size m and size m + 1 words in Y;\n7: compute Sj = SN = -log10 ℙ(N ≥ Nobs);\n8: end for\n9: compute S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ (resp. σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@) the mean (resp. standard deviation) of the sample S1,..., Sj.\nWe can see on table 1 the results for E. coli. For each Markov model considered, our approximation of σ is very close to the empiric ones and, as with figure 1, the Gaussian distribution fit well to the empiric one (data not shown). Table 2 shows the same behaviour with M. genitalium except for m = 5 where σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ differs slightly more than in the other cases from its theoretical value. 
To understand this phenomenon, let us first recall the expression of P(N) for m = 5 using equation (15):\nP ( N ) = N 1 ( agctac ) × N 1 ( gctacg ) × N 1 ( ctacgt ) ( ℓ − m + 1 ) × N 0 ( gctac ) × N 0 ( ctacg ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakieqacqWFobGtcqGGPaqkcqGH9aqpdaWcaaqaaiab=5eaonaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaeeyyaeMaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabigdaXaqabaGccqGGOaakcqqGNbWzcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkcqGHxdaTcqWFobGtdaWgaaWcbaGaeGymaedabeaakiabcIcaOiabbogaJjabbsha0jabbggaHjabbogaJjabbEgaNjabbsha0jabcMcaPaqaaiabcIcaOiabloriSjabgkHiTiabd2gaTjabgUcaRiabigdaXiabcMcaPiabgEna0kab=5eaonaaBaaaleaacqaIWaamaeqaaOGaeiikaGIaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabicdaWaqabaGccqGGOaakcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkaaaaaa@7A52@\nand as ℙ(N1 (agctac) = 0) ≃ 2.26 × 10-6, ℙ(N1 (gctacg) = 0) ≃ 1.35 × 10-1 and ℙ(N1 (ctacgt) = 0) ≃ 1.24 × 10-4 we will have P(N) = 0 roughly 14% of the time. This happened 123 times in our sample of size 1 000, each time preventing to compute SN. 
The sample is hence biased and S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ and σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ are therefore not accurate.\nWhat happen now if we use another statistical method to compute the pattern statistics. As the binomial approximation is supposed to be close to the exact solution, we expect the standard deviation obtained with other statistical methods to remain close to σ. In table 3, we compare the empirical results using binomial approximations (like above) but also compound Poisson or large deviations approximations. 
Both empirical means and standard deviations are close to the theoretical ones thus validating the method.\nTable 3 Comparison of theoretical and empirical pattern statistics mean and deviation on Mycoplasma genitalium.\ntheoretical binomial compound Poisson large deviations\nS σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n55.96 1.49 56.05 1.47 55.42 1.45 54.27 1.43\nWe consider the pattern W = acgtacgt with Nobs = 30. 
The sequence length is ℓ = 580076, we use an order m = 3 Markov model and a sample of size M = 1 000. The pattern statistics are computed (from left to right) through binomial, compound Poisson or large deviations approximations.","divisions":[{"label":"title","span":{"begin":0,"end":10}},{"label":"sec","span":{"begin":12,"end":13414}},{"label":"title","span":{"begin":12,"end":23}},{"label":"p","span":{"begin":24,"end":428}},{"label":"p","span":{"begin":429,"end":957}},{"label":"p","span":{"begin":958,"end":1054}},{"label":"p","span":{"begin":1055,"end":1128}},{"label":"p","span":{"begin":1129,"end":1243}},{"label":"p","span":{"begin":1244,"end":1294}},{"label":"p","span":{"begin":1295,"end":1736}},{"label":"p","span":{"begin":1737,"end":2521}},{"label":"p","span":{"begin":2522,"end":2529}},{"label":"p","span":{"begin":2530,"end":3553}},{"label":"p","span":{"begin":3554,"end":3557}},{"label":"p","span":{"begin":3558,"end":4544}},{"label":"p","span":{"begin":4545,"end":4548}},{"label":"p","span":{"begin":4549,"end":5574}},{"label":"p","span":{"begin":5575,"end":5590}},{"label":"p","span":{"begin":5591,"end":6182}},{"label":"p","span":{"begin":6183,"end":6393}},{"label":"p","span":{"begin":6394,"end":6611}},{"label":"figure","span":{"begin":6612,"end":7497}},{"label":"label","span":{"begin":6612,"end":6620}},{"label":"caption","span":{"begin":6622,"end":7497}},{"label":"p","span":{"begin":6622,"end":7497}},{"label":"p","span":{"begin":7498,"end":7921}},{"label":"figure","span":{"begin":7922,"end":8718}},{"label":"label","span":{"begin":7922,"end":7930}},{"label":"caption","span":{"begin":7932,"end":8718}},{"label":"p","span":{"begin":7932,"end":8718}},{"label":"p","span":{"begin":8719,"end":9096}},{"label":"p","span":{"begin":9097,"end":9230}},{"label":"p","span":{"begin":9231,"end":9835}},{"label":"p","span":{"begin":9836,"end":9918}},{"label":"p","span":{"begin":9919,"end":9956}},{"label":"p","span":{"begin":9957,"end":9967}},{"label":"p","span":{"begin":9968,"end":
10526}},{"label":"p","span":{"begin":10527,"end":10544}},{"label":"p","span":{"begin":10545,"end":11092}},{"label":"p","span":{"begin":11093,"end":11096}},{"label":"p","span":{"begin":11097,"end":11624}},{"label":"p","span":{"begin":11625,"end":12150}},{"label":"figure","span":{"begin":12151,"end":12863}},{"label":"label","span":{"begin":12151,"end":12159}},{"label":"caption","span":{"begin":12161,"end":12863}},{"label":"p","span":{"begin":12161,"end":12863}},{"label":"p","span":{"begin":12864,"end":13012}},{"label":"p","span":{"begin":13013,"end":13414}},{"label":"title","span":{"begin":13416,"end":13430}},{"label":"p","span":{"begin":13431,"end":13565}},{"label":"p","span":{"begin":13566,"end":13641}},{"label":"p","span":{"begin":13642,"end":13670}},{"label":"p","span":{"begin":13671,"end":13768}},{"label":"p","span":{"begin":13769,"end":13888}},{"label":"p","span":{"begin":13889,"end":14184}},{"label":"table-wrap","span":{"begin":14185,"end":15151}},{"label":"label","span":{"begin":14185,"end":14192}},{"label":"caption","span":{"begin":14194,"end":14304}},{"label":"p","span":{"begin":14194,"end":14304}},{"label":"table","span":{"begin":14305,"end":14998}},{"label":"tr","span":{"begin":14305,"end":14858}},{"label":"td","span":{"begin":14305,"end":14307}},{"label":"td","span":{"begin":14308,"end":14310}},{"label":"td","span":{"begin":14312,"end":14314}},{"label":"td","span":{"begin":14316,"end":14584}},{"label":"td","span":{"begin":14586,"end":14858}},{"label":"tr","span":{"begin":14859,"end":14886}},{"label":"td","span":{"begin":14859,"end":14860}},{"label":"td","span":{"begin":14862,"end":14867}},{"label":"td","span":{"begin":14869,"end":14873}},{"label":"td","span":{"begin":14875,"end":14880}},{"label":"td","span":{"begin":14882,"end":14886}},{"label":"tr","span":{"begin":14887,"end":14914}},{"label":"td","span":{"begin":14887,"end":14888}},{"label":"td","span":{"begin":14890,"end":14895}},{"label":"td","span":{"begin":14897,"end":14901}},{"label":"td","span":{"
begin":14903,"end":14908}},{"label":"td","span":{"begin":14910,"end":14914}},{"label":"tr","span":{"begin":14915,"end":14942}},{"label":"td","span":{"begin":14915,"end":14916}},{"label":"td","span":{"begin":14918,"end":14923}},{"label":"td","span":{"begin":14925,"end":14929}},{"label":"td","span":{"begin":14931,"end":14936}},{"label":"td","span":{"begin":14938,"end":14942}},{"label":"tr","span":{"begin":14943,"end":14970}},{"label":"td","span":{"begin":14943,"end":14944}},{"label":"td","span":{"begin":14946,"end":14951}},{"label":"td","span":{"begin":14953,"end":14957}},{"label":"td","span":{"begin":14959,"end":14964}},{"label":"td","span":{"begin":14966,"end":14970}},{"label":"tr","span":{"begin":14971,"end":14998}},{"label":"td","span":{"begin":14971,"end":14972}},{"label":"td","span":{"begin":14974,"end":14979}},{"label":"td","span":{"begin":14981,"end":14985}},{"label":"td","span":{"begin":14987,"end":14992}},{"label":"td","span":{"begin":14994,"end":14998}},{"label":"table-wrap-foot","span":{"begin":14999,"end":15151}},{"label":"p","span":{"begin":14999,"end":15151}},{"label":"table-wrap","span":{"begin":15152,"end":16202}},{"label":"label","span":{"begin":15152,"end":15159}},{"label":"caption","span":{"begin":15161,"end":15272}},{"label":"p","span":{"begin":15161,"end":15272}},{"label":"table","span":{"begin":15273,"end":15969}},{"label":"tr","span":{"begin":15273,"end":15826}},{"label":"td","span":{"begin":15273,"end":15275}},{"label":"td","span":{"begin":15276,"end":15278}},{"label":"td","span":{"begin":15280,"end":15282}},{"label":"td","span":{"begin":15284,"end":15552}},{"label":"td","span":{"begin":15554,"end":15826}},{"label":"tr","span":{"begin":15827,"end":15854}},{"label":"td","span":{"begin":15827,"end":15828}},{"label":"td","span":{"begin":15830,"end":15835}},{"label":"td","span":{"begin":15837,"end":15841}},{"label":"td","span":{"begin":15843,"end":15848}},{"label":"td","span":{"begin":15850,"end":15854}},{"label":"tr","span":{"begin":15855,"end":1
5882}},{"label":"td","span":{"begin":15855,"end":15856}},{"label":"td","span":{"begin":15858,"end":15863}},{"label":"td","span":{"begin":15865,"end":15869}},{"label":"td","span":{"begin":15871,"end":15876}},{"label":"td","span":{"begin":15878,"end":15882}},{"label":"tr","span":{"begin":15883,"end":15910}},{"label":"td","span":{"begin":15883,"end":15884}},{"label":"td","span":{"begin":15886,"end":15891}},{"label":"td","span":{"begin":15893,"end":15897}},{"label":"td","span":{"begin":15899,"end":15904}},{"label":"td","span":{"begin":15906,"end":15910}},{"label":"tr","span":{"begin":15911,"end":15938}},{"label":"td","span":{"begin":15911,"end":15912}},{"label":"td","span":{"begin":15914,"end":15919}},{"label":"td","span":{"begin":15921,"end":15925}},{"label":"td","span":{"begin":15927,"end":15932}},{"label":"td","span":{"begin":15934,"end":15938}},{"label":"tr","span":{"begin":15939,"end":15969}},{"label":"td","span":{"begin":15939,"end":15940}},{"label":"td","span":{"begin":15942,"end":15947}},{"label":"td","span":{"begin":15949,"end":15954}},{"label":"td","span":{"begin":15956,"end":15962}},{"label":"td","span":{"begin":15964,"end":15969}},{"label":"table-wrap-foot","span":{"begin":15970,"end":16202}},{"label":"p","span":{"begin":15970,"end":16202}},{"label":"p","span":{"begin":16203,"end":16267}},{"label":"p","span":{"begin":16268,"end":16424}},{"label":"p","span":{"begin":16425,"end":16459}},{"label":"p","span":{"begin":16460,"end":16497}},{"label":"p","span":{"begin":16498,"end":16523}},{"label":"p","span":{"begin":16524,"end":16628}},{"label":"p","span":{"begin":16629,"end":16703}},{"label":"p","span":{"begin":16704,"end":16744}},{"label":"p","span":{"begin":16745,"end":16755}},{"label":"p","span":{"begin":16756,"end":17372}},{"label":"p","span":{"begin":17373,"end":18129}},{"label":"p","span":{"begin":18130,"end":19113}},{"label":"p","span":{"begin":19114,"end":19959}},{"label":"p","span":{"begin":19960,"end":20476}},{"label":"label","span":{"begin":20477,"end":
20484}},{"label":"caption","span":{"begin":20486,"end":20589}},{"label":"p","span":{"begin":20486,"end":20589}},{"label":"table","span":{"begin":20590,"end":22336}},{"label":"tr","span":{"begin":20590,"end":20647}},{"label":"td","span":{"begin":20590,"end":20601}},{"label":"td","span":{"begin":20603,"end":20611}},{"label":"td","span":{"begin":20613,"end":20629}},{"label":"td","span":{"begin":20631,"end":20647}},{"label":"tr","span":{"begin":20648,"end":22285}},{"label":"td","span":{"begin":20648,"end":20650}},{"label":"td","span":{"begin":20651,"end":20653}},{"label":"td","span":{"begin":20655,"end":20923}},{"label":"td","span":{"begin":20925,"end":21197}},{"label":"td","span":{"begin":21199,"end":21467}},{"label":"td","span":{"begin":21469,"end":21741}},{"label":"td","span":{"begin":21743,"end":22011}},{"label":"td","span":{"begin":22013,"end":22285}},{"label":"tr","span":{"begin":22286,"end":22336}},{"label":"td","span":{"begin":22286,"end":22291}},{"label":"td","span":{"begin":22293,"end":22297}},{"label":"td","span":{"begin":22299,"end":22304}},{"label":"td","span":{"begin":22306,"end":22310}},{"label":"td","span":{"begin":22312,"end":22317}},{"label":"td","span":{"begin":22319,"end":22323}},{"label":"td","span":{"begin":22325,"end":22330}},{"label":"td","span":{"begin":22332,"end":22336}}],"tracks":[]}