PMC:1647278 / 1064-9749
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"17044916-16722531-1696306","span":{"begin":3077,"end":3078},"obj":"16722531"},{"id":"17044916-9719638-1696307","span":{"begin":3109,"end":3110},"obj":"9719638"},{"id":"17044916-15840704-1696308","span":{"begin":3111,"end":3112},"obj":"15840704"},{"id":"17044916-15662195-1696309","span":{"begin":3175,"end":3177},"obj":"15662195"}],"text":"Background\nIn order to study pattern occurrences in biological sequences, simple frequencies are not relevant in most cases because of pattern overlapping structure as well as composition bias in the sequences. A common workaround consists to compute the significance of an observation assuming the sequence X = X1 ... Xℓ over the finite alphabet A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@. (size k) is generated according to an order m ≥ 1 homogeneous, stationary and ergodic Markov model. Let π (size km+1) defined by\nπ(w, a) = ℙ(Xm+1 = a|X1...Xm = w) ∀(w, a) ∈ A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@m × A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ (1)\nbe the parameter of this Markov model, Π its transition matrix (note that we have Π = π only if m = 1) and μ its stationary distribution (defined by μ × Π = μ).\nWe then introduce the pattern statistic defined by\nS = { − log 10 ℙ ( N ≥ N obs ) if N obs ≥ E [ N ] log 10 ℙ ( N ≤ N obs ) if N obs \u003c E [ N ] ( 2 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWucqGH9aqpdaGabeqaauaabiqaciaaaeaacqGHsislcyGGSbaBcqGGVbWBcqGGNbWzdaWgaaWcbaGaeGymaeJaeGimaadabeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaOGae8xgHaLaeiikaGIaemOta4KaeyyzImRaemOta40aaSbaaSqaaiabb+gaVjabbkgaIjabbohaZbqabaGccqGGPaqkaeaacqqGPbqAcqqGMbGzcqqGGaaicqWGobGtdaWgaaWcbaGaee4Ba8MaeeOyaiMaee4CamhabeaakiabgwMiZkab=ri8fjabcUfaBjabd6eaojabc2faDbqaaiGbcYgaSjabc+gaVjabcEgaNnaaBaaaleaacqaIXaqmcqaIWaamaeqaaOGae8xgHaLaeiikaGIaemOta4KaeyizImQaemOta40aaSbaaSqaaiabb+gaVjabbkgaIjabbohaZbqabaGccqGGPaqkaeaacqqGPbqAcqqGMbGzcqqGGaaicqWGobGtdaWgaaWcbaGaee4Ba8MaeeOyaiMaee4CamhabeaakiabgYda8iab=ri8fjabcUfaBjabd6eaojabc2faDbaaaiaawUhaaiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@815C@\nwhere N is the random number of overlapping occurrences (i. e. X = aababaaba contains three overlapping occurrences of aba but only two non-overlapping ones) of a given fixed pattern on the random sequence X and Nobs is an observation.\nWhen π is known (and hence μ), several statistical methods are available to compute S: exact computations [1-4], Gaussian [5,6], binomial [7,8], compound Poisson [9-11] or large deviations approximations [12]. But in general, the parameter π is not available and must be estimated. Let us denote by N0 (resp. N1) the (overlap) frequencies of all words of size m (resp. m + 1) in the sequence Y = Y1 ... Yn, then the Maximum-Likelihood Estimator (MLE) of π is given by\nπ ^ ( w , a ) = N 1 ( w a ) ∑ b ∈ A N 1 ( w b ) ∀ ( w , a ) ∈ A m × A ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaafaqabeqacaaabaacciGaf8hWdaNbaKaacqGGOaakcqWG3bWDcqGGSaalcqWGHbqycqGGPaqkcqGH9aqpdaWcaaqaaGqabiab+5eaonaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaem4DaCNaemyyaeMaeiykaKcabaWaaabeaeaacqGFobGtdaWgaaWcbaGaeGymaedabeaakiabcIcaOiabdEha3jabdkgaIjabcMcaPaWcbaGaemOyaiMaeyicI4mcdaGae0haXheabeqdcqGHris5aaaaaOqaaiabgcGiIiabcIcaOiabdEha3jabcYcaSiabdggaHjabcMcaPiabgIGiolab9bq8bnaaCaaaleqabaGaemyBa0gaaOGaey41aqRae0haXheaaiaaxMaacaWLjaGaeiikaGIaeG4mamJaeiykaKcaaa@6555@\nand the MLE of μ (as a function of π) is therefore defined by μ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF8oqBgaqcaaaa@2E79@ × Π^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuqHGoaugaqcaaaa@2E3A@ = μ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF8oqBgaqcaaaa@2E79@ where Π^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuqHGoaugaqcaaaa@2E3A@ is the transition matrix associated to π^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcaaaa@2E80@\nWe introduce now the following estimators\nμ N ( w ) = N 0 ( w ) n − m + 1 a n d π N ( w , a ) = N 1 ( w a ) N 0 ( w ) ∀ ( w , a ) ∈ A m × A ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaafaqabeqaeaaaaeaaiiGacqWF8oqBdaWgaaWcbaacbeGae4Nta4eabeaakiabcIcaOiabdEha3jabcMcaPiabg2da9maalaaabaGae4Nta40aaSbaaSqaaiabicdaWaqabaGccqGGOaakcqWG3bWDcqGGPaqkaeaacqWGUbGBcqGHsislcqWGTbqBcqGHRaWkcqaIXaqmaaaabaacbaGae0xyaeMae0NBa4Mae0hzaqgabaGae8hWda3aaSbaaSqaaiab+5eaobqabaGccqGGOaakcqWG3bWDcqGGSaalcqWGHbqycqGGPaqkcqGH9aqpdaWcaaqaaiab+5eaonaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaem4DaCNaemyyaeMaeiykaKcabaGae4Nta40aaSbaaSqaaiabicdaWaqabaGccqGGOaakcqWG3bWDcqGGPaqkaaaabaGaeyiaIiIaeiikaGIaem4DaCNaeiilaWIaemyyaeMaeiykaKIaeyicI4mcdaGaeWhaXh0aaWbaaSqabeaacqWGTbqBaaGccqGHxdaTaaGaeWhaXhKaaCzcaiaaxMaacqGGOaakcqaI0aancqGGPaqkaaa@750F@\nwhich are known to be asymptotically equivalent with the MLE when n is large.\nThe quality of parameter estimation depends both on the number of parameters to estimate (km+1 for an order m Markov model) and of the length (n) of the homogeneous sequence used for their estimation. When the same sequence (or set of sequences) is used both for observed frequencies and parameter estimation, m should not be greater than h – 2 for a pattern of length h (as else, the observed frequency of the pattern will be included in the model). As literature often suggests to use the highest possible order, it is hence common to consider m = 6 or more (for a DNA pattern of size h ≥ 8). Moreover, because of the homogeneity assumption of the model, the considered genomes have often to be segmented first. As a result, the sequences length used for parameter estimations are often dramatically reduced by such segmentation (e. g. n = 105 to n = 106 at the very best for DNA sequences). It is hence quite common to encounter high order Markov models estimated on rather short sequences which could result in high sensitivity to parameter estimation.\nConsidering that Y is generated through a Markov model of parameter π, the main goal of this paper is to study the distribution of SN, the statistic S computed using the estimators μN and πN, and the consequences of its variability in projects using pattern statistics. We first present in details how the delta-method can be used to get a Gaussian approximation for the distribution of SN (using a binomial approximation to compute the pattern statistics). Then these approximations are validated through simulations and, at last, we consider a classical pattern study (finding the most over-represented patterns of a given size) and we evaluate the detrimental effect of parameter estimations both in terms of true positive rate and rank accordance."}