PMC:1647278 / 56375-65578
Annotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1647278","sourcedb":"PMC","sourceid":"1647278","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1647278","text":"Practical case\nWe have seen with our first example that our approximation works very well in a simple case. Will this hold with more practical cases?\nTo answer this question, let us consider the following experimental design:\n• one pattern: W = acgtacgt;\n• two genomes: Escherichia coli K12 (ℓ = n = 4639675) and Mycoplasma genitalium (ℓ = n = 580076);\n• five Markov orders: m = 1 to m = 5 (larger m are not considered since the computation of C becomes then intractable).\nAs the sequence lengths and compositions of the two considered genomes differ a lot, we have to take a different value of Nobs for each organism: Nobs = 30 for M. genitalium and Nobs = 150 for E. coli. Proceeding as indicated in section \"simulations\", we use the algorithm 1 for each experiment.\nTable 1 Comparison of theoretical and empirical pattern statistic mean and standard deviation on Escherichia coli K12.\nm S σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n1 35.57 0.28 35.57 0.27\n2 31.61 0.49 31.60 0.50\n3 46.75 1.04 46.77 1.03\n4 45.33 1.74 45.32 1.81\n5 62.27 3.45 62.36 3.34\nWe consider the pattern W = acgtacgt with Nobs = 150. 
The sequence length is ℓ = 4639675, we use an order m Markov model and a sample of size M = 1 000.\nTable 2 Comparison of theoretical and empirical pattern statistic mean and standard deviation on Mycoplasma genitalium.\nm S σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n1 42.48 0.38 42.47 0.40\n2 44.62 0.78 44.62 0.81\n3 55.96 1.49 56.02 1.52\n4 55.06 3.39 55.48 3.48\n5 56.49 10.35 57.21* 9.09*\nWe consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580 076, we use an order m Markov model and a sample of size M = 1 000. (*) for 123 terms in the sample we got P (N) = 0 and hence, SN was not computed. Algorithm 1 simulations for one experiment in the practical case\n1: estimate the order m parameter π (and μ) from the original sequence. Although these parameters are estimated, they are considered as the true parameters;\n2: compute S = -log10 ℙ(N ≥ Nobs);\n3: compute σ using approximation (23)\n4: for j = 1 ... 1 000 do\n5: draw a random sequence Y = Y1 ... Yn according to and order m stationary Markov model of parameter π;\n6: compute N the frequency vector of all size m and size m + 1 words in Y;\n7: compute Sj = SN = -log10 ℙ(N ≥ Nobs);\n8: end for\n9: compute S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ (resp. 
σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@) the mean (resp. standard deviation) of the sample S1,..., Sj.\nWe can see on table 1 the results for E. coli. For each Markov model considered, our approximation of σ is very close to the empiric ones and, as with figure 1, the Gaussian distribution fit well to the empiric one (data not shown). Table 2 shows the same behaviour with M. genitalium except for m = 5 where σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ differs slightly more than in the other cases from its theoretical value. To understand this phenomenon, let us first recall the expression of P(N) for m = 5 using equation (15):\nP ( N ) = N 1 ( agctac ) × N 1 ( gctacg ) × N 1 ( ctacgt ) ( ℓ − m + 1 ) × N 0 ( gctac ) × N 0 ( ctacg ) 
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakieqacqWFobGtcqGGPaqkcqGH9aqpdaWcaaqaaiab=5eaonaaBaaaleaacqaIXaqmaeqaaOGaeiikaGIaeeyyaeMaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabigdaXaqabaGccqGGOaakcqqGNbWzcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkcqGHxdaTcqWFobGtdaWgaaWcbaGaeGymaedabeaakiabcIcaOiabbogaJjabbsha0jabbggaHjabbogaJjabbEgaNjabbsha0jabcMcaPaqaaiabcIcaOiabloriSjabgkHiTiabd2gaTjabgUcaRiabigdaXiabcMcaPiabgEna0kab=5eaonaaBaaaleaacqaIWaamaeqaaOGaeiikaGIaee4zaCMaee4yamMaeeiDaqNaeeyyaeMaee4yamMaeiykaKIaey41aqRae8Nta40aaSbaaSqaaiabicdaWaqabaGccqGGOaakcqqGJbWycqqG0baDcqqGHbqycqqGJbWycqqGNbWzcqGGPaqkaaaaaa@7A52@\nand as ℙ(N1 (agctac) = 0) ≃ 2.26 × 10-6, ℙ(N1 (gctacg) = 0) ≃ 1.35 × 10-1 and ℙ(N1 (ctacgt) = 0) ≃ 1.24 × 10-4 we will have P(N) = 0 roughly 14% of the time. This happened 123 times in our sample of size 1 000, each time preventing to compute SN. The sample is hence biased and S^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ and σ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ are therefore not accurate.\nWhat happen now if we use another statistical method to compute the pattern statistics. As the binomial approximation is supposed to be close to the exact solution, we expect the standard deviation obtained with other statistical methods to remain close to σ. 
In table 3, we compare the empirical results using binomial approximations (like above) but also compound Poisson or large deviations approximations. Both empirical means and standard deviations are close to the theoretical ones thus validating the method.\nTable 3 Comparison of theoretical and empirical pattern statistics mean and deviation on Mycoplasma genitalium.\ntheoretical binomial compound Poisson large deviations\nS σ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@ S ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqcaaaa@2DEB@ σ ^ 
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFdpWCgaqcaaaa@2E86@\n55.96 1.49 56.05 1.47 55.42 1.45 54.27 1.43\nWe consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580076, we use an order m = 3 Markov model and a sample of size M = 1 000. The pattern statistics are computed (from left to right) through binomial, compound Poisson or large deviations approximations.","divisions":[{"label":"title","span":{"begin":0,"end":14}},{"label":"p","span":{"begin":15,"end":149}},{"label":"p","span":{"begin":150,"end":225}},{"label":"p","span":{"begin":226,"end":254}},{"label":"p","span":{"begin":255,"end":352}},{"label":"p","span":{"begin":353,"end":472}},{"label":"p","span":{"begin":473,"end":768}},{"label":"table-wrap","span":{"begin":769,"end":1735}},{"label":"label","span":{"begin":769,"end":776}},{"label":"caption","span":{"begin":778,"end":888}},{"label":"p","span":{"begin":778,"end":888}},{"label":"table","span":{"begin":889,"end":1582}},{"label":"tr","span":{"begin":889,"end":1442}},{"label":"td","span":{"begin":889,"end":891}},{"label":"td","span":{"begin":892,"end":894}},{"label":"td","span":{"begin":896,"end":898}},{"label":"td","span":{"begin":900,"end":1168}},{"label":"td","span":{"begin":1170,"end":1442}},{"label":"tr","span":{"begin":1443,"end":1470}},{"label":"td","span":{"begin":1443,"end":1444}},{"label":"td","span":{"begin":1446,"end":1451}},{"label":"td","span":{"begin":1453,"end":1457}},{"label":"td","span":{"begin":1459,"end":1464}},{"label":"td","span":{"begin":1466,"end":1470}},{"label":"tr","span":{"begin":1471,"end":1498}},{"label":"td","span":{"begin":1471,"end":1472}},{"label":"td","span":{"begin":1474,"end":1479}},{"label":"td","span":{"begin":1481,"end":1485}},{"label":"td","span":{"begin":1487,"end":1492}},{"label":"td","sp
an":{"begin":1494,"end":1498}},{"label":"tr","span":{"begin":1499,"end":1526}},{"label":"td","span":{"begin":1499,"end":1500}},{"label":"td","span":{"begin":1502,"end":1507}},{"label":"td","span":{"begin":1509,"end":1513}},{"label":"td","span":{"begin":1515,"end":1520}},{"label":"td","span":{"begin":1522,"end":1526}},{"label":"tr","span":{"begin":1527,"end":1554}},{"label":"td","span":{"begin":1527,"end":1528}},{"label":"td","span":{"begin":1530,"end":1535}},{"label":"td","span":{"begin":1537,"end":1541}},{"label":"td","span":{"begin":1543,"end":1548}},{"label":"td","span":{"begin":1550,"end":1554}},{"label":"tr","span":{"begin":1555,"end":1582}},{"label":"td","span":{"begin":1555,"end":1556}},{"label":"td","span":{"begin":1558,"end":1563}},{"label":"td","span":{"begin":1565,"end":1569}},{"label":"td","span":{"begin":1571,"end":1576}},{"label":"td","span":{"begin":1578,"end":1582}},{"label":"table-wrap-foot","span":{"begin":1583,"end":1735}},{"label":"p","span":{"begin":1583,"end":1735}},{"label":"table-wrap","span":{"begin":1736,"end":2786}},{"label":"label","span":{"begin":1736,"end":1743}},{"label":"caption","span":{"begin":1745,"end":1856}},{"label":"p","span":{"begin":1745,"end":1856}},{"label":"table","span":{"begin":1857,"end":2553}},{"label":"tr","span":{"begin":1857,"end":2410}},{"label":"td","span":{"begin":1857,"end":1859}},{"label":"td","span":{"begin":1860,"end":1862}},{"label":"td","span":{"begin":1864,"end":1866}},{"label":"td","span":{"begin":1868,"end":2136}},{"label":"td","span":{"begin":2138,"end":2410}},{"label":"tr","span":{"begin":2411,"end":2438}},{"label":"td","span":{"begin":2411,"end":2412}},{"label":"td","span":{"begin":2414,"end":2419}},{"label":"td","span":{"begin":2421,"end":2425}},{"label":"td","span":{"begin":2427,"end":2432}},{"label":"td","span":{"begin":2434,"end":2438}},{"label":"tr","span":{"begin":2439,"end":2466}},{"label":"td","span":{"begin":2439,"end":2440}},{"label":"td","span":{"begin":2442,"end":2447}},{"label":"td","span
":{"begin":2449,"end":2453}},{"label":"td","span":{"begin":2455,"end":2460}},{"label":"td","span":{"begin":2462,"end":2466}},{"label":"tr","span":{"begin":2467,"end":2494}},{"label":"td","span":{"begin":2467,"end":2468}},{"label":"td","span":{"begin":2470,"end":2475}},{"label":"td","span":{"begin":2477,"end":2481}},{"label":"td","span":{"begin":2483,"end":2488}},{"label":"td","span":{"begin":2490,"end":2494}},{"label":"tr","span":{"begin":2495,"end":2522}},{"label":"td","span":{"begin":2495,"end":2496}},{"label":"td","span":{"begin":2498,"end":2503}},{"label":"td","span":{"begin":2505,"end":2509}},{"label":"td","span":{"begin":2511,"end":2516}},{"label":"td","span":{"begin":2518,"end":2522}},{"label":"tr","span":{"begin":2523,"end":2553}},{"label":"td","span":{"begin":2523,"end":2524}},{"label":"td","span":{"begin":2526,"end":2531}},{"label":"td","span":{"begin":2533,"end":2538}},{"label":"td","span":{"begin":2540,"end":2546}},{"label":"td","span":{"begin":2548,"end":2553}},{"label":"table-wrap-foot","span":{"begin":2554,"end":2786}},{"label":"p","span":{"begin":2554,"end":2786}},{"label":"p","span":{"begin":2787,"end":2851}},{"label":"p","span":{"begin":2852,"end":3008}},{"label":"p","span":{"begin":3009,"end":3043}},{"label":"p","span":{"begin":3044,"end":3081}},{"label":"p","span":{"begin":3082,"end":3107}},{"label":"p","span":{"begin":3108,"end":3212}},{"label":"p","span":{"begin":3213,"end":3287}},{"label":"p","span":{"begin":3288,"end":3328}},{"label":"p","span":{"begin":3329,"end":3339}},{"label":"p","span":{"begin":3340,"end":3956}},{"label":"p","span":{"begin":3957,"end":4713}},{"label":"p","span":{"begin":4714,"end":5697}},{"label":"p","span":{"begin":5698,"end":6543}},{"label":"p","span":{"begin":6544,"end":7060}},{"label":"label","span":{"begin":7061,"end":7068}},{"label":"caption","span":{"begin":7070,"end":7173}},{"label":"p","span":{"begin":7070,"end":7173}},{"label":"table","span":{"begin":7174,"end":8920}},{"label":"tr","span":{"begin":7174,"end":72
31}},{"label":"td","span":{"begin":7174,"end":7185}},{"label":"td","span":{"begin":7187,"end":7195}},{"label":"td","span":{"begin":7197,"end":7213}},{"label":"td","span":{"begin":7215,"end":7231}},{"label":"tr","span":{"begin":7232,"end":8869}},{"label":"td","span":{"begin":7232,"end":7234}},{"label":"td","span":{"begin":7235,"end":7237}},{"label":"td","span":{"begin":7239,"end":7507}},{"label":"td","span":{"begin":7509,"end":7781}},{"label":"td","span":{"begin":7783,"end":8051}},{"label":"td","span":{"begin":8053,"end":8325}},{"label":"td","span":{"begin":8327,"end":8595}},{"label":"td","span":{"begin":8597,"end":8869}},{"label":"tr","span":{"begin":8870,"end":8920}},{"label":"td","span":{"begin":8870,"end":8875}},{"label":"td","span":{"begin":8877,"end":8881}},{"label":"td","span":{"begin":8883,"end":8888}},{"label":"td","span":{"begin":8890,"end":8894}},{"label":"td","span":{"begin":8896,"end":8901}},{"label":"td","span":{"begin":8903,"end":8907}},{"label":"td","span":{"begin":8909,"end":8914}},{"label":"td","span":{"begin":8916,"end":8920}}],"tracks":[]}
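The "roughly 14%" figure quoted in the text for the m = 5 case on M. genitalium can be checked with a short calculation. The sketch below takes the three zero-count probabilities exactly as given in the text and combines them under the assumption that the three events are independent. That independence is an assumption made here for illustration and is not stated in the source. The names `p_zero`, `prob_any_zero`, and `expected_missing` are hypothetical.

```python
# Probabilities, as quoted in the text, that each numerator count N1(.) is zero
# (order m = 5 Markov model, M. genitalium).
p_zero = {
    "agctac": 2.26e-6,
    "gctacg": 1.35e-1,
    "ctacgt": 1.24e-4,
}


def prob_any_zero(probs):
    """Probability that at least one count is zero.

    ASSUMPTION (not from the source): the zero-count events are treated
    as independent, so P(none zero) is the product of (1 - p).
    """
    none_zero = 1.0
    for p in probs:
        none_zero *= 1.0 - p
    return 1.0 - none_zero


# P(N) = 0 as soon as any numerator count vanishes.
p_any = prob_any_zero(p_zero.values())

# Expected number of simulated sequences, out of M = 1000, with P(N) = 0.
sample_size = 1000
expected_missing = sample_size * p_any

print(f"P(at least one zero count) = {p_any:.4f}")
print(f"expected undefined S_N in a sample of {sample_size}: {expected_missing:.1f}")
```

Under this independence assumption the probability is about 0.135, dominated by the gctacg term, which is consistent with the "roughly 14%" in the text and of the same order as the 123 undefined values reported for the sample of 1 000.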