Simulations
It is also possible to study the empirical distribution of a SN (for one or more patterns) through simulations.
In order to do so, we first draw M independent sequences Yj = Y1j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGzbqwdaqhaaWcbaGaeGymaedabaGaemOAaOgaaaaa@3061@ ... Ynj MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGzbqwdaqhaaWcbaGaemOBa4gabaGaemOAaOgaaaaa@30D6@ using an order m stationary Markov model of parameters π. Complexity of this step is O(M × n).
For each j we get the frequencies Nj = (N0j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaieqacqWFobGtdaqhaaWcbaGaeGimaadabaGaemOAaOgaaaaa@304F@,N1j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaieqacqWFobGtdaqhaaWcbaGaeGymaedabaGaemOAaOgaaaaa@3051@) (with complexity O(n) for each sequence) of the words of size m and m + 1 in the sequence Yj and use it to compute Sj = SNj MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaacbeGae8Nta40aaWbaaWqabeaacqWGQbGAaaaaleqaaaaa@30C8@ (exact value or approximation). Complexity here depends on the statistical method used to compute Sj (e.g. O(h) using a binomial approximation).
We now have a M – sample S1, ..., SM of SN from which we can easily estimate σ and thus, valid or invalid the approximation through the delta-method.
When used with large value of n (e.g. several millions or more), the complexity of this approach is slowed by the drawn of the sequences Yj. It is therefore possible to improve the method by simulating directly the frequencies N through (5). As this approximation has a very small impact on the distribution of SN (data not shown) it may dramatically speed-up the computations when considering large n or M. It is nevertheless important to point out that drawing a Gaussian vector size L requires to precompute the Choleski decomposition of its covariance matrix which could be a limiting factor when considering large L.