Conclusion

The delta-method allows us to approximate the distribution of Ŝ by a Gaussian distribution. This first requires computing the expectation and covariance matrix of the frequencies, and then studying the derivative of a function that is specific to the method used to compute the pattern statistics. In the case of the binomial approximations, we have found an explicit expression for σ, the standard deviation of Ŝ. Our approximation of σ using the delta-method clearly relies on two major assumptions: 1) the distribution of N is Gaussian; 2) F+ is regular enough (e.g. not too steep) around E. As m grows, E gets closer to the boundary of the domain of definition of F+, which degrades assumption 2. Moreover, it is well known that Gaussian approximations for word frequencies become weaker as the expected numbers of occurrences become smaller, which degrades assumption 1. Our approximation of σ is therefore bound to become less and less reliable as m grows. However, the approximation of σ has been validated through simulations and appears to be very reliable (even for m = 5 or 6). As pattern statistics computed through binomial approximations are close to the exact statistics [8], the value of σ should not differ much when another statistical method is used. We have compared our approximation to the empirical distributions obtained using compound Poisson and large deviations approximations and, as expected, it remains quite reliable for these statistical methods as well.

The variability due to parameter estimation is of course related to the Markov model order m, to the size k of the alphabet (as the model has k^(m+1) parameters), and to the length n of the sequence used for the estimation. For example, fitting an order m = 6 model with n = 4,639,675 (the Escherichia coli K12 complete genome) requires estimating 3 × 4^6 = 12,288 free parameters, which results in roughly 400 observations per free parameter. Although this situation seems quite comfortable, our simulations show that it leads to unacceptable variability in the pattern statistics. As the literature often advises using the highest possible Markov order for a given pattern problem (that is, m = h - 2 for a pattern of length h), it is easy to see that such a practice could have very detrimental effects on the computed statistics unless huge amounts of data are available for estimation purposes. Even if we adopt the more reasonable approach of choosing m within the classical framework of model selection (e.g. using the Akaike Information Criterion, AIC), we get m = 5 for Mycoplasma genitalium and m = 6 for Escherichia coli K12, hence in both cases the same catastrophic results in terms of false positives and even worse ones in terms of ranking. Moreover, we assumed here that our model was homogeneous all along the considered sequences, which is obviously far from true when complete genomes are considered.
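For concreteness, the delta-method computation described above boils down to a quadratic form: writing Σ for the covariance matrix of the frequency vector N, if N is approximately Gaussian with expectation E and Ŝ = F+(N), then σ ≈ sqrt(∇F+(E)ᵀ Σ ∇F+(E)). The following minimal Python sketch illustrates this formula; the helper name delta_method_sigma and the numerical values of the gradient and covariance are placeholders chosen for illustration, not quantities taken from the paper.

```python
import numpy as np

def delta_method_sigma(grad_F_at_E, cov_N):
    """Delta-method standard deviation of S_hat = F+(N):
    sqrt(grad^T . Cov(N) . grad), with the gradient of F+ evaluated at E."""
    grad = np.asarray(grad_F_at_E, dtype=float)
    cov = np.asarray(cov_N, dtype=float)
    return float(np.sqrt(grad @ cov @ grad))

# Placeholder 3-dimensional example (illustration only, not values from the paper):
grad_at_E = np.array([0.8, -1.2, 0.5])            # gradient of F+ at E
cov_freq = 1e-4 * np.array([[2.0, 0.3, 0.1],      # covariance matrix of the
                            [0.3, 1.5, 0.2],      # frequency vector N
                            [0.1, 0.2, 1.0]])
print(delta_method_sigma(grad_at_E, cov_freq))    # approx. 0.017
```

In an actual pattern study, the gradient and the covariance matrix would of course come from the chosen statistical method and the estimated Markov parameters.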
Since complete genomes can rarely be treated as homogeneous, the sample size n used in classical pattern studies is likely to be far smaller than a million (even for human genomes, for example). As a result, the variability we pointed out in this paper will have a considerable detrimental effect on most studies unless the Markov order is carefully chosen. To do so, we advise computing our approximation of σ each time a pattern statistic is produced, and then evaluating, either by simulation (as in this paper) or through theoretical work, the impact of this variability on the study at hand.
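To make this last recommendation concrete, here is a small Python sketch of the simulation-based check: it compares the delta-method σ with the empirical spread of the statistic obtained by drawing Gaussian frequency vectors around E and recomputing the statistic. The function F, the values of E and the covariance matrix used here are invented placeholders; in a real study they would be the pattern-statistic function of the chosen method (binomial, compound Poisson or large deviations) and the estimated moments of the word frequencies.

```python
import numpy as np

def delta_method_sigma(grad_F_at_E, cov_N):
    """Delta-method standard deviation: sqrt(grad^T . Cov(N) . grad)."""
    grad = np.asarray(grad_F_at_E, dtype=float)
    return float(np.sqrt(grad @ np.asarray(cov_N, dtype=float) @ grad))

def simulated_sigma(F, E, cov_N, n_draws=20_000, seed=0):
    """Empirical standard deviation of F(N) when N is drawn from a Gaussian
    distribution with mean E and covariance cov_N (Monte-Carlo check)."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(np.asarray(E, dtype=float),
                                    np.asarray(cov_N, dtype=float),
                                    size=n_draws)
    return float(np.std([F(x) for x in draws], ddof=1))

# Toy statistic and frequency model (placeholders, not the paper's F+):
F = lambda x: np.log(x[0]) - 2.0 * x[1] + x[2] ** 2
E = np.array([0.4, 0.3, 0.3])
cov_N = np.diag([1e-4, 1e-4, 1e-4])
grad_at_E = np.array([1.0 / E[0], -2.0, 2.0 * E[2]])  # gradient of F at E

print("delta-method sigma:", delta_method_sigma(grad_at_E, cov_N))
print("simulated sigma:   ", simulated_sigma(F, E, cov_N))
```

A large gap between the two values would signal that the Gaussian or regularity assumptions discussed above are breaking down for the chosen Markov order.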