Construction of density kernel
The shape of the density kernel should match the fractal nature of the iterative USM function itself. The solution reported here is first described for a single USM coordinate and illustrated for an arbitrary coordinate of the map, say the horizontal dimension of the forward map in Figure 1. The value, K, of the proposed kernel function (Equation 2) at map coordinate position u depends on two user-defined parameters: memory length, L, and smoothing, S, which is the ratio between the areas assigned to two consecutive Markov orders (e.g. S = 2 implies that the kernel density area assigned to order i ≤ L-1 is twice the area assigned to order i-1).
$$K(u) = \sum_{j=1}^{N} \sum_{i=1}^{L} \begin{cases} H(i, D, L, S) & \text{if } LB(i, x_j) < u < UB(i, x_j) \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Equation 2)}$$
The parameter D is the number of dimensions of the unitary USM hypercube (e.g. D = 2 for the example in Figure 1). Equation 2 states that the kernel density value at position u is obtained by summing, over the orders up to L-1, the scale-dependent height H once for every element x_j of the kernel training dataset, x, whose scale-dependent neighborhood, confined by the lower and upper boundaries LB and UB, respectively, contains u. The choice of memory length, L, sets the resolution of the density function. This is reflected graphically in the finer grain of the density distribution for higher values of L in Figure 2 and Figure 3.
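The double sum of Equation 2 can be sketched for a single USM coordinate as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the boundaries follow the floor-based form of Equation 4, and the height is passed in as a callable so that the sketch does not commit to a particular reading of Equation 3. The unit height used in the demonstration is a placeholder that simply counts how many (point, scale) neighborhoods contain u.

```python
import math
from typing import Callable, Sequence


def lower_bound(i: int, x: float) -> float:
    """Lower edge of the width-2**-i interval that contains x (Equation 4)."""
    return math.floor(x * 2**i) / 2**i


def upper_bound(i: int, x: float) -> float:
    """Upper edge of the width-2**-i interval that contains x (Equation 4)."""
    return (math.floor(x * 2**i) + 1) / 2**i


def kernel_density(u: float, xs: Sequence[float], L: int,
                   height: Callable[[int], float]) -> float:
    """Evaluate the double sum of Equation 2 at position u.

    `height(i)` stands in for H(i, D, L, S); it is supplied by the caller.
    """
    total = 0.0
    for x in xs:                      # j = 1..N, the training coordinates
        for i in range(1, L + 1):     # i = 1..L, the scales
            if lower_bound(i, x) < u < upper_bound(i, x):
                total += height(i)
    return total


# Forward USM coordinates quoted in the text for the sequence 'ABABAAA'.
xs = [0.3138, 0.6569, 0.3284, 0.6642, 0.3321, 0.1661, 0.0830]


def unit_height(i: int) -> float:
    """Placeholder height (an assumption, NOT Equation 3): every step adds 1."""
    return 1.0


print(kernel_density(0.30, xs, L=2, height=unit_height))
```

With the placeholder height the result is only a count of overlapping neighborhoods; substituting a function that implements Equation 3 turns the count into the density itself.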
Figure 2  Illustration of the kernel density, K, for a small binary sequence, "ABABAAA", along its single USM axis, using different values for memory length, L, and smoothing, S. The same seven coordinates are used in all plots, which implies that each of the 6 density plots has the same area of 7 kernel units.
Figure 3  Determination of kernel density, Equation 2, in the forward map of the sequence "ACTGCCC" used to produce Figure. To illustrate the effect of using different settings for memory length, L, and smoothing, S, the kernel density was determined for the four different combinations of L = {4, 5} and S = {1, 1/3}.
As will be shown next, the kernel volume defined by this surface is equal to the number of points (sequence units), N, of the kernel-training dataset x. This result, strictly considered, disqualifies K as a kernel density function, as kernel density volumes are unitary by definition. There are a number of reasons why a volume equal to the number of sequence units is desirable, particularly when sequences of different lengths are being compared. A compliant alternative definition of K is in any case obtained by dividing the expression in Equation 2 by the total length of the training sequences, N. This alternative will not be explicitly explored here because the scale alteration is so straightforward that it can easily be applied to any of the results reported here. The 2D density plots are offered without a scale in the z-axis to highlight the inconsequence of the correction. On the other hand, when multiple sequences are plotted together, as in Figure 4, the effect is that the same motif in two sequences is represented with the same density height, Equation 3, even if the two sequences have very different lengths.
Figure 4  Kernel density for L = 4 and S = 1 applied to the concatenation of 20 promoter regions of Bacillus subtilis (see Discussion). The density is displayed both as a 3D bar plot (top) and as a 2D gray scale heat map (bottom). The accurate capturing of conserved tetranucleotide segments is illustrated for the TATA-box in the latter view, and for the TTGACA binding site at position -35 in the former. The two views also illustrate the two types of decomposition of conserved sequences. For the TTGACA sequence the decomposition is performed at the resolution of the kernel (L = 4), and all 3 tetranucleotides embedded in the 6 unit sequence are identified. The density scale is normalized to the length of the sequence so the average height is one unit, which is to say that the area of the density distribution is, as it should be for a unit square base, unitary by definition. The three tables at the top detail the densities of the possible tetranucleotides for each of the trinucleotide quadrants. In each of them the conserved segment invariably has the highest density. The decomposition of the TATA-box, in the bottom view, is instead illustrated for a succession of scales, from mononucleotide to tetranucleotide. The cumulative distribution of densities is displayed at the top left, disclosing a skew towards lower values, with over 60% of densities below the unit average.
The kernel density definition in Equation 2 is completed by two more expressions, Equation 3 and Equation 4, where the height function and its boundaries are detailed. The kernel density height function, Equation 3, establishes the step height added at each memory length smaller than or equal to the value of L. It is useful to recall that the memory length is one unit larger than the Markovian order, e.g. for nucleotide sequences, memory length one corresponds to mononucleotide frequencies, memory length two corresponds to dinucleotide frequencies, which populate a first order Markov transition table, and so on.
$$H(i, D, L, S) = \frac{\left(2^{D} / S\right)^{i}}{\sum_{r=0}^{L} S^{-L}} \qquad \text{(Equation 3)}$$
The boundary values set by the functions LB and UB, Equation 4, define the neighborhood of a training sequence unit, that is, the neighborhood of its USM position, x, within which the corresponding value of H, Equation 3, is added to the kernel density height, as detailed in Equation 2.
$$LB(i, x) = \frac{\lfloor x \cdot 2^{i} \rfloor}{2^{i}} \qquad UB(i, x) = \frac{\lfloor x \cdot 2^{i} \rfloor + 1}{2^{i}} \qquad \text{(Equation 4)}$$
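As a small worked illustration of Equation 4, the sketch below lists the nested intervals that surround one USM coordinate as the scale index i grows. The function name `bounds` is hypothetical, and the coordinate 0.3138 is the first forward coordinate quoted in the text for 'ABABAAA'. Each interval has width 2^-i and lies inside the previous one, which is what makes the neighborhood scale dependent.

```python
import math
from typing import Tuple


def bounds(i: int, x: float) -> Tuple[float, float]:
    """Return (LB(i, x), UB(i, x)) as defined in Equation 4."""
    cell = math.floor(x * 2**i)          # index of the width-2**-i cell holding x
    return cell / 2**i, (cell + 1) / 2**i


x = 0.3138
for i in range(1, 5):
    lb, ub = bounds(i, x)
    print(f"i={i}: LB={lb}, UB={ub}, width={ub - lb}")
```

For i = 1 to 4 the intervals are (0, 0.5), (0.25, 0.5), (0.25, 0.375) and (0.3125, 0.375), so the neighborhood halves at every step while always containing x.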
Before illustrating the calculation of the kernel density for a multi-dimensional USM hypercube, it is useful to illustrate the procedure for the one-dimensional example of a binary sequence such as 'ABABAAA'. The corresponding USM forward coordinates would be [0.3138 0.6569 0.3284 0.6642 0.3321 0.1661 0.0830], and the corresponding kernel density, Equation 2, for all positions in the one-dimensional USM map is shown in Figure 2 for different values of memory length, L, and smoothing, S.
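The quoted coordinates can be checked against the usual iterated-map rule, in which each new coordinate lies halfway between the previous coordinate and the corner of the current symbol. The sketch below rests on assumptions not stated in this section: the halving rule itself, the assignment A = 0 and B = 1, and the hypothetical function name `usm_forward_1d`. Because the seed of the map is not given here, the first coordinate is taken from the text and not computed.

```python
from typing import List


def usm_forward_1d(symbols: str, first: float) -> List[float]:
    """Iterate x_j = (x_{j-1} + corner(s_j)) / 2 along one USM axis.

    Assumptions: 'A' sits at 0 and 'B' at 1; `first` is the coordinate of the
    first symbol, supplied by the caller because the seed is unknown.
    """
    corner = {"A": 0.0, "B": 1.0}
    coords = [first]
    for s in symbols[1:]:
        coords.append((coords[-1] + corner[s]) / 2)
    return coords


reported = [0.3138, 0.6569, 0.3284, 0.6642, 0.3321, 0.1661, 0.0830]
coords = usm_forward_1d("ABABAAA", first=reported[0])

# Largest gap between the recomputed and the quoted (4-decimal) coordinates.
max_dev = max(abs(c - r) for c, r in zip(coords, reported))
print([round(c, 4) for c in coords])
print(max_dev < 1e-4)
```

Under these assumptions the recomputed values agree with the quoted ones to within the rounding of the four printed decimals.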
Figure 2 illustrates how the choice of parameters sets both the resolution and the detail of the pattern representation. If smoothing is set to +∞ then the kernel density will be distributed between the different fractions exactly as it would be in a Markov transition matrix with the same memory length. This becomes clearer when a two-dimensional example is used, such as the more familiar representation of nucleotide sequences. To illustrate this procedure, Equation 2 was applied to the forward map of a small nucleotide sequence represented in Figure 1, which results in the density distribution represented in Figure 3.