3 Preliminaries We will first describe our problem formulation and the details of the EM algorithm in the context of motif finding problem. We will then describe some details of the dynamical system of the log-likelihood function which enables us to search for the nearby local optimal solutions. 3.1 Problem Formulation Some promising initial alignments are obtained by applying projection methods or random starts on the entire dataset. Typically, random starts are used because they are cost efficient. The most promising sets of alignments are considered for further processing. These initial alignments are then converted into profile representation. Let t be the total number of sequences and S = {S1, S2...St} be the set of t sequences. Let P be a single alignment containing the set of segments {P1, P2, ..., Pt}. l is the length of the consensus pattern. For further discussion, we use the following variables i = 1 ... t     - - for t sequences k = 1 ... l     - - for positions within an l-mer j ∈ {A, T, G, C}     - - for each nucleotide The count matrix can be constructed from the given alignments as shown in Table 1. We define C0, j to be the overall background count of each nucleotide in all of the sequences. Similarly, Ck, j is the count of each nucleotide in the kth position (of the l - mer) in all the segments in P. Table 1 Position Count Matrix. j k = 0 k = 1 k = 2 k = 3 k = 4 ... k = l A C 0,1 C 1,1 C 2,1 C 3,1 C 4,1 ... C l,1 T C 0,2 C 1,2 C 2,2 C 3,2 C 4,2 ... C l,2 G C 0,3 C 1,3 C 2,3 C 3,3 C 4,3 ... Cl,3 C C 0,4 C 1,4 C 2,4 C 3,4 C 4,4 ... C l,4 A count of nucleotides A,T,G,C at each position K = 1..l in all the sequences of the data set. K = 0 denotes the background count. Q 0 , j = C 0 , j ∑ J ∈ { A , T , G , C } C 0 , J       ( 1 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaeGimaaJaeiilaWIaemOAaOgabeaakiabg2da9maalaaabaGaem4qam0aaSbaaSqaaiabicdaWiabcYcaSiabdQgaQbqabaaakeaadaaeqaqaaiabdoeadnaaBaaaleaacqaIWaamcqGGSaalcqWGkbGsaeqaaaqaaiabdQeakjabgIGiolabcUha7jabdgeabjabcYcaSiabdsfaujabcYcaSiabdEeahjabcYcaSiabdoeadjabc2ha9bqab0GaeyyeIuoaaaGccaWLjaGaaCzcamaabmaabaGaeGymaedacaGLOaGaayzkaaaaaa@4D26@ Q k , j = C k , j + b j t + ∑ J ∈ { A , T , G , C } b J       ( 2 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaakiabg2da9maalaaabaGaem4qam0aaSbaaSqaaiabdUgaRjabcYcaSiabdQgaQbqabaGccqGHRaWkcqWGIbGydaWgaaWcbaGaemOAaOgabeaaaOqaaiabdsha0jabgUcaRmaaqababaGaemOyai2aaSbaaSqaaiabdQeakbqabaaabaGaemOsaOKaeyicI4Saei4EaSNaemyqaeKaeiilaWIaemivaqLaeiilaWIaem4raCKaeiilaWIaem4qamKaeiyFa0habeqdcqGHris5aaaakiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@528F@ Eq. (1) shows the background frequency of each nucleotide. bj (and bJ) is known as the Laplacian or Bayesian correction and is equal to d * Q0, j where d is some constant usually set to unity. Eq. (2) gives the weight assigned to the type of nucleotide at the kth position of the motif. A Position Specific Scoring Matrix (PSSM) can be constructed from one set of instances in a given set of t sequences. From (1) and (2), it is obvious that the following relationship holds: ∑ j ∈ { A , T , G , C } Q k , j = 1 ∀ k = 0 , 1 , 2 , ... l       ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaWaaabuaeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaaaeaacqWGQbGAcqGHiiIZcqGG7bWEcqWGbbqqcqGGSaalcqWGubavcqGGSaalcqWGhbWrcqGGSaalcqWGdbWqcqGG9bqFaeqaniabggHiLdGccqGH9aqpcqaIXaqmaeaacqGHaiIicqWGRbWAcqGH9aqpcqaIWaamcqGGSaalcqaIXaqmcqGGSaalcqaIYaGmcqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqWGSbaBaaGaaCzcaiaaxMaadaqadaqaaiabiodaZaGaayjkaiaawMcaaaaa@531A@ For a given k value in (3), each Q can be represented in terms of the other three variables. Since the length of the motif is l, the final objective function (i.e. the information content score) would contain 3l independent variables. It should be noted that even if there are 4l variables in total, the parameter space will contain only 3l independent variables because of the constraints obtained from (3). Thus, the constraints help in reducing the dimensionality of the search problem. To obtain the information content (IC) score, every possible l - mer in each of the t sequences must be examined. This is done so by multiplying the respective Qi, j/Q0, j dictated by the nucleotides and their respective positions within the l - mer. Only the highest scoring l - mer in each sequence is noted and kept as part of the alignment. The total score is the sum of all the best (logarithmic) scores in each sequence. A ( Q ) = ∑ i = 1 t l o g ( A ) i = ∑ i = 1 t l o g ( ∏ k = 1 l Q k , j Q b ) i       ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGbbqqcqGGOaakcqWGrbqucqGGPaqkcqGH9aqpdaaeWbqaaGqaciab=XgaSjab=9gaVjab=DgaNjabcIcaOiabdgeabjabcMcaPmaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdsha0bqdcqGHris5aOGaeyypa0ZaaabCaeaacqWFSbaBcqWFVbWBcqWFNbWzdaqadaqaamaarahabaWaaSaaaeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaaaOqaaiabdgfarnaaBaaaleaacqWGIbGyaeqaaaaaaeaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGSbaBa0Gaey4dIunaaOGaayjkaiaawMcaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemiDaqhaniabggHiLdGcdaWgaaWcbaGaemyAaKgabeaakiaaxMaacaWLjaWaaeWaaeaacqaI0aanaiaawIcacaGLPaaaaaa@629A@ where Qk, j/Qb represents the ratio of the nucleotide probability to the corresponding background probability. Log(A)i is the score at each individual ith sequence. In equation (4), we see that A is composed of the product of the weights for each individual position k. We consider this to be the Information Content (IC) score which we would like to maximize. A(Q) is the non-convex 3l dimensional continuous function for which the global maximum corresponds to the best possible motif in the dataset. EM refinement performed at the end of a combinatorial approach has the disadvantage of converging to a local optimal solution [22]. Our method improves the procedure for refining motif by understanding the details of the stability boundaries and by trying to escape out of the convergence region of the EM algorithm. 3.2 Hessian Computation and Dynamical System for the Scoring Function In order to present our algorithm, we have defined the dynamical system corresponding to the log-likelihood function and the PSSM. The key contribution of the paper is the development of this nonlinear dynamical system which will enable us to realize the geometric and dynamic nature of the likelihood surface by allowing us to understand the topology and convergence behaviour of any given subspace on the surface. We construct the following gradient system in order to locate critical points of the objective function (4): Q˙ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGrbqugaGaaaaa@2DE0@(t) = -∇ A(Q)     (5) One can realize that this transformation preserves all of the critical points [20]. Now, we will describe the construction of the gradient system and the Hessian in detail. In order to reduce the dominance of one variable over the other, the values of each of the nucleotides that belong to the consensus pattern at the position k will be represented in terms of the other three nucleotides in that particular column. Let Pikdenote the kth position in the segment Pi. This will also minimize the dominance of the eigenvector directions when the Hessian is obtained. The variables in the scoring function are transformed into new variables described in Table 2. Thus, Eq. (4) can be rewritten in terms of the 3l variables as follows: Table 2 Position Weight Matrix. A count of nucleotides j ∈ {A, T, G, C} at each position k = 1..l in all the sequences of the data set. Ck is the kth nucleotide of the consensus pattern which represents the nucleotide with the highest value in that column. Let the consensus pattern be GACT...G and bj be the background. j k = b k = 1 k = 2 K = 3 k = 4 ... k = l A b A w 1 C 2 w 7 w 10 ... w 3l-2 T b T w 2 w 4 w 8 C 4 ... w 3l-1 G b G C 1 w 5 w 9 w 11 ... C l C b C W 3 w 6 C 3 w 12 ... W 3l A ( Q ) = ∑ i = 1 t ∑ k = 1 l l o g   f i k ( w 3 k − 2 , w 3 k − 1 , w 3 k ) i       ( 6 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGbbqqcqGGOaakcqWGrbqucqGGPaqkcqGH9aqpdaaeWbqaamaaqahabaacbiGae8hBaWMae83Ba8Mae83zaCgaleaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGSbaBa0GaeyyeIuoaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdsha0bqdcqGHris5aOGaeeiiaaIaemOzay2aaSbaaSqaaiabdMgaPjabdUgaRbqabaGccqGGOaakcqWG3bWDdaWgaaWcbaGaeG4mamJaem4AaSMaeyOeI0IaeGOmaidabeaakiabcYcaSiabdEha3naaBaaaleaacqaIZaWmcqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeiilaWIaem4DaC3aaSbaaSqaaiabiodaZiabdUgaRbqabaGccqGGPaqkdaWgaaWcbaGaemyAaKgabeaakiaaxMaacaWLjaWaaeWaaeaacqaI2aGnaiaawIcacaGLPaaaaaa@6150@ where fik can take the values {w3k-2, w3k-1, w3k, 1 - (w3k-2 + w3k-1 + w3k)} depending on the Pik value. The first derivative of the scoring function is a one dimensional vector with 3l elements. ∇ A = [ ∂ A ∂ w 1 ∂ A ∂ w 2 ∂ A ∂ w 3 .... ∂ A ∂ w 3 l ] T       ( 7 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqGHhis0cqWGbbqqcqGH9aqpdaWadaqaamaalaaabaGaeyOaIyRaemyqaeeabaGaeyOaIyRaem4DaC3aaSbaaSqaaiabigdaXaqabaaaaOWaaSaaaeaacqGHciITcqWGbbqqaeaacqGHciITcqWG3bWDdaWgaaWcbaGaeGOmaidabeaaaaGcdaWcaaqaaiabgkGi2kabdgeabbqaaiabgkGi2kabdEha3naaBaaaleaacqaIZaWmaeqaaaaakiabc6caUiabc6caUiabc6caUiabc6caUmaalaaabaGaeyOaIyRaemyqaeeabaGaeyOaIyRaem4DaC3aaSbaaSqaaiabiodaZiabdYgaSbqabaaaaaGccaGLBbGaayzxaaWaaWbaaSqabeaacqWGubavaaGccaWLjaGaaCzcamaabmaabaGaeG4naCdacaGLOaGaayzkaaaaaa@5671@ and each partial derivative is given by ∂ A ∂ w p = ∑ i = 1 t ∂ f i p ∂ w p f i k ( w 3 k − 2 , w 3 k − 1 , w 3 k )       ( 8 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabgkGi2kabdgeabbqaaiabgkGi2kabdEha3naaBaaaleaacqWGWbaCaeqaaaaakiabg2da9maaqahabaWaaSaaaeaadaWcaaqaaiabgkGi2kabdAgaMnaaBaaaleaacqWGPbqAcqWGWbaCaeqaaaGcbaGaeyOaIyRaem4DaC3aaSbaaSqaaiabdchaWbqabaaaaaGcbaGaemOzay2aaSbaaSqaaiabdMgaPjabdUgaRbqabaGccqGGOaakcqWG3bWDdaWgaaWcbaGaeG4mamJaem4AaSMaeyOeI0IaeGOmaidabeaakiabcYcaSiabdEha3naaBaaaleaacqaIZaWmcqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeiilaWIaem4DaC3aaSbaaSqaaiabiodaZiabdUgaRbqabaGccqGGPaqkaaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWG0baDa0GaeyyeIuoakiaaxMaacaWLjaWaaeWaaeaacqaI4aaoaiaawIcacaGLPaaaaaa@614C@ ∀p = 1, 2 ... 3l and k = round(p/3)+ 1 The Hessian ∇2A is a block diagonal matrix of block size 3 × 3. For a given sequence, the entries of the 3 × 3 block will be the same if that nucleotide belongs to the consensus pattern (Ck). The gradient system is mainly obtained for enabling us to identify the stability boundaries and stability regions on the likelihood surface. The theoretical details of these concepts are published in [20]. The stability region of each local maximum is an approximate convergence zone of the EM algorithm. If we can identify all the saddle points on the stability boundary of a given local maximum, then we will be able to find all the corresponding Tier-1 local maxima. Tier-1 local maximum is defined as the new local maximum that is connected to the original local maximum through one decomposition point. Similarly, we can define Tier-2 and Tier-k local maxima that will take 2 and k decomposition points respectively. However, finding every saddle point is computationally intractable and hence we have adopted a heuristic by generating the eigenvector directions of the PSSM at the local maximum. Also, for such a complicated likelihood function, it is not efficient to compute all saddle points on the stability boundary. Hence, one can obtain new local maxima by obtaining the exit points instead of the saddle points. The point along a particular direction where the function has the lowest value starting from the given local maximum is called the exit point. The next section details our approach and explains the different phases of our algorithm.