PMC:1687185 / 9233-15883
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"17129371-12651719-1691525","span":{"begin":6461,"end":6463},"obj":"12651719"}],"text":"3.1 Problem Formulation\nSome promising initial alignments are obtained by applying projection methods or random starts on the entire dataset. Typically, random starts are used because they are cost efficient. The most promising sets of alignments are considered for further processing. These initial alignments are then converted into profile representation.\nLet t be the total number of sequences and S = {S1, S2...St} be the set of t sequences. Let P be a single alignment containing the set of segments {P1, P2, ..., Pt}. l is the length of the consensus pattern. For further discussion, we use the following variables\ni = 1 ... t - - for t sequences\nk = 1 ... l - - for positions within an l-mer\nj ∈ {A, T, G, C} - - for each nucleotide\nThe count matrix can be constructed from the given alignments as shown in Table 1. We define C0, j to be the overall background count of each nucleotide in all of the sequences. Similarly, Ck, j is the count of each nucleotide in the kth position (of the l - mer) in all the segments in P.\nTable 1 Position Count Matrix.\nj k = 0 k = 1 k = 2 k = 3 k = 4 ... k = l\nA C 0,1 C 1,1 C 2,1 C 3,1 C 4,1 ... C l,1\nT C 0,2 C 1,2 C 2,2 C 3,2 C 4,2 ... C l,2\nG C 0,3 C 1,3 C 2,3 C 3,3 C 4,3 ... Cl,3\nC C 0,4 C 1,4 C 2,4 C 3,4 C 4,4 ... C l,4\nA count of nucleotides A,T,G,C at each position K = 1..l in all the sequences of the data set. K = 0 denotes the background count. Q 0 , j = C 0 , j ∑ J ∈ { A , T , G , C } C 0 , J ( 1 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaeGimaaJaeiilaWIaemOAaOgabeaakiabg2da9maalaaabaGaem4qam0aaSbaaSqaaiabicdaWiabcYcaSiabdQgaQbqabaaakeaadaaeqaqaaiabdoeadnaaBaaaleaacqaIWaamcqGGSaalcqWGkbGsaeqaaaqaaiabdQeakjabgIGiolabcUha7jabdgeabjabcYcaSiabdsfaujabcYcaSiabdEeahjabcYcaSiabdoeadjabc2ha9bqab0GaeyyeIuoaaaGccaWLjaGaaCzcamaabmaabaGaeGymaedacaGLOaGaayzkaaaaaa@4D26@\nQ k , j = C k , j + b j t + ∑ J ∈ { A , T , G , C } b J ( 2 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaakiabg2da9maalaaabaGaem4qam0aaSbaaSqaaiabdUgaRjabcYcaSiabdQgaQbqabaGccqGHRaWkcqWGIbGydaWgaaWcbaGaemOAaOgabeaaaOqaaiabdsha0jabgUcaRmaaqababaGaemOyai2aaSbaaSqaaiabdQeakbqabaaabaGaemOsaOKaeyicI4Saei4EaSNaemyqaeKaeiilaWIaemivaqLaeiilaWIaem4raCKaeiilaWIaem4qamKaeiyFa0habeqdcqGHris5aaaakiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@528F@\nEq. (1) shows the background frequency of each nucleotide. bj (and bJ) is known as the Laplacian or Bayesian correction and is equal to d * Q0, j where d is some constant usually set to unity. Eq. (2) gives the weight assigned to the type of nucleotide at the kth position of the motif.\nA Position Specific Scoring Matrix (PSSM) can be constructed from one set of instances in a given set of t sequences. From (1) and (2), it is obvious that the following relationship holds:\n∑ j ∈ { A , T , G , C } Q k , j = 1 ∀ k = 0 , 1 , 2 , ... l ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaWaaabuaeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaaaeaacqWGQbGAcqGHiiIZcqGG7bWEcqWGbbqqcqGGSaalcqWGubavcqGGSaalcqWGhbWrcqGGSaalcqWGdbWqcqGG9bqFaeqaniabggHiLdGccqGH9aqpcqaIXaqmaeaacqGHaiIicqWGRbWAcqGH9aqpcqaIWaamcqGGSaalcqaIXaqmcqGGSaalcqaIYaGmcqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqWGSbaBaaGaaCzcaiaaxMaadaqadaqaaiabiodaZaGaayjkaiaawMcaaaaa@531A@\nFor a given k value in (3), each Q can be represented in terms of the other three variables. Since the length of the motif is l, the final objective function (i.e. the information content score) would contain 3l independent variables. It should be noted that even if there are 4l variables in total, the parameter space will contain only 3l independent variables because of the constraints obtained from (3). Thus, the constraints help in reducing the dimensionality of the search problem.\nTo obtain the information content (IC) score, every possible l - mer in each of the t sequences must be examined. This is done so by multiplying the respective Qi, j/Q0, j dictated by the nucleotides and their respective positions within the l - mer. Only the highest scoring l - mer in each sequence is noted and kept as part of the alignment. The total score is the sum of all the best (logarithmic) scores in each sequence.\nA ( Q ) = ∑ i = 1 t l o g ( A ) i = ∑ i = 1 t l o g ( ∏ k = 1 l Q k , j Q b ) i ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGbbqqcqGGOaakcqWGrbqucqGGPaqkcqGH9aqpdaaeWbqaaGqaciab=XgaSjab=9gaVjab=DgaNjabcIcaOiabdgeabjabcMcaPmaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdsha0bqdcqGHris5aOGaeyypa0ZaaabCaeaacqWFSbaBcqWFVbWBcqWFNbWzdaqadaqaamaarahabaWaaSaaaeaacqWGrbqudaWgaaWcbaGaem4AaSMaeiilaWIaemOAaOgabeaaaOqaaiabdgfarnaaBaaaleaacqWGIbGyaeqaaaaaaeaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGSbaBa0Gaey4dIunaaOGaayjkaiaawMcaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemiDaqhaniabggHiLdGcdaWgaaWcbaGaemyAaKgabeaakiaaxMaacaWLjaWaaeWaaeaacqaI0aanaiaawIcacaGLPaaaaaa@629A@\nwhere Qk, j/Qb represents the ratio of the nucleotide probability to the corresponding background probability. Log(A)i is the score at each individual ith sequence. In equation (4), we see that A is composed of the product of the weights for each individual position k. We consider this to be the Information Content (IC) score which we would like to maximize. A(Q) is the non-convex 3l dimensional continuous function for which the global maximum corresponds to the best possible motif in the dataset. EM refinement performed at the end of a combinatorial approach has the disadvantage of converging to a local optimal solution [22]. Our method improves the procedure for refining motif by understanding the details of the stability boundaries and by trying to escape out of the convergence region of the EM algorithm."}