Statistical significance
Once we have found a motif with a particular SP-score, we evaluate its statistical significance by calculating the number of motifs of equal or better quality expected to occur in random data with the same characteristics. Let s denote the score of the motif of length l in question, let f(b) be the zero-corrected background frequency of nucleotide (or residue) b in the input sequences, and let sim(b_1, b_2) be the integral score computed for all residue pairs as above. We compute P_l(X), the probability distribution of scores for a motif of length l in N sequences, in the first two steps below, and derive the e-value of score s in the last two:
1. Calculate the exact probability distribution P_1(X) for a single column of N random residues. We use the multinomial distribution to compute the probability of observing each combination of bases (or residues) in the column under the background distribution, and calculate the corresponding SP-score for the column. We then sum the probabilities of all base combinations that yield the same score. To make the computation feasible for the protein alphabet and for large numbers of sequences, we enumerate the combinations in an order such that each new score and probability can be computed from the previous one by a local update operation.
2. Calculate the probability distribution P_l(X) for l random columns by convolution of P_1(X) as in [38], where we inductively construct the distribution for i columns, P_i(X), from the distribution for i - 1 columns, P_{i-1}(X), and the single-column distribution P_1(X).
3. For a given score s of interest, we calculate the probability that a pattern of length l has a score greater than or equal to s by chance alone. This probability is ∑_{x ≥ s} P_l(x).
4. Finally, we compute the total number of possible motifs of length l in the data. If the sequences have lengths L_1, ..., L_N, then the search space size is L = ∏_{i=1}^{N} (L_i - l + 1). The expected number of alignments with score at least s by chance alone, or the e-value, is equal to L · ∑_{x ≥ s} P_l(x).
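As an illustration, the four steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used here: the function names (column_distribution, motif_distribution, tail_probability, e_value) are hypothetical, the background frequencies and the sim matrix are passed in as plain lists, and step 1 enumerates all residue-count combinations by brute force rather than using the local update ordering described above, so it is practical only for small alphabets and small N.

```python
from collections import defaultdict
from itertools import combinations
from math import factorial, prod


def compositions(n, k):
    """Yield every tuple of k non-negative integer counts that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def column_distribution(freqs, sim, n_seqs):
    """Step 1: exact SP-score distribution P_1 for one column of n_seqs residues.

    Brute-force enumeration (an assumption of this sketch, not the
    incremental update described in the text).
    """
    k = len(freqs)
    dist = defaultdict(float)
    for counts in compositions(n_seqs, k):
        # Multinomial probability of this combination of residue counts.
        coef = factorial(n_seqs)
        for c in counts:
            coef //= factorial(c)
        prob = coef * prod(f ** c for f, c in zip(freqs, counts))
        # SP-score: sum of sim over all pairs of sequences in the column.
        score = sum(counts[a] * (counts[a] - 1) // 2 * sim[a][a] for a in range(k))
        score += sum(counts[a] * counts[b] * sim[a][b]
                     for a, b in combinations(range(k), 2))
        # Different combinations with the same score pool their probability.
        dist[score] += prob
    return dict(dist)


def motif_distribution(p1, length):
    """Step 2: distribution P_l for `length` independent columns by convolution."""
    dist = {0: 1.0}  # distribution for zero columns
    for _ in range(length):
        nxt = defaultdict(float)
        for x, px in dist.items():
            for y, py in p1.items():
                nxt[x + y] += px * py
        dist = dict(nxt)
    return dist


def tail_probability(dist, s):
    """Step 3: probability of a score greater than or equal to s."""
    return sum(p for x, p in dist.items() if x >= s)


def e_value(dist, s, seq_lengths, length):
    """Step 4: search space size times the tail probability."""
    space = prod(n - length + 1 for n in seq_lengths)
    return space * tail_probability(dist, s)


# Toy input (assumed values): uniform DNA background, match = 1, mismatch = 0.
freqs = [0.25] * 4
sim = [[1 if a == b else 0 for b in range(4)] for a in range(4)]
p1 = column_distribution(freqs, sim, n_seqs=2)
p2 = motif_distribution(p1, length=2)
print(e_value(p2, s=2, seq_lengths=[10, 10], length=2))  # 81 * 0.0625 = 5.0625
```

In the toy input, two sequences of length 10 give a search space of 9 × 9 = 81 placements for a motif of length 2, and both columns match with probability 0.25 × 0.25, so the e-value of a perfect score is 81 × 0.0625. Storing each distribution as a dictionary keyed by integer score keeps the convolution exact over the scores that actually occur.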