3.7 Numerical results
In this section, we apply the results presented above to a practical local score study. We consider the complete protein database of Swissprot release 47.8 and the classical amino acid hydrophobic scale of Kyte-Doolittle given in table 1 ([10]). The database contains roughly 200 000 sequences of various lengths (empiric distribution given in figure 1).
Figure 1  Empiric distribution of Swissprot (release 47.8) protein lengths. In order to improve readability, 0.5% of sequences with length ∈ [2 000, 9 000] have been removed from this histogram. Once the best scoring segment has been determined for each of these sequences, we need to compute the corresponding p-values. According to [9], the asymptotic distribution of Hn is given (if mean score is < 0, which is precisely the case here) by the following conservative approximation:
ℙ(Hn ≥ a) ≃ 1 - exp (-nKe-aλ)     (21)
where constants λ and K depend on the scoring distribution.
With our hydrophobic scale and a distribution of amino-acids estimated on the entire database we get
λ = 5.144775 × 10-3     and     K = 1.614858 × 10-2
(computation performed with a C function implemented by Altschul). Once the constants are computed we could get all the approximated p-values very quickly (a few seconds for the 200 000 p-values).
On the other hand, our new algorithm allows to compute (for the very first time) the exact p-values for this example. As the chosen scoring function has a one digit precision level, we need to use a scale factor of M = 10 to fall back to the integer case. A C++ implementation (available on request) performed all the computations in roughly three hours on a Pentium 4 CPU 2.8 GHz (this means approximately 20 p-values computed by second).
We can see on figure 2 the comparison between exact values and Karlin's approximations. The conservative design of the approximations seems to be successful except for very short unsignificant sequences. While the approximations are rather close to perfection for sequences with more than 2 000 amino-acids, the smaller the sequence is, the worse the approximations get. This is obviously consistent with the asymptotic nature of Karlin's formula but seems to indicate that these approximations are not reliable for 99.5% of the sequence in the database (protein of length < 2 000).
Figure 2  Exact p-value against Karlin ones (in log scale). Color refers to a range of sequence lengths: smaller than 100 in black (≃ 20 000 sequences), between 100 and 200 in red (≃ 40 000 sequences), between 200 and 500 in orange (≃ 90 000 sequences), between 500 and 1000 in yellow (≃ 30 000 sequences), between 1000 and 2 000 in blue (≃ 6 000 sequences) and greater than 2 000 in green (≃ 1 000 sequences). The solid line represents y = x. Range have been chosen for readability and few dots with exact p-value smaller than 10-30 are hence missing. One should object that it exists ([1,2]) a well known finite size correction to formula (21) that might be useful, especially when considering short sequences. Unfortunately in our case, this correction does not seems to improve the quality of the approximations (data not shown) and we hence make the choice to ignore it.
In table 2 we compare the number of sequences predicted to have a significant hydrophobic segment at a certain e-value level by the two approaches. If the Karlin's approximations are used, many proteins are considered unsignificant while they are. For example, with the classical database threshold of 10-5, only few sequences (6%) are correctly identified by Karlin's approximations.
Table 2  Number of e-value smaller than a threshold are given for exact computations (exact) and asymptotic Karlin's approximations (Karlin). The last row gives the accuracy of asymptotic predictions (accuracy = Karlin/exact).
e-value  10-1  10-2  10-3  10-4  10-5  10-6
exact  9473  7772  6271  4563  3232  2348
Karlin  3417  2047  1056  439  195  96
accuracy  34%  26%  17%  10%  6%  4% We have seen that Karlin's approximations are often far too conservative to give accurate results, but what about the ranking ? Table 3 proposes the Kendall's tau rank correlation (see [16] chapter 14.6 for more details) which is equal to 1.0 for a complete rank agreement and equal to -1.0 for a complete inverse rank agreement. As we will certainly be interested in the most significant sequences produced by our study, we compute our Kendall's tau only on these sequences. When all sequence lengths are considered, Karlin's approximations show their total irrelevance to give correct ranking for the first 10 or 50 most significant p-values. Even when the 100 first p-values are taken into account, relative ranks given by Karlin's approximations are wrong in 63% of the cases, which is huge. However, in the case where the approximations values are close to the exact ones (sequence lengths greater than 2 000, which correspond only to 0.5% of the database), p-values obtained with both methods are highly correlated.
Table 3  Kendall's tau (rank correlation) comparing the most significant exact p-values (the reference) to the Karlin's approximations. The column "all" gives the result for all sequences while the Ri give the results for a certain range of sequence lengths: smaller than 100 for R1, between 100 and 200 for R2, between 200 and 500 for R3, between 500 and 1 000 for R4, between 1 000 and 2 000 for R5 and greater than 2 000 for R6.
number of p-values  all  R1  R2  R3  R4  R5  R6
10  0.30  0.64  0.24  -0.20  0.58  0.64  0.97
50  0.14  0.73  0.50  0.46  0.56  0.78  0.97
100  0.37  0.70  0.67  0.62  0.61  0.80  0.98