P-values for multidimensional scaling
For the analysis of codon usage tables we developed a similarity measure derived from the well-known chi-square test for the comparison of two distributions. Unlike the classical chi-square test, we do not decide whether two distributions are equal; instead we use the corresponding P-values only to compute a similarity measure for the underlying codon usage tables. For each pair of genes we compare the codon distributions on the basis of the codon frequencies in the two genes, and we obtain a similarity score by averaging the P-values of the amino acid specific chi-square tests. We start with the counts $n_{ij}^{l}$ for codon $l$ of amino acid $a_i$ in the $j$-th gene. These counts sum up to $n_{ij} = \sum_{l=1}^{L_i} n_{ij}^{l}$ over the number $L_i$ of different codons for amino acid $a_i$. Note that $n_{ij}$ corresponds to the number of occurrences of amino acid $a_i$ in gene $j$. With these counts we compute the chi-square statistic for each pair $(j, k)$ of genes:
$$\chi^2_{ijk} = \sum_{l=1}^{L_i} \frac{\left(\sqrt{n_{ik}/n_{ij}}\; n_{ij}^{l} - \sqrt{n_{ij}/n_{ik}}\; n_{ik}^{l}\right)^2}{n_{ij}^{l} + n_{ik}^{l}} \qquad (1)$$
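As an illustration, the statistic for a single amino acid and a single pair of genes can be sketched in a few lines of Python. The sketch assumes the usual two-sample form for binned data with unequal totals, in which each count is rescaled by the square root of the ratio of the two totals. The function name `chi_square_statistic`, the handling of empty bins (skipped) and of absent amino acids (an error), and the example counts are assumptions made here for illustration.

```python
import math


def chi_square_statistic(counts_j, counts_k):
    """Two-sample chi-square statistic for one amino acid.

    counts_j, counts_k: codon counts of the same amino acid in gene j and
    gene k, one entry per synonymous codon (length L_i).
    """
    if len(counts_j) != len(counts_k):
        raise ValueError("both genes need one count per synonymous codon")
    n_j = sum(counts_j)  # occurrences of the amino acid in gene j
    n_k = sum(counts_k)  # occurrences of the amino acid in gene k
    if n_j == 0 or n_k == 0:
        # Assumption: the statistic is left undefined if the amino acid
        # is absent from one of the genes.
        raise ValueError("amino acid does not occur in one of the genes")
    scale_j = math.sqrt(n_k / n_j)
    scale_k = math.sqrt(n_j / n_k)
    chi2 = 0.0
    for c_j, c_k in zip(counts_j, counts_k):
        if c_j + c_k == 0:
            continue  # assumption: codons unused in both genes are skipped
        chi2 += (scale_j * c_j - scale_k * c_k) ** 2 / (c_j + c_k)
    return chi2


# Hypothetical counts for a six-codon amino acid in two genes.
example_chi2 = chi_square_statistic([12, 3, 5, 0, 8, 2], [10, 4, 6, 1, 7, 2])
```

Because of the rescaling, two genes with proportional codon counts (for example `[10, 10]` and `[5, 5]`) yield a statistic of zero even though their totals differ.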
The classical chi-square test for the comparison of two distributions is based on the following proposition: under the null hypothesis that the corresponding samples were drawn from the same probability distribution, the variable $\chi^2_{ijk}$ is asymptotically chi-square distributed with $L_i$ degrees of freedom. Here we do not perform a chi-square test, but rather calculate the P-value $P_{ijk}$ associated with the chi-square statistic $\chi^2_{ijk}$. The P-values are obtained from the chi-square probability function, which is an incomplete gamma function [18]. A small value of $P_{ijk}$ indicates a significant difference between the codon distributions of genes $j$ and $k$ with respect to amino acid $a_i$. For a number of $M$ genes in a genome we then assemble the $M \times M$ matrix $S$ of similarity scores with non-negative elements
$$S_{jk} = \frac{1}{n_a} \sum_{i=1}^{n_a} P_{ijk} \qquad (2)$$
where $n_a$ is the number of amino acids. Note that $S$ has unit diagonal elements, i.e. $S_{jj} = 1$, because the P-value for tables with identical counts is one. Since every element is an average of P-values, all off-diagonal elements lie in the range [0, 1].
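The step from chi-square statistics to a similarity score can be sketched as follows. The sketch assumes the standard relation between the chi-square probability function and the regularized upper incomplete gamma function, P = Q(L_i / 2, chi2 / 2), and evaluates Q with a textbook series and continued-fraction scheme. The names `gamma_q`, `chi2_p_value` and `similarity_score` are hypothetical, and the tolerances are arbitrary choices rather than values from our implementation.

```python
import math


def gamma_q(a, x, eps=1e-12, max_iter=1000):
    """Regularized upper incomplete gamma function Q(a, x) for a > 0."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of P(a, x); return Q = 1 - P.
        ap = a
        term = 1.0 / a
        total = term
        for _ in range(max_iter):
            ap += 1.0
            term *= x / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)


def chi2_p_value(chi2, dof):
    """P-value of a chi-square statistic with `dof` degrees of freedom."""
    return gamma_q(0.5 * dof, 0.5 * chi2)


def similarity_score(chi2_values, dofs):
    """Average P-value over amino acids for one pair of genes.

    chi2_values: one chi-square statistic per amino acid.
    dofs: the matching degrees of freedom (number of synonymous codons L_i).
    """
    p_values = [chi2_p_value(c, d) for c, d in zip(chi2_values, dofs)]
    return sum(p_values) / len(p_values)


# Identical codon tables give chi2 = 0 for every amino acid, hence a score of 1.
score_identical = similarity_score([0.0, 0.0, 0.0], [2, 4, 6])
```

Evaluating `similarity_score` for every pair of genes fills the symmetric matrix of similarity scores; identical tables reproduce the unit diagonal noted above.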
In order to derive a suitable low-dimensional point representation of genes we apply classical multidimensional scaling (see e.g. [19]) to the above similarities. The objective is to find a two-dimensional point configuration with interpoint distances reflecting the codon usage similarities of the corresponding genes. To perform classical scaling based on similarities we first transform the similarity matrix S into a positive semi-definite matrix C by subtracting the smallest eigenvalue λmin of S from all of its diagonal elements:
C = S - λminI     (3)
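where $I$ is the $M \times M$ identity matrix. Note that this transformation preserves the equality of diagonal elements. With the $M \times M$ centering matrix $H$ with elements
$$H_{jk} = \delta_{jk} - \frac{1}{M} \qquad (4)$$
where $\delta_{jk}$ is the Kronecker delta, we finally obtain the matrix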
B = HCH.     (5)
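The eigenvalue shift and the double centering can be sketched with NumPy as follows. The sketch assumes the usual centering matrix, i.e. the identity minus 1/M in every entry; the function name `shift_and_center` and the example similarity values are made up for illustration.

```python
import numpy as np


def shift_and_center(S):
    """Return (C, B) with C = S - lambda_min * I and B = H C H."""
    S = np.asarray(S, dtype=float)
    M = S.shape[0]
    lam_min = np.linalg.eigvalsh(S)[0]        # eigvalsh sorts eigenvalues ascending
    C = S - lam_min * np.eye(M)               # smallest eigenvalue of C is zero
    H = np.eye(M) - np.full((M, M), 1.0 / M)  # assumed form of the centering matrix
    B = H @ C @ H
    return C, B


# Hypothetical similarity matrix for four genes (made-up values).
S_example = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
])
C_example, B_example = shift_and_center(S_example)
```

In this sketch the diagonal of `C_example` stays constant, its smallest eigenvalue is zero up to rounding, and the rows and columns of `B_example` sum to zero.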
It can be shown that for a positive semi-definite matrix $C$ the distance matrix $D$ with elements obtained by the standard transformation $d_{jk} = \sqrt{C_{jj} - 2C_{jk} + C_{kk}}$ is Euclidean and $B$ is a centered inner product matrix ([19], p. 402). Therefore principal components can be obtained from a (partial) eigenvalue decomposition of $B$. For 2D visualization we thus compute the two leading eigenvectors $x_1$ and $x_2$ of $B$, associated with the largest and second largest eigenvalue, respectively. The $M$ components of $x_1$ and $x_2$ provide the two coordinates of each of the $M$ genes, which are used for scatter plot visualization.
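A minimal NumPy sketch of this last step is given below. It assumes the standard transformation in its usual squared form, d_jk^2 = C_jj - 2 C_jk + C_kk, checks numerically that B equals -1/2 H D^2 H (with D^2 the matrix of squared distances), and then returns the two leading eigenvectors unscaled, as described in the text. The function name `embed_2d` and the example matrix are hypothetical.

```python
import numpy as np


def embed_2d(S):
    """Two-dimensional coordinates from a similarity matrix S.

    Returns (coords, leading_eigenvalues): coords has one row per gene and
    the columns x1, x2; leading_eigenvalues holds the two largest
    eigenvalues of B in descending order.
    """
    S = np.asarray(S, dtype=float)
    M = S.shape[0]
    C = S - np.linalg.eigvalsh(S)[0] * np.eye(M)  # shift to positive semi-definite
    H = np.eye(M) - 1.0 / M                       # assumed centering matrix
    B = H @ C @ H

    # Squared distances from the (assumed) standard transformation.
    diag = np.diag(C)
    D2 = diag[:, None] + diag[None, :] - 2.0 * C
    # B is the centered inner product matrix belonging to these distances.
    if not np.allclose(B, -0.5 * H @ D2 @ H):
        raise RuntimeError("double centering check failed")

    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues ascending
    coords = np.column_stack([eigvecs[:, -1], eigvecs[:, -2]])
    return coords, eigvals[::-1][:2]


# Hypothetical similarity matrix for four genes (made-up values).
S_demo = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
])
coords_demo, eigvals_demo = embed_2d(S_demo)
```

The rows of `coords_demo` can be passed directly to a scatter plot. Note that the signs of eigenvectors are arbitrary, so the plot is only determined up to reflection of each axis; multiplying each column by the square root of its eigenvalue would give the scaling more commonly used in classical scaling, which changes the aspect ratio but not the ordering of genes along each axis.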