The best subset of features is then chosen to maximize the fit of a multiple linear regression model. The set of features most strongly correlated with performance consists of the relative entropy of the binding site alignment, the position distribution of the binding sites in the sequences, and the length of the individual sequences in the dataset. The result is shown in Figure 5(a). The adjusted coefficient of determination is about 68%, with a p-value less than 0.001. The plot of regression residuals versus the estimated response (Figure 5(b)) does not indicate evident inequality of variance (heteroscedasticity). Equal variance is an important assumption of linear regression, which requires that the variance of the residuals not depend on the expected value of Y.
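The selection step can be illustrated with a minimal sketch. It assumes an exhaustive search over feature subsets scored by adjusted R², which is one common way to do best-subset selection. The text does not state the exact search procedure or criterion used, so this is an illustration and not the authors' implementation. The feature names `f0` to `f4`, the helper functions, and the synthetic data are all hypothetical.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return adjusted R^2."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta                          # regression residuals
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Penalize the number of predictors p relative to sample size n.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def best_subset(X, y, names):
    """Exhaustively score every non-empty feature subset (assumed criterion)."""
    best_score, best_names = -np.inf, ()
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            score = adjusted_r2(X[:, list(cols)], y)
            if score > best_score:
                best_score = score
                best_names = tuple(names[c] for c in cols)
    return best_score, best_names


# Synthetic data for illustration only (NOT the dataset from the text):
# the response depends on f0 and f2; the other features are pure noise.
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)
names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical feature names

best_score, best_names = best_subset(X, y, names)
print(best_names, round(best_score, 3))
```

Exhaustive search is only practical for a small number of candidate features, since the number of subsets doubles with each added feature. With more candidates, a stepwise or regularized procedure would be a more typical choice.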