PMC:4331676 / 21290-27487
Annotations
2_test
{"project":"2_test","denotations":[{"id":"25708928-23029559-14842653","span":{"begin":156,"end":158},"obj":"23029559"},{"id":"25708928-12967962-14842654","span":{"begin":159,"end":161},"obj":"12967962"},{"id":"25708928-16551468-14842655","span":{"begin":1809,"end":1811},"obj":"16551468"},{"id":"25708928-9687024-14842656","span":{"begin":1960,"end":1962},"obj":"9687024"},{"id":"25708928-19865164-14842657","span":{"begin":2395,"end":2397},"obj":"19865164"},{"id":"25708928-15356290-14842658","span":{"begin":2732,"end":2734},"obj":"15356290"}],"text":"Feature analysis\nTo further investigate the importance of the features and reveal the biological meaning of the features in PSSM-DT, we followed the study [50,70,71] to calculate the discriminant weight vector in the feature space. The sequence-specific weights obtained from the SVM training process can be used to calculate the discriminant weight of each feature, which measures the importance of that feature. Given the weight vector of the N training samples obtained from the kernel-based training, A = [a1, a2, a3,...,aN], the feature discriminant weight vector W in the feature space can be calculated by the following equation:\n(12) $W = A \\cdot M = \\begin{bmatrix} a_1 \\\\ a_2 \\\\ \\vdots \\\\ a_N \\end{bmatrix}^{T} \\begin{bmatrix} m_{11} & m_{12} & \\cdots & m_{1j} \\\\ m_{21} & m_{22} & \\cdots & m_{2j} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ m_{N1} & m_{N2} & \\cdots & m_{Nj} \\end{bmatrix}$\nwhere M is the matrix of sequence representations in PSSM-DT; A is the weight vector of the training samples; N is the number of training samples; j is the dimension of the feature vector. Each element of W represents the discriminative power of the corresponding feature.\nIn this study, we are only interested in the descriptors frequently occurring in positive samples (DNA-binding proteins). Therefore, the discriminant weight of an amino acid pair is calculated as the quadratic sum of the discriminant weights of the corresponding descriptors with positive discriminant weight for this amino acid pair. The discriminant weights of all the 400 amino acid pairs in PSSM-DT are depicted in Figure 2A. 
According to this figure, the top four most discriminative amino acid pairs are (R, R), (R, P), (P, R) and (A, R), which indicates that the amino acids R (Arg) and A (Ala) are important for identifying DNA-protein interactions. This conclusion is consistent with Szilágyi and Skolnick's study [34], in which they found that the percentages of Arg, Ala, Gly, Lys and Asp are useful for the identification of DNA-binding proteins. Sieber and Allemann [72] found that R(348) cannot directly interact with the nucleobases, but can determine the DNA-binding specificity of the basic helix-loop-helix (BHLH) protein E12 by directly interacting with both the phosphate backbone and the carboxylate of E(345), thereby locking the side chain conformation of E(345). Moreover, by comprehensively analyzing the three-dimensional structures of protein-DNA complexes, Rohs and West et al. [73] demonstrated that the binding of R to narrow minor grooves is a widely used mode for protein-DNA recognition, indicating that R is an important component in protein-DNA binding activity. It has been previously reported that DNA is usually enveloped in a negative electrostatic potential, whereas the amino acid R carries a positive charge [12], which explains why the amino acid R is important for DNA-binding protein identification.\nFigure 2 Feature analysis on protein 1AKH chain A. (A) The discriminant weights of the 400 amino acid pairs. Each element in the figure refers to the quadratic sum of the discriminant weights of descriptors with positive discriminant weight for a certain amino acid pair. An amino acid pair is identified by two amino acids; the x-axis and y-axis represent its second and first amino acid, respectively. (B) The discriminant weights of the descriptors with different lg values for the top four most discriminative amino acid pairs, including pair(R,R), pair(R,P), pair(P,R) and pair(A,R). 
(C) The occurrence distributions of the descriptors for the top four most discriminative amino acid pairs in the DNA-binding and non-DNA-binding regions of protein 1AKH chain A. The regions in green are non-DNA-binding regions and the region in grey is the DNA-binding region. (D) The occurrence distributions of the descriptors for the top four most discriminative amino acid pairs on the three-dimensional structure of protein 1AKH chain A. The green sections are the three-dimensional structure of the protein and the brown sections are the three-dimensional structure of the DNA.\nThe discriminant weights of the descriptors for pairs (R, R), (R, P), (P, R) and (A, R) with different lg values are shown in Figure 2B. As indicated by the figure, the descriptor with lg of 4 for pair (R, R) has the highest discriminant power. For pairs (R, P) and (P, R), the discriminant weights of all descriptors are slightly different. In the case of pair (A, R), the descriptor with lg of 5 is the most discriminative feature. In conclusion, for an amino acid pair, the distance between the two amino acids along the sequence can affect its discriminant power in DNA-binding protein identification.\nAdditionally, we take protein 1AKH [PDB:1AKH] chain A as an example to show the applicability of the PSSM-DT based protein representation to DNA-binding protein identification. 1AKH is known as the MATa1/MATα2 homeodomain heterodimer and its chain A is the yeast mating-type transcription factor MATa1. MATa1 proteins are members of the homeodomain superfamily of DNA-binding proteins and contact the DNA with their homeodomain. The homeodomain always folds into a compact three-helix domain containing a helix-turn-helix DNA-binding motif. Figure 2C shows the distributions of the descriptors for the top four most discriminative pairs along the sequence of the MATa1 protein. 
From this figure we can see that there are 5 occurrences of these descriptors in the DNA-binding region and none in the non-DNA-binding regions. The five descriptors occurring in the DNA-binding region are pair(R, R) with lg of 1, pair(R, R) with lg of 3, pair(P, R) with lg of 2, pair(P, R) with lg of 3 and pair(A, R) with lg of 1. This is further confirmed by the three-dimensional structure shown in Figure 2D. As indicated by the figure, no descriptor for the top four most discriminative amino acid pairs occurs in the non-DNA-binding regions, and all five occurrences are within the single DNA-binding region. Furthermore, the figure shows that pair(R, R) with lg of 1 and pair(P, R) with lg of 3 are very close to the DNA in the three-dimensional structure, indicating that these two descriptors are very discriminative for DNA-protein interaction."}