Introduction
DNA-binding proteins are pivotal to cell functions such as DNA replication, transcriptional regulation, packaging, recombination, DNA repair, DNA modification and other fundamental activities associated with DNA. For example, in eukaryotic cells, histones, a typical type of DNA-binding protein, help package chromosomal DNA into a compact structure. Restriction enzymes, another typical type of DNA-binding protein, are DNA-cutting enzymes found in bacteria that recognize and cut DNA only at a particular nucleotide sequence, serving a host-defense role. DNA-binding proteins represent a broad category of proteins, known to be highly diverse in sequence and structure. Structurally, they have been divided into eight structural groups, which were further classified into 54 protein structural families [1,2]. Functionally, protein-DNA interactions play various roles across the entire genome, as mentioned above [3]. The past decade has witnessed tremendous progress in genome sequencing [4-7]. According to the Genomes OnLine Database, the completely sequenced genomes of almost 1000 cellular organisms have been released, and about 5000 active genome sequencing projects are under way [8,9]. This unprecedented amount of genetic information has provided hundreds of thousands of protein sequences [10], which poses the challenging problem of elucidating their functions.
At present, several experimental techniques are used to identify DNA-binding proteins, such as filter binding assays, genetic analysis, chromatin immunoprecipitation on microarrays, and X-ray crystallography. However, these experimental approaches are costly and time-consuming. It is therefore highly desirable to develop computational approaches that can automatically determine whether a novel sequence binds to DNA, and the reliable identification of DNA-binding proteins with effective computational approaches is an important research topic in proteomics. Many computational prediction methods have been proposed in the literature. They form a broad category of methods that are highly diverse in both the classifiers and the protein representations they use.
In terms of classifiers, the computational methods can be divided into template-based and machine-learning-based methods, depending on how they use the information from the putative DNA-binding proteins. Template-based methods fall into two classes. The first uses a structural comparison protocol to detect significant structural similarity, at the level of either the domain or the structural motif, between the query and a template known to bind DNA, and thereby assesses the DNA-binding preference of the target sequence [11,12]. The second uses a sequence comparison protocol (such as PSI-BLAST) to detect significant sequence similarity between the query and a template known to bind DNA, and thereby evaluates the DNA-binding preference of the target sequence [13]. Machine-learning-based methods do not perform direct structural comparison, but typically follow a machine-learning framework. To obtain a good predictive model, various machine-learning algorithms have been employed to construct classification models, such as support vector machines (SVM) [14-17], neural networks [18-22], random forests [23], naïve Bayes classifiers [24,25], nearest neighbor [26] and ensemble classifiers [27-29].
In the task of computational protein function prediction, there are two major problems: the choice of the classification algorithm and the choice of the protein representation. Depending on the choice of protein representation, these computational predictive methods can be classified into two categories: i) analysis from protein structure [19,20,28,30] and ii) prediction from amino acid sequence [11,21,31-33]. Among structure-based prediction methods, Stawiski et al. [19] compared positively charged patches on the surface of putative DNA-binding proteins with those on non-DNA-binding proteins. They used 12 features, including the patch size, hydrogen-bonding potential, the fraction of evolutionarily conserved positively charged residues and other properties of the protein, to train a neural network (NN) for identifying DNA-binding proteins. Ahmad and Sarai [20] trained an NN classifier using three features: the net charge, electric dipole moment and quadrupole moment of the protein. Bhardwaj et al. [15] examined the sizes of positively charged patches on the surface of putative DNA-binding proteins and based their SVM classifier on the protein's overall charge and its overall and surface amino acid composition. Szilágyi and Skolnick [34] trained a logistic regression classifier using the amino acid composition, the asymmetry of the spatial distribution of specific residues and the dipole moment of the protein. Nimrod, Szilágyi et al. [23] more recently developed a random forest classifier based on the electrostatic potential, cluster-based amino acid conservation patterns and the secondary structure content of the patches, as well as features of the whole protein, including its dipole moment. Because negative samples far outnumber real DNA-binding proteins, this is an imbalanced binary classification problem from a machine-learning point of view. Song et al. [35] employed an ensemble classifier [36] to address this problem and improved the identification. Several methods that consider sequence-order effects have also been proposed, and the experimental results showed that this information can improve predictive performance [37,38].
Structure-based prediction methods are usually more accurate, but they cannot be used for high-throughput annotation because they require a high-resolution 3D structure of the query protein. To date, many computational methods have been proposed for identifying DNA-binding proteins directly from their amino acid sequences. Four categories of protein sequence features and three kinds of sequence encoding methods have been proposed [31,39-41]. The four categories of features are composition information, structural and functional information, physicochemical properties and evolutionary information. The three kinds of encoding methods are overall composition-transition-distribution (OCTD, a global method), auto-cross covariance (ACC) transformation (a nonlocal method) and split amino acid (SAA) transformation (a local method). A comprehensive survey of these methods can be found in related work [42-44]. However, most present encoding methods provide limited information for explaining the mechanisms of DNA-protein interactions. It is desirable to explore a novel encoding method that can reveal the binding mechanism of DNA-protein interactions.
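As an illustration of how a variable-length sequence can be mapped to a fixed-length vector, the following minimal Python sketch computes a global feature (amino acid composition) and a nonlocal feature (auto-covariance of a single physicochemical property over a range of lags). It is not the exact OCTD or ACC formulation of [31,39-41]; the function names, the choice of the Kyte-Doolittle hydrophobicity scale and the normalization are assumptions made for illustration only.

```python
# Illustrative sketch only: names, scale choice and normalization are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydrophobicity values, used here as one example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(sequence):
    """Global feature: fraction of each of the 20 standard amino acids."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence:
        if residue in counts:  # skip non-standard symbols such as 'X'
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def auto_covariance(sequence, scale, max_lag):
    """Nonlocal feature: covariance of one property between residues
    separated by lag = 1..max_lag along the sequence."""
    values = [scale[r] for r in sequence if r in scale]
    n = len(values)
    if n == 0:
        return [0.0] * max_lag
    mean = sum(values) / n
    centered = [v - mean for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        if lag >= n:  # sequence too short for this lag
            features.append(0.0)
            continue
        total = sum(centered[j] * centered[j + lag] for j in range(n - lag))
        features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    seq = "MKRLSTEEAVKKLG"  # hypothetical sequence, not taken from any dataset
    vector = composition(seq) + auto_covariance(seq, KYTE_DOOLITTLE, max_lag=3)
    print(len(vector))  # 20 composition values + 3 lag values
```

Whatever the sequence length, the resulting vector has 20 + `max_lag` entries, which is what allows sequences of different lengths to be fed to the same classifier.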
In the current study, to further improve prediction accuracy and to better understand the binding mechanism of DNA-protein interactions, we present a novel encoding method called PSSM distance transformation (PSSM-DT), which transforms the position-specific scoring matrix (PSSM) profiles of query sequences into uniform numeric representations. We then construct a DNA-binding protein identification method, SVM-PSSM-DT, by combining PSSM-DT with an SVM. Benchmark and independent tests showed that PSSM-DT is a promising protein encoding method.
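The exact definition of PSSM-DT is not given in this section. The sketch below only illustrates the general idea of converting an L × 20 profile, whose number of rows varies with sequence length, into a vector of fixed length by averaging products of scores at positions separated by a given distance. The function name `lagged_profile_features`, the averaging and the ordering of column pairs are assumptions and should not be read as the definition used in this study.

```python
# Illustrative sketch only: not the authors' definition of PSSM-DT.
def lagged_profile_features(profile, max_lag):
    """Map a profile (list of rows, one row of scores per sequence position)
    to a fixed-length vector.

    For each distance lag = 1..max_lag and each ordered pair of columns
    (a, b), the feature is the mean over positions j of
    profile[j][a] * profile[j + lag][b].
    """
    n_rows = len(profile)
    n_cols = len(profile[0]) if profile else 0
    features = []
    for lag in range(1, max_lag + 1):
        span = n_rows - lag  # number of position pairs at this distance
        for a in range(n_cols):
            for b in range(n_cols):
                if span <= 0:  # profile too short for this distance
                    features.append(0.0)
                    continue
                total = sum(
                    profile[j][a] * profile[j + lag][b] for j in range(span)
                )
                features.append(total / span)
    return features


if __name__ == "__main__":
    # Toy profile with 3 positions and 2 columns (a real PSSM has 20 columns).
    toy = [[1, 2], [3, 4], [5, 6]]
    print(lagged_profile_features(toy, 1))  # [9.0, 11.0, 13.0, 16.0]
```

With a 20-column profile, this sketch yields 400 values per distance regardless of the number of rows, so profiles of different lengths produce vectors of the same size that can be passed to an SVM.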