{"project":"2_test","denotations":[{"id":"25708928-23303794-14842648","span":{"begin":53,"end":55},"obj":"23303794"},{"id":"25708928-23144709-14842648","span":{"begin":53,"end":55},"obj":"23144709"},{"id":"25708928-23395824-14842648","span":{"begin":53,"end":55},"obj":"23395824"},{"id":"25708928-24109555-14842648","span":{"begin":53,"end":55},"obj":"24109555"},{"id":"25708928-24318998-14842648","span":{"begin":53,"end":55},"obj":"24318998"},{"id":"25708928-23029559-14842648","span":{"begin":53,"end":55},"obj":"23029559"},{"id":"25708928-19646926-14842648","span":{"begin":53,"end":55},"obj":"19646926"},{"id":"25708928-19925685-14842648","span":{"begin":53,"end":55},"obj":"19925685"},{"id":"25708928-20955175-14842648","span":{"begin":53,"end":55},"obj":"20955175"},{"id":"25708928-24027761-14842648","span":{"begin":53,"end":55},"obj":"24027761"},{"id":"25708928-23756733-14842648","span":{"begin":53,"end":55},"obj":"23756733"},{"id":"25708928-25016190-14842648","span":{"begin":53,"end":55},"obj":"25016190"},{"id":"25708928-24564479-14842648","span":{"begin":53,"end":55},"obj":"24564479"},{"id":"25708928-17237066-14842649","span":{"begin":2344,"end":2346},"obj":"17237066"},{"id":"25708928-19908123-14842649","span":{"begin":2344,"end":2346},"obj":"19908123"},{"id":"25708928-22272076-14842649","span":{"begin":2344,"end":2346},"obj":"22272076"},{"id":"25708928-11452024-14842650","span":{"begin":2481,"end":2483},"obj":"11452024"},{"id":"25708928-16284202-14842651","span":{"begin":7027,"end":7029},"obj":"16284202"},{"id":"25708928-24931825-14842652","span":{"begin":7030,"end":7032},"obj":"24931825"},{"id":"25708928-24504871-14842652","span":{"begin":7030,"end":7032},"obj":"24504871"},{"id":"25708928-19046430-14842652","span":{"begin":7030,"end":7032},"obj":"19046430"}],"text":"Methods\nAs shown by a series of recent publications [45-59] and summarized in a comprehensive review, to develop a useful statistical prediction method or model for a biological system, one needs to engage the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) construct a web-server for the prediction method. Below, we describe our proposed method followed such a general procedure.\n\nDataset\nTo construct a high quality benchmark dataset, only experimentally confirmed data were collected. The benchmark dataset S can be formulated as\n(1) S = S + ∪ S -\nwhere the subset S+ contains 525 DNA-binding proteins, the subset S- consists of 550 non DNA-binding proteins and the symbol ∪ represents the \"union\" in the set theory. The benchmark dataset was obtained according to the following procedure. (1) Extract DNA-binding protein sequences from Protein Data Bank (PDB) released at December 2013 by searching the mmCIF keyword of 'DNA binding protein' through the advanced search interface. (2) Remove the sequences with length of less than 50 amino acid residues and character of 'X'. (3) Utilize PISCES to cutoff those sequences that have \u003e= 25% pairwise sequence identity to any other in the same subset. Thus the subset S+ consisting 525 sequences is obtained. 
Position Specific Scoring Matrix
Evolutionary information, one of the most important kinds of information for protein function annotation in biological analysis, has been widely used in many studies [21,60-63]. In this study, the evolutionary information, in the form of a PSSM profile for each protein sequence, is obtained by running the PSI-BLAST [64] program against the non-redundant (NR) database for three iterations with an E-value cutoff of 0.001 for the multiple sequence alignment. The final PSSM profile is a matrix of dimension $L \times 20$ (excluding the dummy residue X), which can be depicted as follows:

(2) $\mathrm{PSSM} = \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,20} \\ S_{2,1} & S_{2,2} & \cdots & S_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ S_{L,1} & S_{L,2} & \cdots & S_{L,20} \end{bmatrix}$

where $L$ is the length of the protein and $S_{i,j}$ represents the occurrence probability of amino acid $j$ at position $i$ of the protein sequence; the rows of the matrix correspond to the positions of the sequence and the columns to the 20 standard amino acids. PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid occurs more frequently in the alignment than expected by chance, while negative scores indicate that it occurs less frequently than expected. Large positive scores often indicate critical functional residues, such as active-site residues or residues required for other intermolecular interactions. The elements of the PSSM profile can therefore be used to approximately measure the occurrence probability of the corresponding amino acid at a specific position.
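As an illustration of this step, the following Python sketch runs the BLAST+ psiblast program with the settings stated above (three iterations, E-value cutoff 0.001) and parses the resulting ASCII PSSM into an L x 20 log-odds matrix. It assumes that the psiblast binary and a locally formatted NR database are available; the file paths are placeholders.

import subprocess

def run_psiblast(fasta_path, pssm_path, db="nr"):
    """Generate a PSSM profile for one query sequence with PSI-BLAST."""
    subprocess.run(
        ["psiblast", "-query", fasta_path, "-db", db,
         "-num_iterations", "3", "-evalue", "0.001",
         "-out_ascii_pssm", pssm_path],
        check=True,
    )

def parse_ascii_pssm(pssm_path):
    """Read the L x 20 log-odds matrix from a psiblast -out_ascii_pssm file."""
    matrix = []
    with open(pssm_path) as handle:
        for line in handle:
            fields = line.split()
            # Data rows start with the residue position and letter, followed by
            # 20 log-odds scores and 20 weighted observed percentages.
            if len(fields) >= 42 and fields[0].isdigit():
                matrix.append([int(x) for x in fields[2:22]])
    return matrix  # rows: positions 1..L; columns: the 20 amino acids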
PSSM distance transformation
It has been reported that dipeptides consisting of two residues separated by a distance along the sequence are important for protein function annotation [65]. Additionally, the PSSM score approximately measures how frequently an amino acid occurs at a given position of a sequence. Accordingly, we present a PSSM distance transformation (PSSM-DT) method to encode a feature vector representation from the PSSM information. PSSM-DT transforms the PSSM information into a uniform numeric representation by approximately measuring the occurrence probabilities of pairs of amino acids separated by a distance along the sequence. PSSM-DT yields two kinds of features: the PSSM distance transformation of pairs of the same amino acid (PSSM-SDT) and the PSSM distance transformation of pairs of different amino acids (PSSM-DDT). The PSSM-SDT features approximately measure the occurrence probabilities of pairs of the same amino acid separated by a distance $lg$ along the sequence, and can be calculated as

(3) $\mathrm{PSSM\mbox{-}SDT}(i, lg) = \sum_{j=1}^{L-lg} S_{i,j} \cdot S_{i,j+lg} \, / \, (L - lg)$

where $i$ is one type of amino acid, $L$ is the length of the sequence, and $S_{i,j}$ is the PSSM score of amino acid $i$ at position $j$. In this way, the number of PSSM-SDT features is $20 \times LG$, where $LG$ is the maximum value of $lg$ ($lg = 1, 2, \ldots, LG$).
The PSSM-DDT features approximately measure the occurrence probabilities of pairs of different amino acids separated by a distance $lg$ along the sequence, and can be calculated as

(4) $\mathrm{PSSM\mbox{-}DDT}(i_1, i_2, lg) = \sum_{j=1}^{L-lg} S_{i_1,j} \cdot S_{i_2,j+lg} \, / \, (L - lg)$

where $i_1$ and $i_2$ refer to two different types of amino acids. Similarly, the total number of PSSM-DDT features is $380 \times LG$ (20 × 19 ordered pairs of distinct amino acids).
PSSM-DT is the combination of the PSSM-SDT and PSSM-DDT features. A sequence can thus be transformed from its PSSM profile into a uniform feature vector with a fixed dimension of $400 \times LG$.
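The two transformations translate directly into code. The following Python sketch computes the PSSM-SDT and PSSM-DDT features of Eqs. (3) and (4) from an L x 20 PSSM matrix (as parsed above) and concatenates them into the 400 x LG PSSM-DT vector; it assumes L > LG, which holds here because every sequence has at least 50 residues.

def pssm_sdt(pssm, LG):
    """Eq. (3): same-amino-acid pairs, one feature per (type i, distance lg)."""
    L, features = len(pssm), []
    for i in range(20):
        for lg in range(1, LG + 1):
            total = sum(pssm[j][i] * pssm[j + lg][i] for j in range(L - lg))
            features.append(total / (L - lg))
    return features  # 20 * LG values

def pssm_ddt(pssm, LG):
    """Eq. (4): different-amino-acid pairs, one feature per (i1, i2, lg)."""
    L, features = len(pssm), []
    for i1 in range(20):
        for i2 in range(20):
            if i1 == i2:
                continue  # same-residue pairs are covered by PSSM-SDT
            for lg in range(1, LG + 1):
                total = sum(pssm[j][i1] * pssm[j + lg][i2] for j in range(L - lg))
                features.append(total / (L - lg))
    return features  # 380 * LG values

def pssm_dt(pssm, LG):
    """Concatenate both feature sets into the final 400 * LG vector."""
    return pssm_sdt(pssm, LG) + pssm_ddt(pssm, LG)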
Support vector machine
The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory, presented by Vapnik (1998) [66], which uses a non-linear transformation to map the input data into a high-dimensional feature space where linear classification is performed. Training an SVM is equivalent to solving the quadratic optimization problem

(5) $\min_{w, b, \xi_i} \; \frac{1}{2} w \cdot w + C \sum_i \xi_i$

(6) $\text{s.t.} \quad y_i (\phi(x_i) \cdot w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m$

where $x_i$ is a feature vector labeled by $y_i \in \{-1, +1\}$ and $C$, called the cost, is the penalty parameter of the error term. This model, called the soft-margin SVM, can tolerate noise in the data; it classifies an example by means of the separating hyperplane $f(x) = \phi(x) \cdot w + b = 0$. Solving the above model with the Lagrange multiplier method yields $w = \sum_j \alpha_j y_j \phi(x_j)$ and $w \cdot \phi(x_i) = \sum_j \alpha_j y_j \phi(x_j) \cdot \phi(x_i)$, which provides an efficient way to solve the SVM without explicit use of the non-linear transformation, where $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the kernel function. The application of SVMs to bioinformatics problems has been widely explored [15,67-69]. In this work, the publicly available LIBSVM, with the radial basis function (RBF) as the kernel function, is employed as the SVM implementation. The RBF kernel is defined as

(7) $K(x_i, x_j) = \exp(-\gamma \, \| x_i - x_j \|^2)$

In this study, the kernel parameter $\gamma$ and the penalty parameter $C$ were optimized by a grid search with 5-fold cross-validation on the benchmark dataset. The jackknife test is used to calculate the evaluation criteria: for a dataset with $N$ sequences, one sequence at a time is held out as the test sequence while the remaining sequences serve as the training dataset, and this process is repeated until each sequence has been tested exactly once. The average performance over all rounds is taken as the final result.

Evaluation metrics
Sensitivity (SN), Specificity (SP), Accuracy (ACC), the Matthews correlation coefficient (MCC), the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are employed in this work. All of these measurements were calculated under jackknife validation and are defined as follows:

(8) $\mathrm{SN} = TP / (TP + FN)$

(9) $\mathrm{SP} = TN / (TN + FP)$

(10) $\mathrm{ACC} = (TP + TN) / (TP + FP + TN + FN)$

(11) $\mathrm{MCC} = (TP \cdot TN - FP \cdot FN) \, / \, \sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}$

where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. ACC is the percentage of positive and negative instances that are correctly predicted, while SN and SP are the percentages of correctly predicted positive instances and negative instances, respectively. A ROC curve plots sensitivity against (1 - specificity) as the decision threshold is shifted, and the AUC gives a threshold-independent measure of classifier performance: an AUC of 1.0 indicates a perfect classifier, whereas a classifier no better than random has an AUC of 0.5. The MCC measures the agreement between the predicted labels and the true labels of all samples in the benchmark dataset and takes values between -1 and +1: a perfect prediction (100% accuracy) yields an MCC of +1, a random prediction gives an MCC of 0, and a completely wrong prediction (0% accuracy) produces an MCC of -1.
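A minimal sketch of this training and evaluation protocol, using scikit-learn (whose SVC class wraps LIBSVM), is given below. Here X is the 400 x LG feature matrix and y the +1/-1 label vector prepared as above; the grid of C and gamma values is an illustrative assumption, and the jackknife test is realized as leave-one-out cross-validation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def train_and_evaluate(X, y):
    # Optimize C and gamma by grid search with 5-fold cross-validation;
    # the exponent ranges below are illustrative, not the paper's values.
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": 2.0 ** np.arange(-5, 16, 2),
                    "gamma": 2.0 ** np.arange(-15, 4, 2)},
        cv=5,
    )
    grid.fit(X, y)
    clf = SVC(kernel="rbf", **grid.best_params_)

    # Jackknife test: every sequence is held out exactly once.
    loo = LeaveOneOut()
    pred = cross_val_predict(clf, X, y, cv=loo)
    scores = cross_val_predict(clf, X, y, cv=loo, method="decision_function")

    tn, fp, fn, tp = confusion_matrix(y, pred, labels=[-1, 1]).ravel()
    return {
        "SN": tp / (tp + fn),                    # Eq. (8)
        "SP": tn / (tn + fp),                    # Eq. (9)
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # Eq. (10)
        "MCC": matthews_corrcoef(y, pred),       # Eq. (11)
        "AUC": roc_auc_score(y, scores),
    }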