PMC:545998

EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments

Abstract

Background
Structure-dependent substitution matrices increase the accuracy of sequence alignments when the 3D structure of one sequence is known, and are successful, for example, in fold recognition. We propose a new automated method, EvDTree, based on a decision tree algorithm, for the automatic derivation of amino acid substitution probabilities from a set of sequence-structure alignments. The main advantage over other approaches is the unbiased, automatic selection of the most informative structural descriptors and associated values or thresholds. This feature allows structure-dependent substitution scores to be derived automatically for any specific set of structures, without the need to determine the best descriptors and parameters empirically.

Results
Decision trees for residue substitutions were constructed for each residue type from sequence-structure alignments extracted from the HOMSTRAD database. For each tree cluster, environment-dependent substitution profiles were derived. The resulting structure-dependent substitution scores were assessed using a criterion based on the mean rank of the observed substitution among all possible substitutions in sequence-structure alignments. The automatically built EvDTree substitution scores give significantly better results than conventional matrices, and similar or slightly better results than other structure-dependent matrices. As a test case, EvDTree was applied to small disulfide-rich proteins to automatically derive class-specific substitution scores, which give better results than non-specific substitution scores. Analyses of the decision tree classifications provide useful information on the relative importance of the different structural descriptors.

Conclusions
We propose a fully automatic method for the classification of structural environments and the inference of structure-dependent substitution profiles.
We show that this approach is more accurate than existing methods for various applications. The easy adaptation of EvDTree to any specific data set opens the way to class-specific structure-dependent substitution scores that can be used in threading-based remote homology searches.

Background

With the sequencing of entire genomes and the exponential growth of sequence databases on the one hand, and the significant number of known folds compared to the putative number of possible folds in fold space on the other hand, sequence-structure comparison is currently one of the main challenges of the post-genomic era. To this end, 3D environments were used by Eisenberg and colleagues in the early 1990s to build statistical potentials indicating the probability of finding each amino acid in a given structural environment, as described by the secondary structure, the solvent accessibility and the polarity of neighboring atoms [1]. Such statistical potentials were successfully applied to protein fold recognition [1-4] and protein model evaluation [5,6], and were shown to improve the quality of sequence-structure alignments [7]. Statistical potentials describing the propensity of a residue pair to lie at a given spatial distance have proved successful as well [8-14], but are more difficult to use as information to guide sequence-structure alignments by dynamic programming. In contrast, residue preferences for position-dependent structural environments are easily implemented in alignment programs [7,15]. Recent improvements in this field were achieved (i) by optimizing the definition and classification of the 3D environments, and (ii) by constructing substitution matrices instead of residue preferences, i.e. by taking the native residue type into account [15-17]. Indeed, it has been shown that amino acid substitutions are constrained by the structural environment, each environment displaying a distinct substitution pattern [18,19].
The use of 64 distinct substitution matrices corresponding to different 3D environments based on secondary structure, solvent accessibility and hydrogen bonding, combined with structure-dependent gap penalties and with global or local alignment algorithms, underlies the good fold-recognition performance of the FUGUE software [15]. In this paper we investigate the use of decision tree algorithms to automate and improve the classification of structural environments. This automation allows easy adaptation to any particular data set, opening the way to the construction of various specific substitution matrices. Indeed, one problem with the use of statistical potentials for structure prediction is their lack of universality [13]. It may thus be worthwhile to derive potentials specific to particular prediction problems or protein classes, and the automated derivation proposed here will facilitate such developments. In the first part of this work we focus on automatically building and evaluating structure-dependent substitution scores. The emphasis is on developing a method for the automatic selection of the most informative classifications of 3D environments, in order to set up a versatile method allowing easy compilation of structure-dependent substitution scores for any given set of proteins. In the second part, the method is applied to a specific protein class, the small disulfide-rich proteins. Decision trees attracted our attention for several reasons. Knowledge acquisition and the description language are optimized automatically and do not require human expertise during the learning process. Thanks to the hierarchical organization of the inferred trees, classifications are obtained quickly, and the chain of decisions leading to a prediction is explicit and easily interpreted (in contrast to artificial neural networks, for example). Decision tree learning algorithms are also robust, since they partition the data recursively.
The first dichotomies near the tree root are thus based on many examples and are therefore statistically reliable. To handle noise and uncertainty, the trees can be pruned in a post-processing step to remove potentially misleading clusters at the bottom of the tree. Research on this topic is well established, and decision trees have been applied to a huge number of real-world problems.

Methods

Structure-dependent substitution profiles

Standard substitution matrices are deduced from multiple alignments of similar sequences [20-22]. To derive structure-dependent substitution matrices, multiple sequence alignments are also needed, as well as a description of the 3D structure at each position in the alignment [15,17]. A schematic overview of the EvDTree method is displayed in Figure 1. Since we want to make sure that all residues at the same position in the alignment share similar 3D structures, we only use multiple alignments obtained from structural superimpositions (step 1 in Figure 1). From these, we extract all observed "substitutions" for each residue type together with the corresponding structural environments (step 2 in Figure 1). The word "substitution" is used here to describe the residue replacement observed at an equivalent position in two structurally similar proteins. Then, for each residue type, the structural environments and associated substitutions are classified using a decision tree algorithm, and substitution scores are computed from the residue distributions observed in each cluster of the classification tree (step 3 in Figure 1). Standard structure-dependent substitution matrices report the probabilities of the 20 × 20 or 21 × 21 possible substitutions in a given structural environment [15]. In this work we instead classify the 3D environments and associated substitutions derived from alignments separately for each of the 21 residue types (the 20 standard residues plus the half-cystine) using a decision tree algorithm (Figure 1, step 3, and Figure 2).
As a result, we obtain several structure-dependent substitution profiles for each residue type, each of which indicates the relative probabilities of all 21 possible substitutions of that residue type in a given structural environment. Since the selected structural environments differ between residue types, the substitution profiles cannot be gathered into structure-dependent substitution matrices. As an example of how structural environments may differ between residues, a solvent-exposed environment might refer to a solvent accessibility > 6% if the residue is a leucine, but to a solvent accessibility > 33% if the residue is a glutamine.

Learning and test data sets

Several data sets of structure-structure superimpositions and the corresponding alignments are available [23-25]. We selected the database of homologous structure alignments, HOMSTRAD [23], for constructing both the learning and the test data sets of sequence-structure alignments. The main strengths of HOMSTRAD are (i) a selection of accurate protein structures, (ii) a combination of automatic procedures and manual inspections that guarantees low data noise, and (iii) its successful use in the derivation of the structure-dependent substitution matrices used in FUGUE [15]. Moreover, to facilitate comparison between our method and FUGUE, we selected the very same learning set previously used by Mizuguchi et al. This subset consists of 177 families extracted from the HOMSTRAD database and does not contain membrane proteins. From this HOMSTRAD subset, a set of 2.5 million observed substitutions was extracted, one substitution corresponding to a residue in a reference structure and the corresponding residue observed at the same position in a structurally similar protein. Moreover, to remove non-structural constraints from the sequence-structure alignments, the following filters were applied:

- Residues involved in a domain-domain or chain-chain interface were excluded.
Residues are considered to be involved in an interface when their solvent accessibility varies by more than 7% between the complex and the isolated chain/domain. The cut-off value was taken from Mizuguchi et al. [15], who used a similar filter to remove residues at a chain-chain interface.

- Residues that are not correctly superimposed in the structural superimposition were also excluded. The superimposition was considered good enough when the deviation between the two corresponding alpha carbons is below 3.5 Å. We assume that larger deviations may correspond to an incorrect structural superimposition for the particular residue, even when other residues are correctly aligned; large deviations may also imply significant modifications of the 3D environment. Although this 3.5 Å criterion is sometimes too restrictive, it leaves enough data for robust statistical estimation while removing most aligned amino acid pairs whose structural contexts are not superposable.

Application of the two above filters excluded about 20% of the initial substitutions, leaving about 2 million substitutions for the learning process. This data set was split into (i) a learning data set of 950,000 substitutions, similar to the learning set used by Mizuguchi et al., (ii) a pruning data set of 325,000 substitutions, and (iii) a test data set of 355,000 substitutions. In some cases the learning data set was filtered further on the percentage of sequence identity between superimposed proteins, resulting in smaller sets of 500,000 (0-40% identity) or 700,000 (0-60% identity) substitutions, respectively.

- Since we only work with three-dimensional structures, the oxidation state of every cysteine (free or disulfide-bridged) is known. The symbol 'C' refers to disulfide-bridged cysteines (half-cystines), whereas the symbol 'J' is used for free cysteines.
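The two filters above can be sketched in Python as follows; the per-pair record layout (field names, units) is illustrative and not from the article:

```python
import math

def passes_filters(pair):
    """Keep an aligned residue pair only if it reflects structural constraints.

    `pair` is a hypothetical record assumed to provide:
      acc_complex / acc_isolated -- solvent accessibility (%) of the reference
                                    residue in the complex vs. the isolated chain
      ca_ref / ca_equiv          -- alpha-carbon coordinates after superimposition
    """
    # Filter 1: exclude interface residues (accessibility shift > 7%).
    if abs(pair["acc_complex"] - pair["acc_isolated"]) > 7.0:
        return False
    # Filter 2: exclude poorly superimposed positions (CA deviation not below 3.5 A).
    dx, dy, dz = (a - b for a, b in zip(pair["ca_ref"], pair["ca_equiv"]))
    if math.sqrt(dx * dx + dy * dy + dz * dz) >= 3.5:
        return False
    return True
```

Applying such a predicate over all aligned pairs reproduces the roughly 20% rejection step described above.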
Structural descriptors

Since the decision tree algorithm automatically selects the most discriminating structural descriptor at each classification step (see below), we do not need to determine the 'best' descriptors empirically. In this work, twenty-three structural descriptors were provided to the classification algorithm. The secondary structure (ss1) was assigned to each residue by STRIDE [26] into seven categories: ss1 = 1, 2, 3, 4, 5, 6 and 7 for α-helices (H), 3₁₀-helices (G), π-helices (I), isolated bridges (B), extended conformations (E), turns (T) and coils (C), respectively. We also used a simpler 3-state description (ss2) deduced from the STRIDE assignment: ss2 = 1, 2 or 3 for helices (H or G), sheets (B or E) and coils (I, T or C), respectively. Hydrogen bonds were determined using the Hbond software [27]. Four descriptors were used for the different types of interaction: side-chain...main-chain O atom (hb1), side-chain...main-chain N atom (hb2), side-chain acceptor...side-chain donor (hb3) and side-chain donor...side-chain acceptor (hb4). For each interaction type, the number of interactions was used as the descriptor value. Here again, a simpler description (lh) was also implemented, taking the value 0, 1, 2 or 3 if the side chain of the residue makes no hydrogen bond, makes hydrogen bond(s) with side-chain atom(s), makes hydrogen bond(s) with main-chain atom(s), or makes hydrogen bonds with both side-chain and main-chain atoms, respectively. Other structural parameters were obtained using the local program compilPDB [J.G.]. Besides the secondary structure, the local structure was also described by the Phi and Psi dihedral angles and by Cα-Cα distances: d3 = Cαi-Cαi+3, d4 = Cαi-Cαi+4, d5 = Cαi-Cαi+5, d6 = Cαi-Cαi+6, d7 = Cαi-Cαi+7.
Other descriptors were the buried surface area (bur), the percentage of accessibility (pac), and the contact areas with carbon atoms (C), nitrogen atoms (N), oxygen atoms (O), sulfur atoms (S), positively charged atoms (pp), negatively charged atoms (nn) and polar atoms (pol). For simplicity, these structural descriptors will hereafter be called s1 to s23. It should be noted that some structural descriptors are correlated (e.g., the Phi and Psi dihedral angles versus the d3 and d4 alpha carbon distances). However, this descriptive redundancy is not a problem, since it is eliminated during the tree construction, where only the most informative descriptors are selected, as explained below.

Automated classification of structural environments using a decision tree algorithm

The native structural environments observed in the learning data set were classified for each of the twenty amino acids plus the half-cystine, resulting in twenty-one independent decision trees (Figures 1 and 2). The decision trees are used as follows. Let (a(k), s(k)) be the position k in a protein for which we want to score substitutions, where a(k) is the residue type and s(k) = (s1(k),...,s23(k)) is the structural environment description at this position. After the learning phase explained below, each tree node is associated with a particular structural descriptor sj and threshold S, and is linked by edges to two subnodes whose structural environments are constrained by the tests sj ≤ S and sj > S, respectively. The classification of (a(k), s(k)) is obtained by selecting the decision tree corresponding to residue type a(k) and then running through the tree from its root node to the appropriate leaf, following at each node the edge whose test, sj ≤ S or sj > S, is compatible with the value of the corresponding structural descriptor sj(k) (Figure 2).
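The walk from root to leaf described above can be sketched as follows; the node layout is a hypothetical minimal representation, not the article's implementation:

```python
class Node:
    """Internal nodes carry (descriptor index j, threshold S) and two children
    for the tests s_j <= S and s_j > S; leaves carry the substitution profile
    derived from their cluster."""
    def __init__(self, j=None, S=None, left=None, right=None, profile=None):
        self.j, self.S = j, S
        self.left, self.right = left, right
        self.profile = profile  # None for internal nodes

def classify(trees, a_k, s_k):
    """Select the tree for native residue type a_k and descend to a leaf,
    following at each node the edge compatible with the descriptor value."""
    node = trees[a_k]
    while node.profile is None:
        node = node.left if s_k[node.j] <= node.S else node.right
    return node.profile
```

The returned leaf profile is what scores the 21 possible substitutions of a(k) at that position.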
The contextual substitution scores associated with the selected tree leaf, derived as explained below, then evaluate each possible substitution of the amino acid a(k). In standard data-mining terminology, the predictive variables are therefore the native amino acid type and its associated structural descriptors, and the dependent variable to be predicted is the substituted residue at this position. During the learning phase, which we now describe, the goal of the decision tree construction is to optimize the predictive power of the structural descriptor test chosen at each node, and therefore to maximize the bias of the statistical distributions of substituted residues associated with each subnode towards a few amino acid types. Ideally, tree leaves would be associated with only one type of substituted amino acid, but this never happens in practice because of the tree depth limitation and the noise in the data set. Let (a(i), s(i), b(i)) be the i-th example of the whole learning data set, where a(i) is a native residue, s(i) = (s1(i),...,s23(i)) is its structural environment description and b(i) is the substituted residue observed in a structurally similar protein at the same position as a(i). The main steps of the decision tree construction from the learning data set are as follows (Figure 2):

1. The decision tree for a given residue type A is initialized with a unique root node and an associated cluster c0 = {i / a(i) = A} grouping all examples with native residue type A.

2. For each tree cluster c, do:

a. Test in turn each descriptor sj and each associated threshold S creating a possible dichotomy of c into two subclusters c1 and c2. If sj has continuous values, 9 possible thresholds S are chosen to create the dichotomies c1 = {i∈c / sj(i) ≤ S} and c2 = {i∈c / sj(i) > S}, corresponding to the 10th, 20th, ..., and 90th percentiles of the statistical distribution of the considered descriptor.
If sj is restricted to a few discrete categories, all possible dichotomies c1 = {i∈c / sj(i) = S} and c2 = {i∈c / sj(i) ≠ S} are created, where S is each possible value of sj in turn.

b. Select the optimal dichotomy from the previous step that satisfies the tree constraints (see section (i) below) and minimizes the chosen splitting criterion (see section (ii) below).

c. Insert the new clusters c1 and c2 as nodes in the tree by linking them to cluster c with edges labeled {sj(i) ≤ S} and {sj(i) > S}, or {sj(i) = S} and {sj(i) ≠ S}, respectively. The structural environment associated with a particular cluster is thus defined by all edge labels from the tree root to the considered tree node or leaf (see Figure 2).

3. Finally, prune the tree according to the selected pruning method and pruning data set (see section (iii) below).

It should be noted that the choice of the optimal descriptor at a given tree level depends both on the amino acid identity of the native residue and on the structural descriptors previously chosen as splitting criteria along the tree path leading to the considered node. The main parameters of the classification are (i) the tree constraints, (ii) the splitting criterion, and (iii) the tree pruning method.

(i) Tree constraints

- Tree depth: as the learning process goes deeper into the tree, more and more specific clusters are created. Beyond a certain depth, the chance that the corresponding rules apply to new examples outside the learning set drops significantly, resulting in overfitting of the available data, since deep clusters do not have enough associated examples to derive statistically significant distributions. Therefore, to avoid wasting time partitioning the data into smaller and smaller clusters, maximum tree depths of 2 to 6 were tested.

- Cluster cardinality: for the same reason, a minimum number of examples was required for each cluster. We tested values between 200 and 1200 in increments of 200.
- Tree balancing: a restriction on uneven distributions of examples between the two clusters from the same parent was applied, to prevent the creation of unbalanced trees that would require a higher depth to fully partition the data. This restriction is enforced by the parameter simcc, which measures the similarity of the cardinalities n1 and n2 of the two subclusters obtained by splitting.

(ii) Splitting criteria: three different splitting criteria were tested

- The Gini criterion evaluates the probability that two randomly selected elements in a cluster correspond to two different residue types [28]:

Gini(c) = 1 - Σa P(a|c)²

where P(a|c) is the relative frequency of residue type a in cluster c. To evaluate the quality of a given segmentation into several clusters, the splitting criterion is the weighted sum

Σc (nc/n) Gini(c)

where nc is the number of elements in cluster c and n is the total number of elements in all clusters.

- The Shannon entropy [29] tries to limit the distribution of elements of the same class among several clusters:

H(c) = - Σa P(a|c) log P(a|c)

where P(a|c) is the relative frequency of residue type a in cluster c.

- We also used a specifically developed splitting criterion called the "mean rank" (MR). Each class (residue type) in a cluster is ranked according to the number of elements of that class in the cluster (rank 1 is assigned to the most frequent residue type and rank 21 to the least frequent one). The mean rank evaluates the expected rank of a randomly selected element in the cluster:

MR(c) = Σa R(a|c) P(a|c)

where R(a|c) and P(a|c) are the frequency rank and probability of residue type a in cluster c. A low MR indicates a cluster with only a few well-represented classes; such clusters correspond to structural environments that induce a significant bias in the sequence, and therefore strong structural constraints.
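Steps 2a-2b under the Gini and mean-rank criteria can be sketched in Python; the example data layout (a cluster as a list of (descriptor vector, substituted residue) pairs) is illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini(c) = 1 - sum_a P(a|c)^2: chance two random elements differ in type."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def mean_rank(labels):
    """MR(c) = sum_a R(a|c) P(a|c), with rank 1 for the most frequent type."""
    n = len(labels)
    counts = sorted(Counter(labels).values(), reverse=True)
    return sum(rank * k / n for rank, k in enumerate(counts, start=1))

def decile_thresholds(values):
    """The 9 candidate thresholds at the 10th, 20th, ..., 90th percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, p * n // 100)] for p in range(10, 100, 10)]

def best_split(cluster, j, criterion=gini):
    """Pick the threshold on descriptor j minimizing the cardinality-weighted
    criterion over the two subclusters (step 2b); returns (score, S) or None."""
    n = len(cluster)
    best = None
    for S in decile_thresholds([s[j] for s, _ in cluster]):
        c1 = [b for s, b in cluster if s[j] <= S]
        c2 = [b for s, b in cluster if s[j] > S]
        if not c1 or not c2:
            continue
        score = (len(c1) * criterion(c1) + len(c2) * criterion(c2)) / n
        if best is None or score < best[0]:
            best = (score, S)
    return best
```

In the full algorithm this search runs over all 23 descriptors, and the dichotomy is kept only if it also satisfies the depth, cardinality and balancing constraints.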
(iii) Pruning methods: three different pruning methods were considered

- Pessimistic Error Pruning (PEP) [30] consists of recursively checking each cluster, starting from the tree root, and cutting the corresponding subtree if its removal reduces the mean error rate estimated on the independent pruning set by an error estimate E(c), where N(c) is the number of examples assigned to cluster c and n(c) is the number of occurrences of the most frequent amino acid in cluster c. Let C be the parent cluster from which c is derived in the tree; then c and its subtree are removed if E(c) ≥ E(C).

- Mean Rank Pruning (MRP) follows the same principle as PEP, except that c and its subtree are removed if R(c) ≥ R(C), where R is the mean rank criterion evaluated on the pruning set.
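The PEP check can be sketched as follows; since the article's formula is not reproduced in this extract, the error estimate below assumes the classical continuity-corrected "pessimistic" form (N(c) - n(c) + 1/2) / N(c):

```python
def pep_error(cluster_labels):
    """Pessimistic error estimate for a cluster of substituted-residue labels.
    N is the cluster size and n the count of the most frequent amino acid;
    the +1/2 continuity correction is an assumption (standard PEP form)."""
    N = len(cluster_labels)
    n = max(cluster_labels.count(a) for a in set(cluster_labels))
    return (N - n + 0.5) / N

def should_prune(child_labels, parent_labels):
    """Cut the child subtree if its estimated error E(c) >= E(C) of its parent."""
    return pep_error(child_labels) >= pep_error(parent_labels)
```

MRP follows the same recursive scheme with the mean rank in place of the error estimate.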
