> top > docs > PMC:4610023 > spans > 0-73

PMC:4610023 / 0-73 JSONTXT

Event inference in multidomain families with phylogenetic reconciliation Abstract Background Reconstructing evolution provides valuable insights into the processes of gene evolution and function. However, while there have been great advances in algorithms and software to reconstruct the history of gene families, these tools do not model the domain shuffling events (domain duplication, insertion, transfer, and deletion) that drive the evolution of multidomain protein families. Protein evolution through domain shuffling events allows for rapid exploration of functions by introducing new combinations of existing folds. This powerful mechanism was key to some significant evolutionary innovations, such as multicellularity and the vertebrate immune system. A method for reconstructing this important evolutionary process is urgently needed. Results Here, we introduce a novel, event-based framework for studying multidomain evolution by reconciling a domain tree with a gene tree, with additional information provided by the species tree. In the context of this framework, we present the first reconciliation algorithms to infer domain shuffling events, while addressing the challenges inherent in the inference of evolution across three levels of organization. Conclusions We apply these methods to the evolution of domains in the Membrane associated Guanylate Kinase family. These case studies reveal a more vivid and detailed evolutionary history than previously provided. Our algorithms have been implemented in software, freely available at http://www.cs.cmu.edu/˜durand/Notung. Background Reconstruction of the history of change in a protein family provides valuable insight into processes of mutation and selection. Evolutionary reconstruction can reveal the context and order in which changes occurred, distinguish between shared history and convergent evolution, and identify interacting mutations that together result in a functional shift. Further, considering protein evolution in the context of species evolution makes it possible to correlate mutations with metabolic, physiological, and morphological changes, indicating which mutations are likely to be functionally important. Despite tremendous advances in molecular evolution and phylogenetics, methods for reconstructing the evolutionary history of multidomain families are lacking. Genes that encode this large and important class of proteins are characterized by a mosaic of sequence fragments that encode structural or functional modules, called domains. Multidomain families are central to the two-component histidine kinase signaling systems that are the backbone of cellular communication in prokaryotes. In metazoans, the expansion of multidomain families drove the evolution of cell signaling and cell adhesion. In human health, multidomain families are implicated in tissue repair, apoptosis, inflammation response, antigen recognition, and innate immunity. Multidomain families evolve via domain shuffling (Figure 1), a process that includes insertion, internal duplication, and deletion of domains. Because gain, loss, or replacement of a domain that encodes specificity can result in an immediate and dramatic change in function, domain shuffling enables rapid evolution of functional variation within gene families that perform core molecular functions. Figure 1 Schematic of multidomain evolution. (a) A hypothetical multidomain family evolving by gene duplication and domain shuffling. (b) Trees representing the history for each domain in the gene family. (c) The evolutionary history of the same family showing the domains evolving in the gene tree. Reconciliation correctly infers 2 domain duplications, 1 domain transfer and 1 domain loss. The gene family (locus) tree is shown in brown; black squares represent domain duplications. Understanding the evolution of multidomain protein function requires understanding how domain architectures change over time. Simple domain architecture (DA) models have been used to achieve the computational efficiency necessary for genome-scale analyses [1-4], and work cited therein]. In the DA model, a multidomain sequence is treated as a set or sequence of tokens (e.g. domain names or domain database identifiers [5-10]) representing its domain composition. The DA model has been used to study domain co-occurrence and variation in the domain repertoire across taxonomic lineages [11-13], plasticity in domain order [14,15], domain occurrence graphs [16,17], and domain promiscuity, i.e., the propensity of a domain to co-occur with many other domains [3,18-20]. In a phylogenetic context, patterns of domain gain and loss have been investigated by treating domain architectures as binary character data; a domain is either present or absent in a given architecture. Given a tree with architectures on the leaves, Wagner and Dollo parsimony have been used to infer ancestral domain architectures and the history of domain gains and losses [2,21-24]. Inferring multidomain trees by applying standard phylogenetic methods to domain architectures treated as character data has been proposed, either by calculating the pairwise edit distance between architectures [25] or by employing a parsimony model [21,26]. However, these approaches have not been applied in practical settings. Approaches to infer ancestral states using domain phylogenies have also been proposed [27-29], but these models have resulted in NP-complete optimization problems. The benefits of the DA model include computational and conceptual tractability. However, it relies on several unrealistic simplifying assumptions: First, the DA model ignores sequence variation within domain superfamilies, treating all instances of a domain family as indistinguishable. Second, the DA model captures change in the form of domain gain and loss, but is not sufficiently powerful to infer the events that caused the change. For example, a domain gain could result from a domain insertion or an ancestral duplication followed by losses. Without an explicit event model, it is not possible to distinguish between these two scenarios. Third, the DA model will make incorrect inferences and underestimate the degree of domain shuffling activity in the presence of parallel gains or losses, as illustrated in Figure 2. Figure 2 Domain architecture gain/loss model. Ancestral domain architectures for the hypothetical family in Fig. 1. Wagner parsimony applied to the the DA model infers 3 gains and 1 loss, underestimating the true events in the "known" history. The ancestral domain architectures inferred with Wagner parsimony are also incorrect. Here, we introduce a reconciliation-based framework for multidomain event inference. Reconciliation is the process of inferring evolutionary events by comparing the phylogenies of entities at two levels of biological organization. Given a rooted, binary gene tree, a rooted, binary species tree, and a mapping from extant genes to extant species, reconciliation seeks to infer the association between ancestral genes and ancestral species and the optimal history of gene duplications, gene losses, and horizontal gene transfers (HGTs) that explains this association. Reconciliation for the duplication-loss (DL) model was first proposed by Goodman et al. [30] and formalized by Page [31] for a parsimony model. Hallett and Lagergren [32] introduced models with transfers and proved that reconciliation under the duplication-transfer (DT ) model is NP-complete. The field has expanded with further algorithmic development for both the DL and DT models [33-40], and work cited therein]. Transfers introduce complications that do not occur in a duplication-only model. Transfers introduce degeneracy: there may be more than one optimal combination of transfers, duplications, and losses that gives rise to the same pattern of tree incongruence. Some reconciliation programs generate a single solution, selected at random. More recently, programs that generate all solutions have become available [38,41]. Transfers also introduce temporal constraints: a transfer can only occur between contemporaneous taxa. An event history is only biologically valid if it is temporally feasible; that is, if there exists a partial ordering of the nodes in the species tree that satisfies the temporal constraints imposed by the transfers in the inferred event history. There is no known constructive algorithm for generating minimum cost, temporally feasible DTL-reconciliations. Instead, dynamic programming is used to construct candidate minimum-cost reconciliations, which are then tested for temporal feasibility. A restricted model that only considers transfers between contemporaneous species, reviewed in [35,40], avoids this problem, but requires a tree with branch lengths in units of time, which are frequently difficult to estimate. Here we present an event inference algorithm for reconstructing domain evolution. We define a set of domain shuffling events that includes domain duplication, domain loss, domain insertion within the same genome, and horizontal domain transfer between different genomes. We also consider a restricted model that allows domain insertion, but not horizontal transfer of genes or domains. In the context of this model, the events in the history of a multidomain family are inferred by reconciling a domain tree with a gene tree that has been previously reconciled with a species tree. This procedure also yields the timing of those events relative to gene and species divergences, as well as ancestral domain content. Consideration of the co-evolution of domains with both genes and species enables our algorithm to distinguish between domain losses and gene losses, and between domain insertions within the same genome and domain transfers across genomes. Further, our algorithm can determine whether a domain tree co-divergence is due to a species divergence (i.e., a speciation) or a gene divergence (i.e., gene duplication or transfer). In order to ensure the biological validity of the inferred evolutionary histories, we present criteria for temporal feasibility that capture the complex temporal constraints associated with reconciliation involving domains, genes, and species. The implementation of these algorithms is freely available at http://www.cs.cmu.edu/˜durand/Notung. To illustrate the inferential power of our approach, we apply our methods to the Membrane associated guanylate kinase (Maguk) family. Results and discussion Model Given a rooted, binary gene tree, TGS = (VG, EG), a rooted, binary domain tree, TD = (VD, ED), and a mapping, MLDG from L(TD), the leaves in the domain tree, to L(TGS), the leaves in gene tree, the goal of domain shuffling event inference is to infer the association between ancestral taxa in the species, gene, and domain trees and the set of events that best explains this association. In our model, we define the following set of events: Co-divergence (C ) A bifurcation in the domain tree that arose through a bifurcation in the gene tree. The gene tree bifurcation may have arisen via speciation (CS), gene duplication (CD), or gene transfer (CT). Duplication (D)A single domain is copied resulting in two separate copies of the domain within the same gene. Loss (L)A domain is deleted from the gene (and genome). Domain insertion (I) A new copy of the domain is inserted into a different gene within the same genome. Horizontal domain transfer (T)A new copy of a domain is inserted into a gene in a different genome. Reconciliation is the process of inferring MDG : VD → VG, the association between ancestral domains and ancestral genes. The result is a reconciled domain tree, TDG = (VD, ED), in which every node, d ∈ VD, is annotated with MDG(d) = g, where g is the ancestral gene that contained domain d; Ld, the domains lost on the edge leading to d; and E(d), the event that caused the divergence at d. Co-divergences and domain duplications correspond to internal nodes in TDG. Each insertion and transfer corresponds to an edge, (d1, d2) ∈ ED, where d2 is the inserted (or transferred) domain and d1 is the donor domain. In a parsimony framework, the cost of a reconciliation is κ=∑∀εκε⋅nε, where κε is the cost of event ε and nε is the number of occurrences of ε in the reconciliation. Formally, we define the problem of inferring a domain event history as follows: Domain Event Inference with Transfers (DE-DTL) Domain events: {CS,CD,CT,D,T,I,L}. Input: A rooted, binary domain tree, TD = (VD, ED); a rooted, binary, DTL-reconciled gene tree, TGS; and a leaf mapping, MLDG:LTD→LTGS. Output: The set of all temporally feasible, domain shuffling histories TDG that minimize κ = κ D n D + κ T n T + κ I n I + κ L n L . Solving DE-DTL entails challenges that do not arise in gene tree-species tree reconciliation. First, when a domain co-divergence is inferred, additional information is required to determine whether the co-divergence is the result of a speciation, a gene duplication, or a gene transfer. Second, domain insertions are horizontal events between genes in the same species. In contrast, domain transfers are horizontal events between different species. An extra test is needed to ensure that the correct event is inferred. Third, a missing taxon in the domain tree may be due to a domain loss or the loss of a gene. If the latter, then the loss should not be included in the event cost of the domain-gene reconciliation. Finally, when testing for temporal feasibility, temporal constraints arising from gene transfers, domain transfers, and domain insertions must all be considered. Notation: Given a tree Ti = (Vi, Ei), ρi designates the root of the tree. For nodes u, v ∈ Vi, p(v) denotes the parent of v and l(v) and r(v) denote the left child and right child of v, respectively. If u is on the path from v to ρi, then u is an ancestor of v, designated u ≥i v, and v is a descendant of u, designated v ≤i u. If v≱iu and u≱iv, then u and v are incomparable, designated u≶iv. Given a reconciled gene tree TGS and a reconciled domain tree TDG, s = MGS (g) is the species s ∈ VS that contained gene g ∈ VG and g = MDG (d) is the gene g that contained domain d ∈ VD. The species containing d is s = MDS (d), where MDS (d) = MGS (MDG(d)). Domain shuffling with transfers and insertions Here, we present an algorithm for the domain event inference problem for multidomain families evolving according to a locus model [42], in which novel domain arrangements arise through internal duplication, loss, and insertion of domains into an existing gene. This restriction justifies the premises that the history of the family as a whole can be described by a tree. This assumption is consistent with the existence of promiscuous domains that lend themselves to insertion in new chromosomal environments [1,3,43,44] and reports of young genes that arose through duplication of existing genes, followed by acquisition of additional domains [2,45-48]. Moveover, domain insertion into an existing gene is more likely to be viable since all regulatory and termination signals required for successful transcription are present. In addition, we assume that domain insertions and transfers only involve domains within the same gene family. In other words, for a given domain family, we assume that the domain instances that appear in the gene family under consideration form a clade in the domain tree. In the context of this model, we introduce an algorithm (Alg. 1) for inferring domain events by reconciling a domain tree with a gene tree that has been previously reconciled with a species tree. Algorithm 1 DE-DTL Input: TS; TGS; TD; MLDG:LTD→LTG Output: T D G 1 … T D G f , κ 1     TGS*=addLoss(TGS,TS) 2     costCalcρD,TGS*,TS 3     TDG1…TDGm=tracebackρD,TGS*,TS 4     TDG1…TDGf=checkFeasibilityTDG1…TDGm The algorithm proceeds in four steps. First, an additional data structure, TGS*, is constructed. TGS* consists of an extended reconciled gene tree that contains additional nodes and leaves representing taxa that are missing due to gene losses. This additional data structure is used to determine whether or not the donor and recipient of a horizontal event are in the same genome or a different genome, and to distinguish between domain losses and gene losses. Next, candidate reconciliations are generated in two passes over the domain tree. In the first pass, the dynamic program costCalc visits each d ∈ VD in post-order and determines the cost of the subtree rooted at d for each possible event at d. This information is stored in the cost and event tables, Kd and Hd. In the second pass, traceback traverses TD top-down to generate candidate reconciliations from the information stored in the cost and event tables. In general, there may be more than one optimal set of domain events that reconcile TD with TGS. The second pass generates all candidate event histories of minimum cost, TDG1…TDGm, where m is the number of candidate histories. In the final step, each candidate history is tested for conflicting temporal constraints. The output is the set of all temporally feasible histories, TDG1…TDGf, where f is the number of feasible, optimal reconciliations. Our domain shuffling event inference algorithm is based on the existing framework for gene-species tree reconciliation with a DTL model [36,49], but contains additional features to address the complications that arise in reconciliation with three nested trees. We discuss these features in detail, here. addLoss: We construct TGS*=VG*,EG* from TGS by placing pseudonodes on each edge (p(g), g) ∈ EG, on which losses occurred. Each pseudonode represents an ancestral gene that was present in the species lineage from MGS (p(g)) to MGS (g), but cannot be observed because of a gene loss in one child of each of the intervening species. Let s1⋯sl be the species between MGS (g) and MGS (p(g)) that are absent from TG due to gene losses. We insert pseudonodes ϕs1⋯ ϕsl between g and p(g) such that ϕs1becomes the new parent of g, ϕsi is the parent of ϕsi−1for i = 2 ⋯ l, and p(g) becomes the new parent of ϕsl. For each pseudonode, ϕs, we attached a pseudoleaf, λs, where s′=ls, if MGS(g)≶l(s), and s′=rs, otherwise. In other words, S' is the child of s that is not on the path from MGS (g) to MGS (p(g)) and s′,s is the species tree branch on which the loss occurred. Note that a pseudoleaf in TGS* may correspond to an internal node in TS, because if a gene is missing from an entire clade of species, then the most parsimonious explanation is a single gene loss in the root of the clade. The addition of pseudonodes allow for more precise estimation of the association between a node in the domain tree and a lineage in the gene tree. Consider, for example, the evolutionary history of the hypothetical family shown in Figure 3, and the reconcilied gene and domain trees for this family, shown in Figure 4. We can infer that the ancestral domain associated with node u in TD was in a genome on the Eudicot-Rosid lineage because of the position of the pseudonode ϕR, between u and d1_g1_A. Without the pseudonode, it would not be possible to determine whether or not u predates the divergence between Apple and Berry. Figure 3 Evolution of a hypothetical multidomain gene family. Domain instances are represented by grey squares. Duplicated genes in the same genome are connected by dotted lines. Figure 4 Domain insertions in the presence of gene loss. (a) Embedded trees showing the co-evolution of domains, genes, and species in the hypothetical family in Fig. 3. (b) An extended reconciled gene tree for the gene family. Inferred gene losses (ℓB and ℓC ) and pseudonodes (open circles, ϕR and ϕE ) representing the location of the missing taxa in the gene tree are shown in grey. The pseudonodes are used to distinguish between gene losses and domain losses in the reconciled domain tree. Gene duplication is represented as a black square; filled circles represent co-divergences. (c) A reconciled tree for the domain family, showing a domain insertion (arrow, edge (u, v)) and a domain loss (d2 g2 B). Domains that are missing due to gene loss (ℓB and ℓC) are shown in grey. Co-divergence due to gene duplication is represented by an open square. Pseudoleaves are also used to distinguish between gene and domain losses. The domain tree in Figure 4(c) has three stubs indicating missing taxa. Two of these are associated with gene losses (ℓB and ℓC), indicating that only one of the missing domains is due to a domain loss (d2_g2_B). costCalc: Once the extended reconciled tree has been constructed, the first pass takes TD and TGS* as input and calls costCalc(d) for each d ∈ VD in post-order (Alg. 2). The gene association and event at d depend on gl and gr, the genes associated with the children of d. The outer loop in cost Calc enumerates all possible (gl, gr) pairs, and for each pair determines the gene associations and events implied by MDG(l(d)) = gl and MDG (r(d)) = gr . The cost of each configuration is stored in Kd. A tuple consisting of the event and the mappings of the children of d are stored in Hd. The logic for these assignments is as follows. Domain duplication is the only event that results in children mapped to comparable genes. Thus, if gl and gr are comparable, then ε=D and the gene associated with d is g = lca(gl, gr). Horizontal events and co-divergences both result in gl≶gr. Therefore, if gl and gr are incomparable, both possibilities are considered. For the co-divergence case, the associated gene is again g = lca(gl, gr). The type of co-divergence is determined by inspecting the gene event associated with gene node g in TGS*. For the horizontal case, the event is a domain insertion if the donor and recipient gene are in the same genome and a domain transfer if they are in different genomes. For both insertions and transfers, either gl or gr may be associated with d; i.e., either gl or gr could be the donor of the domain. For each scenario, the cost of domain losses must also be determined, excluding cases where domain loss is due to gene loss. Recall that pseudonodes correspond to gene losses. Therefore, the number of losses, nL, on the edge from d to p(d) is the number of non-pseudonodes between g = MDG(d) and gp = MDG(p(d)). If g and gp are not pseudonodes, then this quantity is Δ(g) − Δ(gp) − 1, where Δ (g) is the depth of g in the original gene tree, TG. However, if g is a pseudonode, ϕ, then Δ (g) is undefined. We define Δ (ϕ) to be the depth of the first non-pseudonode ancestor of ϕ in TGS*. In other words, Δ (ϕ) = Δ (u), where u ≥G ϕ and there exists no v ∈ VG, such that u ≥G v ≥G ϕ. This effectively jumps over all pseudonodes between ϕ and u. If g is a pseudonode and its first non-pseudonode ancestor is a loss, then directly using the depth of the first non-pseudonode ancestor will fail to count this loss. In this case, the number of losses is incremented by one (lines 15 and 16). No such correction is needed when gp is a pseudonode. This formulation allows for efficient calculation of the number of losses under each scenario considered in costCalc. If E(d) is a co-divergence, then the number of losses is (Δ (gl ) − Δ (g) − 1) + (Δ (gr) − Δ (g) − 1). If E(d) is a duplication, gl, gr, and g should all be at the same depth, and the number of losses is (Δ (gl ) − Δ (g)) + (Δ (gr) − Δ (g)). traceback: Once the first pass is complete, the cost and event tables are filled for each node in TD. The traceback algorithm constructs a minimum-cost reconciliation using these tables in a pre-order traversal. At every node d ∈ VD, the appropriate tuple in the event table Hd is used to assign an event to E(d) and determine the labels of the genes associated with the children of d. The losses that occurred between d and p(d) are inferred in a climbing procedure [38,50]. For each ancestral node g that is missing between MDG (d) and MDG(p(d)), a loss is inferred in g', the child of g that is incomparable to MDG(d). If g is a pseudonode in TGS, then g' is a gene loss, λs. Otherwise, a domain loss in g' is inferred. Algorithm 2 DE-DTL: CostCalc Input:TGS*; TD; MLDG:LTD→L(TG) Output: Kd, Hd∀d∈VD; ν costCalc(d) { 1 if d ∈ L(TD) { 2    for each g∈VG* { 3        if MLDG(d)≶g {table(d, g, ∞, ∞, null, null) } 4            else { 5                 n L = Δ M L D G ( d ) - Δ ( g ) 6                table(d, g, CS,nL, null, null) 7            } 8        } 9     return 10 } 11 costCalc(l(d)), costCalc(r(d)) 12 for each gl,gr∈VG*×VG* { 13 g = lca(gl, gr) 14 n L = Δ g l + Δ g r - 2 ⋅ Δ g + 1 15 if(gl is pseudonode) { nL + + } 16 if(gr is pseudonode) { nL + + } 17 if NOT gl≶gr { 18     // Duplication case; 19     ε←D 20     nL=nL+2 21     table(d, g, ε, nL, gl, gr) 22 } 23 else { 24     // Co-divergence case; 25     if E(g)=Dε←CD           // Duplication 26     else if E(g)=Tε←CT   // Transfer 27     else { ε ← CS }                     // Speciation 28     table(d, g, ε, nL, gl, gr) 29     // Domain insertion or transfer case; 30     if MGS (gl ) = MGS(gr) {          // Same genome 31         // Domain insertion case; 32         table(d, gl, I, 0, gl, gr )      // gl to gr 33         table(d, gr, I, 0, gl, gr )     // gr to gl 34     } else if MGS(gl)≶SMGSgr{ 35         // Domain transfer case; 36         table(d, gl,  T, 0, gl, gr ) // gl to gr 37         table(d, gr,  T, 0, gl, gr ) // gr to gl 38     } 39   } 40 } table(d, g, ε, nL, gl, gr ) { 41 κ←κ(ε)+Kl(d)gl+Kr(d)gr+κL⋅nL 42 ifκ

Document structure show

projects that have annotations to this span

There is no project