@ewha-bio:16
Biological Network Evolution Hypothesis Applied to Protein Structural Interactome.
The latest measure of the relative evolutionary age of protein structurefamilies (based on taxonomic diversity) was applied using the protein structural interactome map (PSIMAP). It confirms that, in general, protein domains that are hubs in this interaction network are older than protein domains with fewer interaction partners. We apply a hypothesis of ‘biological network evolution’ to explain the positive correlation between interaction and age. It agrees with previous suggestions that proteins have acquired an increasing number of interaction partners over time via the stepwise addition of new interactions. This hypothesis is shown to be consistent with the scale-free interaction network topologies proposed by other groups. Closely co-evolved structural interactions and the dynamics of network evolution are used to explain the highly conserved core of protein interaction pathways, which exists across all divisions of life.
There are around 300 distinct classification schemes used to relate over 140,000 species and sub-species in the NCBI Taxonomy database (Wheeler et al., 2000) (July 2002). This ‘tree of life’ classifies species into four superkingdoms, namely: eukaryota, eubacteria, archaea and viruses. The huge diversity of life is the result of billions of years of evolution on Earth. However, the basic core of protein mediated metabolic pathways in all these species is relatively homogeneous (Benner et al., 1989; Morowitz, 1992; Morowitz, 1999). Furthermore, despite the continuing growth in the quantity of determined protein structures, sequences and even whole genomes, the rate of finding novel protein topologies is decreasing (Fig. 1). It is probable that there are no more than 2,000 distinct protein topologies in nature (Chothia, 1992; Orengo et al., 1994; Alexandrov et al., 1995; Wang, 1996; Zhang, 1997). One can ask how such an ancient and diverse evolutionary history could maintain such a homogeneous biochemical backbone, supported by so few protein topologies. What constraints prevent life from using unique biochemical pathways and discovering new protein folds? The proposed scale free topology of the interaction network (Jeong et al., 2000), the structural interaction network (Park et al., 2001), the closely co-evolved nature of protein interactions (Bennett et al., 1994; Marcotte et al., 1999; Fraser et al., 2002) and the rate of network evolution (Kauffman, 1993) contribute significantly to an account of these observations. We have proposed that protein interaction networks are conserved in evolution and that highly interacting groups are relatively old and functionally important (Park and Bolser, 2001). Here, we develop this proposal further with the latest data, applying the established concept of biological network evolution to the protein structural interactome.
Currently the structural classification of proteins database (SCOP) (Murzin et al., 1995) defines around 1,000 distinct protein fold types as superfamilies (termed structurefamilies in this paper), denoting homology between their representative protein structural domain (domain) members. A fold in SCOP is defined solely on the basis of structural similarity between domains. Structurefamilies therefore divide folds into evolutionary groups, using sequence and functional similarities. However, around 90% of the folds defined in SCOP are thought to have a single evolutionary origin, constituting a single superfamily. Structurefamilies (superfamilies) are the most useful domain classification for comparing structures and functions in bioinformatics by virtue of their structural and phylogenetic classification.
Although the rate at which new protein structures are deposited in the Protein Data Bank (PDB) (Berman et al., 2000) is increasing (Fig. 1c), the rate of discovering new structurefamilies is decreasing (Fig. 1b). The recent conservative structurefamily assignment of 56 genomes covered between 40-67% of the total detected genes in eukaryotes and eubacteria (~100,000 genes) and between 31-54% of the total detected genes in archaebacteria (~10,000 genes) (Gough et al., 2001). Given that a significant portion of the unassigned genes may represent trans-membrane and other proteins, not assigned to structures due to experimental difficulty in structure determination, it is reasonable to suggest that there are now enough soluble protein structurefamily data in the PDB to make a global map of structurally observed structurefamily interactions. PSI-MAP (Protein Structural Interactome MAP) (Park et al., 2001) is the first such map (Fig. 2). It also compared, for the first time, experimental protein interaction information, such as that from the yeast two-hybrid system (Uetz et al., 2000), with structural interaction information.
The criteria for assigning interactions in PSI-MAP are strictly structural and exhaustive; distinct pairs of domains in the PDB are denoted as interacting if they share 5 or more residue-residue contacts within 6 angstroms or less (5-5 rule of protein structure interaction). These criteria were chosen as being the most discriminative within a range of other criteria (Fig. 3). Different contact algorithms yield qualitatively similar results (Park et al., 2001). By using the SCOP domain definition (version 1.59 unless otherwise stated) it is possible that these criteria will denote covalently linked domains as interacting. These interactions (intra-interactions) are in the minority, accounting for 30% of the 11,281 domain-domain interactions observed. For a breakdown of the 651 observed structurefamily-structurefamily interactions see Table 1. The number of structurefamilies displaying interaction through both covalently and non-covalently linked domains (73 interacting structurefamilies representing 3,337 interacting domains) indicates that observed domain fusion events in the PDB are extensive. The justification for treating structurefamily pairs that are only intra-interacting as interacting is twofold. Firstly, domain proximity is a result of the selective pressure to associate genes that physically interact (Marcotte et al., 1999; Dandekar et al., 1998; Doolittle, 1999; Enright et al., 1999). Secondly, domain proximity is more generally indicative of indirect functional associations between domains (Marcotte et al., 1999; Overbeek et al., 1999; Enright & Ouzounis, 2001). Domain fusion has been successfully used to predict protein interaction from sequence information alone (see Huynen et al., 2000) and as a hypothesis for the evolution of homo (Bennett et al., 1994) and hetero (Marcotte et al., 1999) dimers (see Table 1 for PSIMAP multimer information). In addition, it has been observed that intra-domain interfaces have strong similarities to inter-domain interfaces within multi-domain proteins (Miller, 1989; Tsai et al., 1996; Jones et al., 2000). Using the expert SCOP domain and superfamily definitions to predict superfamily-superfamily interactions from observed domain fusion events overcomes some of the technical problems associated with the identification of homology and fusion encountered using other computational methods to predict interaction (Overbeek et al., 1999; Enright & Ouzounis, 2001). PSI-MAP therefore represents a robust and reliable method of computationally predicting protein interaction.
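As an illustration of the contact criterion described above, the following is a minimal sketch rather than the PSI-MAP implementation. It assumes that each domain is given as a list of residues, that each residue is a list of (x, y, z) atom coordinates in angstroms, and that two residues are in contact when any pair of their atoms lies within the cutoff. The function names and the data layout are hypothetical; the thresholds (5 contacts, 6 angstroms) are the ones stated in the text.

```python
from itertools import product
from math import dist


def count_residue_contacts(domain_a, domain_b, cutoff=6.0):
    """Count residue pairs (one residue from each domain) in contact.

    Assumption: a residue is a list of (x, y, z) atom coordinates, and two
    residues are in contact if any atom pair is within `cutoff` angstroms.
    """
    contacts = 0
    for res_a, res_b in product(domain_a, domain_b):
        if any(dist(p, q) <= cutoff for p in res_a for q in res_b):
            contacts += 1
    return contacts


def domains_interact(domain_a, domain_b, min_contacts=5, cutoff=6.0):
    """Apply the rule from the text: at least `min_contacts` residue-residue contacts."""
    return count_residue_contacts(domain_a, domain_b, cutoff) >= min_contacts
```

The all-against-all loop is quadratic in the number of residues, which is acceptable for a sketch; a spatial grid or k-d tree would be the usual choice for a whole-PDB scan.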
It has been argued that the PDB is a fair representation of all the soluble protein structurefamilies which may exist. However, given the combinatorial effect, it is unlikely that the PDB covers a representative set of pair-wise structurefamily interactions. Importantly, extending the repertoire of predicted structurefamily-structurefamily interactions by using structurally annotated genomic sequence data does not alter the distribution of observed interactions (Park et al., 2001; Apic et al., 2001a; Apic et al., 2001b). Thus it is likely that the relative distribution of interactions in PSIMAP (which forms the basis of our results and discussion) will reflect the distribution of a hypothetical ‘complete’ map.
Another criticism which has been levelled at PSI-MAP is that the variance in the number of interactions assigned to a structurefamily could be biased by the number of domains in the PDB assigned to that structurefamily. Fig. 4 shows that only a weak correlation exists between the number of domain interactions for a structurefamily and the number of unique structurefamily interactions it has. The correlation coefficient falls to 0.16 upon removal of the four most prominent outliers. A similar correlation is measured between the number of structurefamily interactions and the absolute number of domains for that structurefamily (data not shown).
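The bias check described above can be sketched as follows. The text does not state which correlation coefficient was used or how the outliers were chosen, so this sketch assumes a Pearson coefficient and treats the k points with the largest x values (domain interaction counts) as the outliers. Both choices, and the function names, are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pearson_r_trimmed(xs, ys, k):
    """Recompute r after dropping the k points with the largest x.

    Assumption: 'most prominent outliers' means largest x; the source
    does not define the selection rule.
    """
    kept = sorted(zip(xs, ys), key=lambda point: point[0])[: len(xs) - k]
    return pearson_r([x for x, _ in kept], [y for _, y in kept])
```

Comparing `pearson_r` with `pearson_r_trimmed` on the same data shows how strongly a few heavily populated structurefamilies can dominate the coefficient.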
It has been suggested that artificial structures in the PDB may affect the overall distribution of structurefamily interactions discussed here. PSI-MAP is constructed using only structurefamilies from SCOP classes 1 to 4. In all, there are only 12 multi-domain synthetic proteins in these classes.
PSI-MAP was used to identify all the structurally observed interactions at the structurefamily level. Structurefamilies have various degrees of ‘interactability’, and the interaction frequency distribution obeys a power law (Fig. 5).
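A power-law frequency distribution of the kind described above has the form N(k) ∝ k^(-γ), where N(k) is the number of structurefamilies with k interaction partners. The sketch below estimates γ by an ordinary least-squares fit on log-log axes. This fitting procedure is an assumption chosen for simplicity, since the text does not say how the distribution in Fig. 5 was fitted, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def degree_histogram(partner_counts):
    """Map each partner count k > 0 to the number of structurefamilies having it."""
    return Counter(k for k in partner_counts if k > 0)


def power_law_exponent(histogram):
    """Estimate gamma in N(k) ~ k**(-gamma) from a {k: N(k)} histogram.

    Uses a least-squares line through (log k, log N(k)); gamma is the
    negated slope. Needs at least two distinct k values.
    """
    points = [(log(k), log(n)) for k, n in histogram.items()]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return -slope
```

Log-log regression is sensitive to the sparse high-degree tail; maximum-likelihood estimators are generally preferred when a precise exponent is needed.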
To assess the functional and evolutionary differences between the most interactive and the least interactive folds, we use the latest HIINFOLD and LOINFOLD comparison sets (Park and Bolser, 2001): high interaction structurefamilies (HIINFOLD, see supplement Table A) and low interaction structurefamilies (LOINFOLD, see supplement Table B). The sixteen HIINFOLD structurefamilies (with at least seven other interacting partners) have functions related to glycolysis, oxidative phosphorylation, catabolism and nucleotide synthesis, as well as DNA binding, replication and metabolic regulatory processes. The group contains functionally important domains, often found in core biochemical pathways (Park et al., 2001; Apic et al., 2001a; Apic et al., 2001b). By contrast, the 160 LOINFOLD structurefamilies (each with only one structurefamily interaction) contain only 91 (57%) structurefamilies with at least one assigned enzyme classification (see methods section for details of the functional assignment), covering a total of ~130 distinct enzyme reactions.
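The selection of the two comparison sets can be sketched as below, using the thresholds given in the text (at least seven other partners for the high group, exactly one partner for the low group). The sketch assumes that interactions are supplied as pairs of structurefamily identifiers and that self-interactions do not count towards "other" partners; the function names and input format are hypothetical.

```python
def partner_sets(interactions):
    """Build {structurefamily: set of distinct other partners} from pairs.

    Assumption: a pair (a, a) is a self-interaction and is not counted
    as an 'other' partner, although the family is still registered.
    """
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set())
        partners.setdefault(b, set())
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return partners


def split_by_interaction(partners, high_min=7, low_exact=1):
    """Return (high, low) lists of structurefamilies by partner count."""
    high = sorted(f for f, p in partners.items() if len(p) >= high_min)
    low = sorted(f for f, p in partners.items() if len(p) == low_exact)
    return high, low
```

Families with between two and six partners fall into neither set, which mirrors the text's comparison of the two extremes only.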
The latest functional analysis of HIINFOLD and LOINFOLD supports the previous observation that the absolute number of protein-protein interactions correlates with the lethality of knock-out mutations (Jeong et al., 2001). Thus PSI-MAP reflects the functional importance of structurefamilies through the number of interactions they have.
The occurrence of specific structurefamilies within different branches of the tree of life gives us information on structurefamily evolution and spread. By inference, this information also gives us the relative age of those structurefamilies (see Ponting et al., 1999; Anantharaman et al., 2001; and Snel et al., 2002 for recent examples of this general approach, which was also suggested by the authors in Park and Bolser, 2001). We used the NCBI taxonomic database and the Swissprot (Bairoch & Apweiler, 2000) taxonomic annotation to collect this information (see methods). Simply counting the occurrence of a structurefamily at the highest level of the taxonomic tree (the superkingdom) allowed us to infer an evolutionary age (Table 2). This measure of ‘taxonomic diversity’ gives each structurefamily an approximate relative age rank.
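The counting step described above can be sketched as follows: the taxonomic diversity of a structurefamily is taken as the number of distinct superkingdoms in which it occurs, and partner counts can then be averaged within each diversity level. The input format (pairs of structurefamily and superkingdom, and a dictionary of partner counts) and the function names are assumptions made for illustration.

```python
SUPERKINGDOMS = ("eukaryota", "eubacteria", "archaea", "viruses")


def taxonomic_diversity(occurrences):
    """Return {structurefamily: number of distinct superkingdoms it occurs in}.

    `occurrences` is an iterable of (structurefamily, superkingdom) pairs;
    repeated pairs are counted once.
    """
    seen = {}
    for family, kingdom in occurrences:
        if kingdom not in SUPERKINGDOMS:
            raise ValueError("unknown superkingdom: %r" % (kingdom,))
        seen.setdefault(family, set()).add(kingdom)
    return {family: len(kingdoms) for family, kingdoms in seen.items()}


def mean_partners_by_diversity(diversity, partner_count):
    """Average partner count per diversity level.

    Families missing from `partner_count` are treated as having 0 partners
    (an assumption; they could equally be excluded).
    """
    grouped = {}
    for family, level in diversity.items():
        grouped.setdefault(level, []).append(partner_count.get(family, 0))
    return {level: sum(v) / len(v) for level, v in grouped.items()}
```

A diversity of 4 then marks a structurefamily found in every superkingdom, and a diversity of 1 marks one restricted to a single superkingdom.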
In general, structurefamilies with low taxonomic diversity are less likely to have interactions than those found throughout the tree of life. In addition, the average number of structurefamily interaction partners increases with diversity (Fig. 6). As the superkingdom level is very coarse, this trend will need to be verified at higher resolution in future work. Similar age-interaction correlations have been reported for metabolic networks (Jeong et al., 2000; Wagner & Fell, 2001). Jeong et al. analyse the metabolic networks of 43 organisms, representing eubacteria, eukaryota and archaea. In this analysis 4% of all substrates are found to be present in all 43 organisms. These ubiquitous metabolites also represent the most highly connected substrates in the individual metabolic networks. Similarly, the “less connected substrates ... serve as educts or products of species-specific enzymatic activities” (Jeong et al., 2000). Wagner and Fell concentrate on the analysis of the metabolic network of Escherichia coli. They ranked metabolites according to local and global network connectivity. The authors state that “many of the most highly connected metabolites ... have a proposed early evolutionary origin” (Wagner & Fell, 2001).
Recently, it was discovered that many ‘non-centralised’ networks, including protein interaction networks, have a statistically similar connection topology (Barabasi & Albert, 1999). In these networks low and intermediate numbers of connections are common, while highly connected nodes in the network are rare but statistically significant (Dorogovtsev & Mendes, 2001). Typically, the connection distribution is described by a power law and the network is said to be ‘scale free’ (Barabasi & Albert, 1999). Such networks also have the ‘small world’ property, whereby the network diameter is significantly smaller than that of a random network with the same number of nodes (Watts & Strogatz, 1998). Scale free networks are optimized for the small world property, as randomly removing nodes has a very small effect on the network diameter (Albert et al., 2000). Such networks are said to be robust, as they can tolerate random deletions without changing overall connectivity (Albert et al., 2000). The structural interaction network produced by PSI-MAP has such a scale free topology (Fig. 5) (Park et al., 2001).
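The effect of node removal on network diameter can be explored with a short sketch. The text does not define ‘diameter’; the code below assumes it means the mean shortest path length over all connected pairs of nodes, which is one common convention (the longest shortest path is another). The graph representation, a dictionary mapping each node to a set of neighbours, and the function names are illustrative assumptions.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: {node: number of links from source}."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in lengths:
                lengths[neighbour] = lengths[node] + 1
                queue.append(neighbour)
    return lengths


def mean_path_length(graph):
    """Mean shortest path over all connected ordered pairs (0.0 if none)."""
    total = pairs = 0
    for node in graph:
        for other, length in shortest_path_lengths(graph, node).items():
            if other != node:
                total += length
                pairs += 1
    return total / pairs if pairs else 0.0


def remove_node(graph, node):
    """Return a copy of the graph without `node` and its links."""
    return {n: nbrs - {node} for n, nbrs in graph.items() if n != node}
```

On a star-shaped toy graph, deleting a peripheral node barely changes the mean path length, whereas deleting the hub leaves no connected pairs at all, which is the qualitative contrast between random and targeted deletion.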
Using models of ‘genetic network evolution’, it has been shown that as the allosteric interactions between alleles increase, the rate of finding fitter ‘genotypes’ decreases (Kauffman, 1993). In these models ‘interactions’ limit the ability of a network to evolve. This rate of network evolution suggests why early, functionally important and interconnected life processes are slow to change at evolutionary time scales. Core metabolic pathways can display permutations, for example loss of specific pathways (Huynen et al., 1999); however, the overall network does not change radically. Ancient, fundamental biochemical pathways such as the TCA cycle and glycolysis were fixed in their basic architecture early in evolution.
The conclusion that the rate of network evolution combined with the scale free network topology can account for the structurefamily age-interaction correlation is somewhat at odds with Wagner (2001). In that study, gene duplication events are identified in the yeast genome and used to measure the rate of interaction formation and loss between paralogous genes. A high rate of ‘interaction flux’ is estimated, suggesting that 50% of all the network interactions change every 300 million years. This estimate is based on the assumption that the rate of interaction flux after gene duplication is indicative of the overall rate. However, there is evidence to suggest that this rate could be specifically accelerated after duplication (Long & Langley, 1993; Benton et al., 1997; Cirera & Aguade, 1998; Tsaur et al., 1998), leading to an overestimate of total interaction flux.
The results and conclusions in this paper corroborate the results of Fraser et al. (2002), who use exactly the same principles of network evolution to explain an observed negative correlation between connectivity and evolutionary rate. The principles are interpreted in the biological context of reciprocal mutations and the coevolution of proteins in the interaction network.
Although the scale free topology is said to be robust to the effects of random deletion, the non-random removal of the most highly connected nodes rapidly fragments the network (Albert et al., 2000). Why then do such vulnerable network topologies exist in nature? Two models of network growth have been used to account for the prevalence of scale free networks. The first, called preferential attachment (Barabasi & Albert, 1999), models an attachment bias towards already connected nodes. The second model assumes a constrained network diameter (Puniyani & Lukose, 2001) and the random attachment of nodes. The diameters of the metabolic networks from a total of 43 prokaryotes, eukaryotes and archaea are all very similar (around 4), despite the varying number of metabolites and complexity of these organisms (Jeong et al., 2000). This observation is not predicted by preferential attachment, but is implicit in the second model. It implies that the metabolic network diameter is a limiting factor in evolution. The same constraint has been suggested for protein interaction networks (Jeong et al., 2001). Both models of network growth result in the scale free topology, where old nodes accumulate more links over time (without the specific treatment of age as in Dorogovtsev & Mendes, 2000).
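The first of the two growth models above, preferential attachment, can be simulated in a few lines to show how older nodes accumulate more links. This is a generic sketch of that model class, not the procedure of any cited paper; the seed network (a small fully connected core), the parameter names and the default values are illustrative assumptions. The constrained-diameter model is not sketched here.

```python
import random


def grow_preferential(n_nodes, m=2, seed=0):
    """Grow a network by preferential attachment; return degree per node.

    Nodes are numbered in order of arrival, so index 0 is the oldest.
    Each new node links to m distinct existing nodes chosen with
    probability proportional to their current degree.
    """
    rng = random.Random(seed)
    degree = [0] * n_nodes
    # Each node appears in `endpoints` once per link end, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = []
    # Assumed starting point: a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            degree[new] += 1
            degree[target] += 1
            endpoints += [new, target]
    return degree
```

Comparing the mean degree of the earliest and latest arrivals in the returned list gives a direct, if simplified, picture of the age-connectivity relationship discussed in the text.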
The secondary structure of the HIINFOLD group was mostly alpha and beta (81%, alpha&beta and alpha+beta) with only one all-alpha structurefamily (ARM repeat, a.118.1) and two all-beta superfamilies (Immunoglobulin, b.1.1, and Trypsin-like serine proteases, b.47.1). The 160 LOINFOLD structurefamilies show a more even distribution among the classes (Fig. 7).
Superkingdoms were assigned to SCOP domains via the species identification codes of the SWISS-PROT Protein Sequence Database (Bairoch & Apweiler, 2000; Release 39.0, May 2000). Each SCOP domain sequence in PDB90D (non-redundant SCOP domain sequences at 90% mutual sequence identity) was searched against a non-redundant SWISS-PROT (90% mutual sequence identity) database. The search was done using the PDB-ISL protocol (Park et al., 1997; Teichmann et al., 2000) for reliable structural assignments, implemented to integrate with a relational database for easy analysis. Briefly, the PSI-BLAST search algorithm (Altschul et al., 1997) is used with an e-value threshold of 0.0005 and up to 10 iterations. These values have been previously verified and are known to give less than 1% false positives (Park et al., 1998). Each statistically significant match (e-value below 0.0005) is checked for overlap with matches from other PDB90D domain sequences with different structurefamily classifications, and these ‘classification collisions’ are removed. Additional filtering reduces the error rate further (Park et al., 1997; Teichmann et al., 2000). The resulting structural assignments between representative SCOP domains and proteins in SWISS-PROT give the structurefamily-superkingdom correspondence. These data were used to reliably derive the taxonomic diversity for each structurefamily (Fig. 6).
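The significance and collision filtering described above can be sketched as follows. The sketch assumes that each match records the structurefamily of the query domain, the matched protein, the matched residue range and the e-value; that two matches overlap when their ranges on the same protein share at least one position; and that every match involved in a collision is discarded. The text does not specify these details, so the `Hit` record and both functions are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one search match; field names are assumptions.
Hit = namedtuple("Hit", "family protein start end evalue")


def overlaps(a, b):
    """True if two hits cover at least one common position on the same protein."""
    return a.protein == b.protein and a.start <= b.end and b.start <= a.end


def filter_hits(hits, max_evalue=0.0005):
    """Keep significant hits that do not collide with another structurefamily.

    A 'classification collision' is taken to be an overlap between
    significant hits assigned to different structurefamilies; all hits
    involved in a collision are dropped (an assumption).
    """
    significant = [h for h in hits if h.evalue < max_evalue]
    return [h for h in significant
            if not any(h.family != other.family and overlaps(h, other)
                       for other in significant)]
```

Only hits below the threshold take part in the collision check, so a weak, non-significant match cannot cause a strong match to be discarded.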
Using the same method as above each superkingdom was assigned to a list of SWISS-PROT accession numbers. These numbers give links to entries in the enzyme database via the ENZYME number. A very low e-value threshold was used to select the most reliable enzyme classifications for each structurefamily.
The latest functional analysis of high and low interaction groups showed that the most highly interacting structurefamilies in PSI-MAP represent functionally important enzymatic protein domains with homologues in an average of 3.6 superkingdoms. The least interacting structurefamilies in PSI-MAP represent fewer enzymatic protein domains, occurring in an average of 2 superkingdoms.
In all, the correlation between the relative age and the interactability of protein structurefamilies is consistent with a hypothesis of network growth that proceeds via random add-on interactions with constraints (after Puniyani & Lukose, 2001). New, specialised functions are attached to the existing network of protein interactions, and structurefamilies gradually acquire an increasing number of interaction partners throughout the course of evolution. We attribute the extremely conserved nature of core biochemical pathways to a mechanism of network evolution where relatively ancient components are under strong optimization constraints through multiple interactions (Kauffman, 1993). Thus, in general, protein structurefamilies in central positions in the structural interaction network are more ancient than peripheral structurefamilies.