CORD-19:54c3a8137f74c923d699e4bd55344b88264778ca

Id Subject Object Predicate Lexical cue
T1 0-153 Sentence denotes Donut-shaped fingerprint in homologous polypeptide relationships-A topological feature related to pathogenic structural changes in conformational disease
T2 155-163 Sentence denotes Abstract
T3 164-362 Sentence denotes Features of the homologous relationships of proteins can provide us with a general picture of the protein universe, assist protein design and analysis, and further our comprehension of the evolution of organisms.
T4 363-494 Sentence denotes Here we carried out a study of the evolution of protein molecules by investigating homologous relationships among residue segments.
T5 495-637 Sentence denotes The motive was to identify detailed topological features of homologous relationships for short residue segments in the whole protein universe.
T6 638-821 Sentence denotes Based on the data of a large number of non-redundant proteins, the universe of non-membrane polypeptide was analyzed by considering both residue mutations and structural conservation.
T7 822-1023 Sentence denotes By connecting homologous segments with edges, we obtained a homologous relationship network of the whole universe of short residue segments, which we named the graph of polypeptide relationships (GPR).
T8 1024-1197 Sentence denotes Because the network is too complicated for topological transformation as a whole, only subgraphs composed of vital nodes of the GPR were analyzed to obtain an in-depth understanding.
T9 1198-1278 Sentence denotes Such analysis of vital subgraphs of the GPR revealed a donut-shaped fingerprint.
T10 1279-1755 Sentence denotes Utilization of this topological feature revealed the switch sites (residues 188-202, where previously hidden "hot spots" of fibril formation begin to be exposed, thereby providing a further opportunity for protein aggregation) of the conformational conversion of the normal α-helix-rich prion protein PrP^C to the β-sheet-rich PrP^Sc, which is thought to be responsible for a group of fatal neurodegenerative diseases, the transmissible spongiform encephalopathies.
T11 1756-1855 Sentence denotes Efforts in analyzing other proteins related to various conformational diseases are also introduced.
T21 3549-3569 Sentence denotes © 2009 Elsevier Ltd.
T22 3570-3590 Sentence denotes All rights reserved.
T23 3591-4038 Sentence denotes Computational approaches, such as homology modeling (Chou, 2004a), structural bioinformatics (Chou, 2004b; Liu et al., 2008b), pharmacophore modeling (Sirois et al., 2004), Monte Carlo simulated annealing (Chou, 1992), protein subcellular location prediction (Chou and Shen, 2007b, 2008), and signal peptide prediction (Chou and Shen, 2007a; Shen and Chou, 2007), can provide very useful information on and insight into basic research and drug design.
T24 4039-4376 Sentence denotes Since our ability to characterize the biological properties of a protein is almost exclusively based on properties conserved through evolutionary time, the study of protein evolution using computational approaches has been the focus of many researchers (Socolich et al., 2005; Russ et al., 2005; Zhang and Liu, 2008; Liu et al., 2008a).
T25 4377-4610 Sentence denotes Characterization of the protein universe can assist in comprehending the evolvement, i.e., the formation, past, and future of proteins, in designing artificial proteins, and in providing information useful in other biological fields.
T26 4611-4926 Sentence denotes A well-known feature of the protein universe is that some folds are abundantly represented by proteins with sequence identity as low as random sequences (Rost, 1997; Sander, 1993, 1997), whereas other folds are represented by a single sequence (Teichmann et al., 1999; Orengo et al., 1999; Holm and Sander, 1996).
T27 4927-5208 Sentence denotes To explain this variability in fold representation, it has been suggested that a premise in convergent evolution is that folds with higher designability can be encoded and represented by more sequences (Finkelstein et al., 1995; Govindarajan and Goldstein, 1996; Li et al., 1996).
T28 5209-5336 Sentence denotes This phenomenological notion was identified from the observation of exhaustive sequence enumeration in a lattice protein model.
T29 5337-5657 Sentence denotes Application of this notion led to remarkable progress in research in areas such as folding mechanisms (Li et al., 1998; Wolynes, 1996; England et al., 2003), plotting of the distribution of protein populations (Taverna and Goldstein, 2000; Shakhnovich et al., 2005), and hereditary diseases (Wong and Frishman, 2006).
T30 5658-5766 Sentence denotes However, the designability principle often provides features of a lattice model, but not of actual proteins.
T31 5767-5862 Sentence denotes Several attempts have been made to define a more realistic picture of homologous relationships.
T32 5863-5950 Sentence denotes Dokholyan et al. (2002) offered a general picture of the universe of protein structure.
T33 5951-6199 Sentence denotes Based on structural alignments provided by the FSSP database, the authors claimed that the graph formed by proteins/vertices of a non-redundant set and connections/edges between any two structurally similar protein domains was a scale-free network.
T34 6200-6339 Sentence denotes In such a network, the probability density $P(K)$ of a domain with $K$ related structures (connection number) follows a power law, $P(K) = K^{-a}$.
T35 6340-6515 Sentence denotes Several similar studies have been performed based on sequence, structure, or both (Huynen and van Nimwegen, 1998; Yanai et al., 2000; Qian et al., 2001; Koonin et al., 2002).
T36 6516-6575 Sentence denotes All the networks obtained have the same scale-free feature.
T37 6576-7154 Sentence denotes In fact, as a feasible method to obtain useful insights, graphical approaches have been used in the study of many biological systems, such as enzyme-catalyzed reactions (Andraos, 2008; Chou, 1989; Chou and Forsen, 1980; Kuzmic et al., 1992; Myers and Palmer, 1985; Zhou and Deng, 1984), protein folding kinetics (Chou, 1990), inhibition kinetics of processive nucleic acid polymerases and nucleases (Althaus et al., 1993; Chou and Kezdy, 1994), analysis of codon usage (Chou and Zhang, 1992; Zhang and Chou, 1994), and analysis of DNA sequences (Qi et al., 2007), among others.
T38 7155-7657 Sentence denotes Moreover, in recent years, graphical methods have also been used to deal with many complicated biosystems, e.g., the QSAR study (Prado-Prado et al., 2008), hard bionetwork systems (Diao et al., 2007), hepatitis B viral infection (Xiao et al., 2006), HBV virus gene missense mutations (Xiao et al., 2005b), visual analysis of SARS-CoV (Wang et al., 2005), representation of complicated biological sequences (Xiao et al., 2005a), and identification of protein attributes (Xiao and Chou, 2007).
T39 7658-7729 Sentence denotes Graphical approaches are a hot topic in biological and medical science.
T40 7730-7832 Sentence denotes With improvements in graphical analysis capabilities, we can obtain a more in-depth insight than ever.
T41 7833-7983 Sentence denotes For instance, in a graph of homologous relationships, the aforementioned scale-free feature only describes the distribution of vertex connectivity.
T42 7984-8112 Sentence denotes Even if the distributions are identical, a network may have a specific characteristic that distinguishes it from other networks.
T43 8113-8205 Sentence denotes Our interest is in identifying some in-depth features specific for homologous relationships.
T44 8206-8336 Sentence denotes To obtain in-depth and detailed features, a reasonable standard for the definition of a homologous relationship is a prerequisite.
T45 8337-8409 Sentence denotes Structure and sequence are two significant characteristics of a protein.
T46 8410-8546 Sentence denotes Since remote homologous proteins can share the representative fold of a family, structure is a more robust characteristic than sequence.
T47 8547-8635 Sentence denotes On the other hand, sequence similarity is vital in identifying homologous relationships.
T48 8636-8818 Sentence denotes As we draw a network of homologous relationship, if only structural similarity is considered, proteins without similar biological properties might be mistakenly connected by an edge.
T49 8819-8867 Sentence denotes This will result in a false detail in the graph.
T50 8868-9002 Sentence denotes Similarly, when only sequence information is considered, severe differences in biological properties (e.g., structure) might be tolerated.
T51 9003-9159 Sentence denotes Thus, joint consideration of sequence and structural similarities is most appropriate for plotting a graph of homologous relationships (Qian et al., 2001).
T52 9160-9239 Sentence denotes Biological systems have evolved from simple to complex and from small to large.
T53 9240-9449 Sentence denotes It has been proposed that short segments of polypeptides may have collapsed together to form folded protodomains in the early evolution of proteins (Trifonov and Berezovsky, 2003; Riechmann and Winter, 2006).
T54 9450-9718 Sentence denotes Domains evolved to their modern size through the assembly and/or exchange of smaller gene segments encoding polypeptide segments of sub-domain size (Blake, 1978), for example, by exon shuffling (Gilbert, 1978) or non-homologous recombination (Bogarad and Deem, 1999).
T55 9719-9840 Sentence denotes Thus, homologous relationships for short polypeptide segments represent an ideal aspect to investigate protein evolution.
T56 9841-10026 Sentence denotes On the other hand, in protein evolution, insertions and deletions often occur in variable regions, but to a lesser degree in conserved regions that are important for biological properties.
T57 10027-10204 Sentence denotes Consequently, much progress has been achieved by matching homologous proteins with ungapped residue segments on a site-by-site basis (Smith et al., 1990; Henikoff, 1991, 1992).
T58 10205-10443 Sentence denotes Since the alignment of ungapped residue segments retains most of the information significant for corresponding homologs, a suitable representation in characterizing homologous relationships for short polypeptide segments is also provided.
T59 10444-10614 Sentence denotes In the present study we propose a novel approach to investigating homologous relationships for proteins, one that provides useful information for various conformational diseases.
T60 10615-10791 Sentence denotes We used information on ungapped aligned residue segments to plot a general graph of polypeptide relationships (GPR) by jointly considering sequence and structural similarities.
T61 10792-10892 Sentence denotes Detailed analysis of the graph revealed a donut-shaped fingerprint in a vital subnetwork of the GPR.
T62 10893-11062 Sentence denotes Using the information provided by this fingerprint, we identified switch sites for conformational conversion of prion, and other conformational disease-related proteins.
T63 11063-11138 Sentence denotes We investigated homologous relationships between pairs of residue segments.
T64 11139-11276 Sentence denotes In total, 1612 non-membrane proteins from PDB_SELECT25 (issued on 25 September 2001; Hobohm and Sander, 1994) were used in the analysis.
T65 11277-11372 Sentence denotes In this non-redundant data set, no pair of sequences shares sequence identity of more than 25%.
T66 11373-11503 Sentence denotes The solvent-accessible area for each residue was calculated using the DSSP algorithm (Kabsch and Sander, 1983) for every protein.
T67 11504-11570 Sentence denotes A protein sequence is treated as a succession of residue segments.
T68 11571-11716 Sentence denotes As the residue-residue correlation is notable in 15-residue segments (Liu et al., 2003), we used a window width of 15 for further consideration.
T69 11717-11831 Sentence denotes By sliding a 15-residue window along the protein sequence, each segment of the data set serves as a query segment.
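The windowing step can be illustrated with a minimal sketch. The function name `sliding_segments` and the example sequence are hypothetical and serve only to show how every ungapped 15-residue window becomes a query segment.

```python
def sliding_segments(sequence, width=15):
    """Yield (start, segment) for every ungapped window of `width` residues."""
    # A sequence shorter than `width` yields no segments at all.
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


# Hypothetical 20-residue sequence: it contains 20 - 15 + 1 = 6 windows.
segments = list(sliding_segments("ACDEFGHIKLMNPQRSTVWY"))
```

Each protein of length L contributes L - 14 overlapping segments in this scheme.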
T70 11832-11939 Sentence denotes In the universe of residue segments, samples are biased, i.e., some segments are closely related to others.
T71 11940-12122 Sentence denotes To reduce this bias, filter redundant samples, and obtain a non-redundant GPR, we constructed a non-redundant target set for homolog searches, $\{CR_{\le 4}\}_m$, for each query segment $m$.
T72 12123-12290 Sentence denotes This target set is a subset of our database in which each segment shares no more than four common residues with the query segment ($CR \le 4$, i.e., a sequence identity of at most 26.7%).
T73 12291-12361 Sentence denotes For each query, homologs are searched in the corresponding target set.
T74 12362-12446 Sentence denotes In this way, all the segments obtained are remote homologs of the query polypeptide.
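A minimal sketch of the target-set filter follows. It assumes that "common residues" means identical residues at the same position of two ungapped, equal-length segments; the function names are hypothetical.

```python
def common_residues(seg_a, seg_b):
    """Count positions at which two equal-length segments carry the same residue."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have equal length")
    return sum(a == b for a, b in zip(seg_a, seg_b))


def target_set(query, database, max_common=4):
    """Keep segments sharing at most `max_common` residues with the query.

    For 15-residue segments, 4 common residues correspond to 4/15 = 26.7% identity.
    The query itself (15 common residues) is excluded automatically.
    """
    return [s for s in database if common_residues(query, s) <= max_common]
```

For example, a segment that matches the query at exactly four positions is kept, while one that matches at five positions is filtered out.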
T75 12447-12640 Sentence denotes For each query segment $m$, we searched the corresponding target set $\{CR_{\le 4}\}_m$ for ungapped segments, i.e., remote homologs that are similar to the query in terms of both sequence and structure.
T76 12641-12675 Sentence denotes This was carried out in two steps.
T77 12676-12790 Sentence denotes First, multi-aligned remote homolog candidates of the query segment were initialized using a center-star approach.
T78 12791-12945 Sentence denotes Then the remote homolog candidates were optimized using a position-specific matrix, an updated scoring scheme used in evaluating homologous relationships.
T79 12946-13101 Sentence denotes For each query segment $m$, if the following two conditions are satisfied, we say that segment $n$, $n \in \{CR_{\le 4}\}_m$, is a non-redundant structural analog of $m$.
T80 13102-13104 Sentence denotes 1.
T81 13105-13218 Sentence denotes Structural similarity: $\mathrm{drms}(m, n) < 4$ Å, where the distance root mean squared deviation (drms; Park and Levitt, 1995)
T82 13220-13300 Sentence denotes for structures $m$ and $n$ is defined as the average distance difference $\mathrm{drms}(m, n) =$
T83 13301-13363 Sentence denotes where $r_{ai}$ is the coordinate of Cα atom $i$ in structure $a$.
T84 13364-13366 Sentence denotes 2.
T85 13367-13590 Sentence denotes Difference in surface residues: $Z = \sum_{i=1}^{15} \delta(s_{mi}, s_{ni})$ is at most 2 between $m$ and $n$, where $s_{ai} = 1$ for a surface residue and $s_{ai} = 0$ otherwise, and $\delta(x, y)$ is a step function with $\delta(x, y) = 0$ for $x = y$ and $\delta(x, y) = 1$ otherwise.
T86 13591-13718 Sentence denotes The contribution of a residue to the folding mechanism and protein function depends on whether or not it is exposed to solvent.
T87 13719-13823 Sentence denotes To identify highly related samples, we investigated segments with similar exposed/ buried residues to m.
T88 13824-13977 Sentence denotes A surface residue is defined as one with accessible area greater than 10% of the maximum accessible surface area (Chothia, 1975) for that type of residue.
T89 13978-14063 Sentence denotes Non-redundant structural analogs of query segment $m$ were searched in set $\{CR_{\le 4}\}_m$.
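The two conditions for a structural analog can be sketched as follows. This is only an illustration: it assumes the usual distance-RMS form, i.e., the root mean square, over all pairs of Cα atoms, of the difference between corresponding intra-segment distances, and the authors' exact normalization may differ. All function names are hypothetical.

```python
import math
from itertools import combinations


def drms(coords_m, coords_n):
    """Distance RMS deviation between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumed form: sqrt(mean over pairs i<j of (d_ij(m) - d_ij(n))**2).
    """
    if len(coords_m) != len(coords_n):
        raise ValueError("structures must have equal length")
    pairs = list(combinations(range(len(coords_m)), 2))
    total = sum(
        (math.dist(coords_m[i], coords_m[j]) - math.dist(coords_n[i], coords_n[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))


def surface_flags(accessible_area, max_area, fraction=0.10):
    """Return 1 for residues whose accessible area exceeds 10% of the maximum, else 0."""
    return [1 if a > fraction * m else 0 for a, m in zip(accessible_area, max_area)]


def surface_difference(flags_m, flags_n):
    """Z: number of positions whose surface/buried state differs."""
    return sum(x != y for x, y in zip(flags_m, flags_n))


def is_structural_analog(coords_m, coords_n, flags_m, flags_n, max_drms=4.0, max_z=2):
    """Condition 1 (drms < 4 angstroms) and condition 2 (Z <= 2) together."""
    return drms(coords_m, coords_n) < max_drms and surface_difference(flags_m, flags_n) <= max_z
```

Because drms compares internal distances only, it needs no prior superposition of the two segments.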
T90 14064-14269 Sentence denotes In addition to structural similarity, sequence similarity can be evaluated by the knowledge of the propensity of residues to substitute for each other in homologous proteins (Henikoff and Henikoff, 1992).
T91 14270-14381 Sentence denotes Each non-redundant structural analog $n$ of the query segment $m$ has a pairwise sequence alignment score $T(m, n) = \sum_{i=1}^{15} B(a_i, b_i)$,
T92 14382-14467 Sentence denotes where $B(a_i, b_i)$ is an element of the BLOSUM30 matrix for the substitution of residues $a_i$ and $b_i$.
T93 14468-14645 Sentence denotes T is a measure of the degree of sequence similarity, and approximately corresponds to the biological homophyly between the non-redundant structural analog and the query segment.
T94 14646-14866 Sentence denotes We ranked the non-redundant structural analogs of m in descending order of score T, and used the top 30 samples as the initial remote homolog candidates in multiple sequence alignment of the subsequent optimization step.
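The initialization step can be sketched as below, reading the score T as the position-wise sum of substitution-matrix entries over the aligned residues. The matrix is passed in as a dictionary keyed by residue pairs; the function names are hypothetical, and any values used with it in examples are toy numbers, not real BLOSUM30 entries.

```python
def pair_score(seg_m, seg_n, matrix):
    """Sum substitution scores over the aligned positions of two ungapped segments."""
    return sum(matrix[(a, b)] for a, b in zip(seg_m, seg_n))


def top_candidates(query, analogs, matrix, keep=30):
    """Rank structural analogs by descending score T and keep the best `keep`."""
    ranked = sorted(analogs, key=lambda s: pair_score(query, s, matrix), reverse=True)
    return ranked[:keep]
```

In practice `matrix` would be filled from a BLOSUM30 table, and `analogs` would be the segments that already passed the two structural conditions.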
T95 14867-15047 Sentence denotes In this step, a position-specific matrix of multiple sequence alignments was calculated for the top 30 samples. (The first matrix was calculated using sequences provided by step 1.
T96 15048-15168 Sentence denotes Updated matrices were calculated using samples produced by step 2.) The scores for a specific position i are of the form
T97 15169-15233 Sentence denotes The probability of finding residue j in column i is estimated as
T98 15234-15448 Sentence denotes where $f_{ij}$ is the observed frequency for residue $j$, $\alpha$ and $\beta$ are the relative weights for the observed and pseudocount residue frequencies, $\alpha = N$ is the number of aligned segments, and $\beta$ is reasonably set as $\beta = 10$.
T99 15449-15678 Sentence denotes The pseudocount frequencies are $g_j = \sum_k (f_{ik}/p_k)\, q_{jk}$, where $q_{jk}$ are the target frequencies according to data from the BLOSUM30 matrix and $p_l$ is the background probability of the occurrence of residue $l$ implicit in the BLOSUM30 matrix.
T100 15679-15732 Sentence denotes For a segment t, the alignment score is calculated as
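The symbols defined above match the pseudocount scheme popularized by PSI-BLAST, so the following sketch assumes that form: the estimated probability is $Q_{ij} = (\alpha f_{ij} + \beta g_j)/(\alpha + \beta)$, the position score is $\log(Q_{ij}/p_j)$, and a segment is scored by summing the position scores of its residues. The natural logarithm, the function names, and the dictionary layout are assumptions made for illustration.

```python
import math


def column_frequencies(segments, alphabet):
    """Observed residue frequencies f_ij for each column i of the ungapped alignment."""
    width = len(segments[0])
    freqs = []
    for i in range(width):
        column = [s[i] for s in segments]
        freqs.append({r: column.count(r) / len(column) for r in alphabet})
    return freqs


def position_specific_matrix(segments, background, target, beta=10.0):
    """Build log-odds scores with pseudocounts.

    background: dict residue -> p_j; target: dict (j, k) -> q_jk.
    alpha is set to the number of aligned segments, as stated in the text.
    """
    alphabet = sorted(background)
    alpha = float(len(segments))
    matrix = []
    for f in column_frequencies(segments, alphabet):
        row = {}
        for j in alphabet:
            # Pseudocount frequency g_j = sum_k (f_ik / p_k) * q_jk
            g = sum(f[k] / background[k] * target[(j, k)] for k in alphabet)
            q = (alpha * f[j] + beta * g) / (alpha + beta)
            row[j] = math.log(q / background[j])
        matrix.append(row)
    return matrix


def segment_score(segment, matrix):
    """Alignment score of a segment: sum of its position-specific scores."""
    return sum(matrix[i][r] for i, r in enumerate(segment))
```

The pseudocounts keep every estimated probability positive, so residues never observed in a column still receive a finite (negative) score.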
T101 15733-15940 Sentence denotes We searched target set $\{CR_{\le 4}\}_m$ for non-redundant structural analogs (structural similarity retained) of the query segment $m$ and ranked them in decreasing order of score $T'$ (sequence similarity retained).
T102 15941-16014 Sentence denotes The top 50 segments were identified as updated remote homolog candidates.
T103 16015-16246 Sentence denotes After reranking these 50 segments using the HFnet (Hydrophobic Force network) algorithm, which boosts the quality of sequence alignment (see supporting information), the position-specific matrix was updated with the top 30 samples.
T104 16247-16320 Sentence denotes The contribution from HFnet was nearly convergent after seven iterations.
T105 16321-16405 Sentence denotes Thus, to maximize the alignment quality, a total of seven iterations were processed.
T106 16406-16530 Sentence denotes Then the final top 30 segments were selected as remote homologs, and formed the remote homolog set $\{R_m\}$ for polypeptide $m$.
T107 16531-16600 Sentence denotes In general, three factors are considered in scanning remote homologs:
T108 16601-16603 Sentence denotes 1.
T109 16604-16685 Sentence denotes Sequence identity is limited to 26.7%, so that a non-redundant graph is obtained.
T110 16686-16688 Sentence denotes 2.
T111 16689-16820 Sentence denotes Structural similarity is required during the initialization and optimization processes, so that structural information is not lost.
T112 16821-16823 Sentence denotes 3.
T113 16824-16975 Sentence denotes Sequence similarity is retained by adopting the top-ranked samples according to the BLOSUM30 homolog database and the updated position-specific matrix.
T114 16976-17070 Sentence denotes We attempted to plot a homologous relationship network for the whole universe of polypeptides.
T115 17071-17168 Sentence denotes Each query segment of our database was defined as a node of the polypeptide relationship network.
T116 17169-17259 Sentence denotes According to the above method, remote homologs were found for each of these nodes/queries.
T117 17260-17410 Sentence denotes Two nodes A and B are deemed to be related if $B \in \{R_A\}$ or $A \in \{R_B\}$, where $\{R_A\}$ and $\{R_B\}$ are the remote homolog sets for A and B, respectively.
T118 17411-17521 Sentence denotes If $\{R_A\}$ and $\{R_B\}$ share no less than five segments, we say that nodes A and B are connected by edge $(A, B)$.
T119 17522-17661 Sentence denotes Owing to our definition, each pair of connected nodes/polypeptides has similar biological properties, but a low level of sequence identity.
T120 17662-17763 Sentence denotes Consequently, we constructed a non-redundant unweighted GPR in which each edge is considered equally.
T121 17764-17863 Sentence denotes For each node, the value of connectivity K is defined as the number of edges connected to the node.
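The edge rule and the connectivity K can be sketched as follows. The sketch assumes that an edge depends only on the number of shared remote homologs (at least five); whether the two nodes must additionally be "related" is not spelled out, so that check is left out. The function names are hypothetical.

```python
from itertools import combinations


def build_gpr(homolog_sets, min_shared=5):
    """Return the unweighted edge set of the graph.

    homolog_sets: dict node -> set of remote homolog identifiers.
    Two nodes are joined when their sets share at least `min_shared` members.
    """
    edges = set()
    # All-pairs comparison: quadratic, fine for a sketch but not for the full database.
    for a, b in combinations(sorted(homolog_sets), 2):
        if len(homolog_sets[a] & homolog_sets[b]) >= min_shared:
            edges.add((a, b))
    return edges


def connectivity(nodes, edges):
    """K for each node: the number of edges attached to it (orphans have K = 0)."""
    k = {n: 0 for n in nodes}
    for a, b in edges:
        k[a] += 1
        k[b] += 1
    return k
```

Raising `min_shared` removes weakly supported edges at the cost of producing more orphan nodes, which is the trade-off described in the text.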
T122 17864-18016 Sentence denotes To decrease the probability of false connection, we introduced an optimization approach by counting the shared segments between two remote homolog sets.
T123 18017-18099 Sentence denotes The homologous relationship is credible if A shares enough remote homologs with B.
T124 18100-18245 Sentence denotes A low threshold for the number of shared segments results in a high level of false connections, whereas a high threshold results in more orphans.
T125 18246-18289 Sentence denotes Empirically, we recommend 5 as a threshold.
T126 18290-18355 Sentence denotes As shown in Fig. 1, the polypeptide population fits a power law.
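One simple way to check such a fit is a least-squares line through the log-log plot of node counts against connectivity. The sketch below is only an illustration of that idea, not the fitting procedure used for Fig. 1; the function name and the input layout are hypothetical.

```python
import math


def fit_power_law(counts):
    """Fit n(K) ~ c * K**(-a) by linear regression on log-log data.

    counts: dict K -> number of nodes with connectivity K.
    Returns (a, c). Entries with K <= 0 or zero count are skipped.
    """
    pts = [(math.log(k), math.log(n)) for k, n in counts.items() if k > 0 and n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two usable points")
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return -slope, math.exp(my - slope * mx)
```

Log-log regression is easy to read but can be biased by the sparse tail of the distribution, so the fitted exponent should be treated as approximate.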
T127 18356-18455 Sentence denotes Other than this approximate feature, further knowledge has seldom been mentioned in the literature.
T128 18456-18615 Sentence denotes In fact, the network of homologous relationships in the polypeptide universe is so vast and complicated that many researchers have avoided a detailed analysis.
T129 18616-18740 Sentence denotes Consequently, to the best of our knowledge, detailed features of the network of polypeptide relationships are still unknown.
T130 18741-18859 Sentence denotes Here we investigated such details by analyzing the GPR character from a vital subgraph formed by significant vertices.
T131 18860-19020 Sentence denotes Using PAJEK software (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), the representative features of a network can be analyzed by topological transformation.
T132 19021-19078 Sentence denotes In this algorithm, nodes and edges are placed in a plane.
T133 19079-19198 Sentence denotes Relative nodes are close to each other by introducing a virtual attracting force between vertices connected by an edge.
T134 19199-19333 Sentence denotes On introduction of a virtual repulsive force, all vertices are repelled from each other so that no pair of vertices can get too close.
T135 19334-19420 Sentence denotes The topological structure of a network is transformed by minimizing the system energy.
T136 19421-19498 Sentence denotes The coordinates of node clusters converge after energy minimization in PAJEK.
T137 19499-19636 Sentence denotes As a result of such topological transformations, tightly correlated nodes, i.e., homologous polypeptides, are bunched into node clusters.
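The attraction/repulsion idea described above can be illustrated with a small spring-embedder in the spirit of Fruchterman-Reingold layouts. This is a generic sketch and not PAJEK's implementation: the force laws, the cooling schedule, and every name and parameter are assumptions.

```python
import math
import random


def force_layout(nodes, edges, iterations=200, seed=0):
    """Place nodes in the plane: all pairs repel, connected pairs attract."""
    nodes = list(nodes)
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0  # preferred edge length in arbitrary units
    for step in range(iterations):
        temp = 0.1 * (1.0 - step / iterations)  # maximum move per step, shrinking over time
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, u in enumerate(nodes):  # repulsion between every pair of vertices
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                push = k * k / d
                disp[u][0] += dx / d * push
                disp[u][1] += dy / d * push
                disp[v][0] -= dx / d * push
                disp[v][1] -= dy / d * push
        for u, v in edges:  # attraction along each edge
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            pull = d * d / k
            disp[u][0] -= dx / d * pull
            disp[u][1] -= dy / d * pull
            disp[v][0] += dx / d * pull
            disp[v][1] += dy / d * pull
        for n in nodes:  # move each node, capped by the current temperature
            length = math.hypot(disp[n][0], disp[n][1])
            if length > 0.0:
                scale = min(length, temp) / length
                pos[n][0] += disp[n][0] * scale
                pos[n][1] += disp[n][1] * scale
    return {n: tuple(p) for n, p in pos.items()}
```

With these force laws, two connected nodes settle near the distance `k`, where attraction and repulsion balance, while unconnected nodes drift apart; tightly connected groups therefore end up bunched together.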
T138 19637-19725 Sentence denotes We plotted subnetworks of the polypeptide relationship for nodes with connectivity K > 60.
T139 19726-19838 Sentence denotes The vertices of the network were colored according to the protein secondary structure (taken from the DSSP database.
T140 19839-20006 Sentence denotes As in most methods, we considered three types of conformation {h, e, c}, generated from the eight possible types by coarse graining: h, g, i → h; e → e; and x, t, s, b → c).
T141 20007-20253 Sentence denotes In a polypeptide, a subsegment $a_i a_{i+1} a_{i+2} a_{i+3} a_{i+4} a_{i+5} a_{i+6}$ ($i = 0$ or 8) is defined to be H if more than three of its residues are in helix conformation, E if more than three of its residues are in strand conformation, and C otherwise.
T142 20254-20352 Sentence denotes Thus, nine non-overlapping polypeptide states are defined: HH, HC, CH, EE, EC, CE, HE, EH, and CC.
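The state assignment can be sketched as follows. The sketch assumes that the index i = 0 or 8 is zero-based, so the two seven-residue subsegments cover positions 0-6 and 8-14 and the middle residue is not used; it also treats every DSSP code other than H, G, I, and E as coil. The function names are hypothetical.

```python
DSSP_TO_3 = {"H": "h", "G": "h", "I": "h", "E": "e"}  # any other code maps to coil "c"


def reduce_states(dssp):
    """Coarse-grain a DSSP string into the three states h, e, c."""
    return "".join(DSSP_TO_3.get(s, "c") for s in dssp.upper())


def half_state(sub):
    """H or E if more than three of the seven residues are helix or strand, else C."""
    if sub.count("h") > 3:
        return "H"
    if sub.count("e") > 3:
        return "E"
    return "C"


def segment_state(dssp15):
    """Two-letter state (HH, HC, CH, EE, EC, CE, HE, EH, or CC) of a 15-residue segment."""
    if len(dssp15) != 15:
        raise ValueError("expected a 15-residue DSSP string")
    s = reduce_states(dssp15)
    return half_state(s[0:7]) + half_state(s[8:15])
```

Since a seven-residue subsegment cannot contain more than three helix residues and more than three strand residues at once, the H and E tests never conflict.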
T143 20353-20451 Sentence denotes The graphs obtained for these subnetworks are shown in Fig. 2 in decreasing order of connectivity.
T144 20452-20496 Sentence denotes A clear donut-shaped fingerprint is evident.
T145 20497-20604 Sentence denotes The Pajek software includes several options for clustering that differ in force model and distance measure.
T146 20605-20638 Sentence denotes We tested many different options.
T147 20639-20696 Sentence denotes The resulting donut-shaped topological feature is robust.
T148 20697-20794 Sentence denotes Helix segments and N- and C-terminal caps (HH + HC + CH) make up the main body of the donut shape.
T149 20795-20925 Sentence denotes Significant groups/types of strand segments (and their caps) are not connected to the ring or to each other in subnetworks with K > 100.
T150 20926-21038 Sentence denotes When nodes with connectivity of up to 80 are considered (Fig. 2E,F), such connections emerge with decreasing K.
T151 21039-21143 Sentence denotes As shown in Fig. 2E, nodes that connect the diameter of the donut shape appear at approximately K = 60.
T152 21144-21233 Sentence denotes With a further decrease in K, crossings between different parts of the donut ring appear.
T153 21234-21314 Sentence denotes In other words, the ring in the GPR is connected by nodes with low connectivity.
T154 21315-21323 Sentence denotes Fig. 1 .
T155 21324-21370 Sentence denotes Node population as a function of connectivity.
T156 21371-21379 Sentence denotes Fig. 2 .
T157 21380-21445 Sentence denotes Donut-shaped fingerprint of the polypeptide relationship network.
T158 21446-21507 Sentence denotes Nodes with a connection number K greater than 60 are plotted.
T159 21508-21549 Sentence denotes Orphans in these subnetworks are omitted.
T160 21550-21596 Sentence denotes Tightly related nodes are bunched up by PAJEK.
T161 21597-21639 Sentence denotes The donut is rich in HH + HC + CH samples.
T162 21640-21685 Sentence denotes The arc in E is rich in EE + EC + CE samples.
T163 21686-21724 Sentence denotes A connected strand-arc is evident in F.
T164 21725-21817 Sentence denotes In F, there is only one edge (colored in black) that connects the helix-donut to strand-arc.
T165 21818-21986 Sentence denotes We find that the subgraphs shown in Fig. 2 are not trivial node-limited profiles of polypeptide relationships, but characterize the topological feature of the whole GPR.
T166 21987-22122 Sentence denotes As shown in Table 1, for nodes with connectivity K > 80, 7316 no-orphan nodes (K ≠ 0) exist in the corresponding subnetwork (Fig. 2F).
T167 22123-22407 Sentence denotes Although these nodes/segments are contributed by only 7.9% of the amino acids in our database, related (according to the definition in Subsection 2.2) or directly connected (two nodes directly connected by one edge) segments of these nodes cover residues of nearly the whole data set.
T168 22408-22730 Sentence denotes As these segments have similar biological properties to the corresponding nodes, it means that the aforementioned simple topological feature represents the nature of the whole polypeptide universe, i.e., there are two nearly separated regions in the phase space of polypeptide segments: a helix-donut zone and a strand-arc zone.
T169 22731-22783 Sentence denotes The two parts are only weakly connected, by sparse edges.
T170 22784-23008 Sentence denotes Although we cannot draw a picture of the whole GPR because of its extreme complexity, and only vital subgraphs can be depicted, the position of a segment in phase space of polypeptide can be deduced from secondary structure.
T171 23009-23133 Sentence denotes We assumed that HH + HC + CH samples belong to the helix-donut zone, whereas EE + EC + CE samples are in the strand-arc zone.
T172 23134-23211 Sentence denotes Then a picture of the whole graph of the polypeptide universe is constructed.
T173 23212-23288 Sentence denotes Moreover, the origin of the complicated protein universe might be very neat.
T174 23289-23494 Sentence denotes As shown by the first two rows of Table 1, nodes shown in Fig. 2A,B, comprising approximately 2% of the residues in our database, 'determine' the properties of nearly 80-90% of the sites in the database.
T175 23495-23634 Sentence denotes To reveal the reason for the donut shape, we selected the shape shown in Fig. 2C, a network of moderate complexity, for detailed analysis.
T176 23635-23765 Sentence denotes In this graph, the coordinates of node clusters represent the approximate position of a specific group of homologous polypeptides.
T177 23766-23893 Sentence denotes By introducing a virtual center and a clockwise angle $\varphi$, samples of the donut shape in successive $\pi/6$ slices were investigated.
T178 23894-23947 Sentence denotes Polypeptides in each slice were matched site by site.
T179 23948-24178 Sentence denotes The probability densities for buried and hydrophobic residues were calculated for each site (residue classification was: hydrophobic h = {M, F, I, L, V, A, W}, polar p = {C, Y, Q, H, P, G, T, S, N, R, K, D, E} (Liu et al., 2002)).
T180 24179-24331 Sentence denotes As shown in Fig. 3, with the variation of $\varphi$, a successive shift in buried/hydrophobic residues was observed for polypeptides making up the donut shape.
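The slicing analysis can be sketched as below for the hydrophobic profile (the buried profile would follow the same pattern with surface flags in place of residue letters). The reference direction for the clockwise angle is not stated, so the sketch assumes it is measured from the positive x axis; the function names and the `(x, y, segment)` input layout are hypothetical.

```python
import math

HYDROPHOBIC = set("MFILVAW")  # hydrophobic class as listed in the text


def clockwise_angle(x, y, center=(0.0, 0.0)):
    """Clockwise angle in [0, 2*pi) from the positive x axis around `center`."""
    return (-math.atan2(y - center[1], x - center[0])) % (2.0 * math.pi)


def slice_index(x, y, center=(0.0, 0.0), n_slices=12):
    """Index of the pi/6-wide slice (for 12 slices) that contains the point."""
    width = 2.0 * math.pi / n_slices
    return int(clockwise_angle(x, y, center) // width) % n_slices


def hydrophobic_profile(segments):
    """Fraction of hydrophobic residues at each site of site-by-site matched segments."""
    width = len(segments[0])
    return [sum(s[i] in HYDROPHOBIC for s in segments) / len(segments) for i in range(width)]


def profiles_by_slice(nodes, center=(0.0, 0.0)):
    """Group (x, y, segment) nodes by slice and compute one profile per slice."""
    groups = {}
    for x, y, seg in nodes:
        groups.setdefault(slice_index(x, y, center), []).append(seg)
    return {k: hydrophobic_profile(v) for k, v in sorted(groups.items())}
```

Comparing the profiles of successive slices would show whether the hydrophobic positions shift steadily as the angle increases.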
T181 24332-24428 Sentence denotes Thus, the distribution of buried/hydrophobic residues may be closely related to the donut shape.
T182 24429-24536 Sentence denotes Since helix forms are abundant in this shape, the period of buried/hydrophobic residues is approximately 4.
T183 24537-24636 Sentence denotes Moderate conversion of their structure is vital for the biological properties of protein molecules.
T184 24637-24767 Sentence denotes Thus, moderate changes in the structure of homologous protein are allowable, whereas a significant conversion may not be possible.
T185 24768-24943 Sentence denotes As illustrated by vital subgraphs (Fig. 2E,F), there are a limited number of nodes in the whole GPR that form a 'bridge' between the helix-donut zone and the strand-arc zone.
T186 24944-25080 Sentence denotes This means that, in terms of protein evolution, significant structural conversion, e.g., a change from a helix to a sheet, is difficult.
T187 25081-25176 Sentence denotes Consequently, the protein universe is in a relatively steady state, with infrequent exceptions.
T188 25177-25327 Sentence denotes One well-known exception is the prion protein (PrP), which exhibits a change in structure under pathological conditions (Prusiner, 1982, 1998).
T189 25328-25603 Sentence denotes PrP is deemed to be responsible for transmissible spongiform encephalopathies (TSEs), a group of fatal neurodegenerative diseases that are associated with conformational conversion of the normally monomeric and α-helical protein molecule, PrP^C, to the β-sheet-rich PrP^Sc.
T190 25604-25969 Sentence denotes TSEs arise in several mammalian species by genetic, infectious, or sporadic means, and include bovine spongiform encephalopathy in cattle, scrapie in sheep, chronic wasting disease in cervids, and Creutzfeldt-Jakob disease and kuru in humans (Prusiner, 1982, 1998; Caughey and Baron, 2006; Collinge, 2001; Aguzzi and Polymenidou, 2004; Weissmann, 2004).
T191 25970-26099 Sentence denotes It is now widely accepted that in these protein-only diseases (Prusiner, 1982), TSE transmission does not require nucleic acids.
T192 26100-26324 Sentence denotes As a marginally stable form between α-helix-rich and β-sheet-rich states, PrP must be a protein with inbuilt polypeptides related to some 'bridge' nodes of the GPR (nodes connecting the helix-donut zone to the strand-arc one).
T193 26325-26440 Sentence denotes It is also reasonable that such inbuilt polypeptides should correlate with the origin of conformational conversion.
T194 26441-26722 Sentence denotes Although identifying a detailed mechanism for this structural conversion is beyond the scope of the present study, and the structure of PrP Sc is also largely unknown, we can apply our view of protein evolution to identify the segment in which the conformational conversion arises.
T195 26723-26865 Sentence denotes By sliding a window along the residue sequence, each 15-residue segment of human PrP (hPrP, 121-230, PDB ID: 1QM2) was analyzed in terms of the GPR.
T196 26866-26962 Sentence denotes Vertices and edges in the whole GPR were used as a framework for defining polypeptide relationships.
T197 26963-27090 Sentence denotes For a given segment of hPrP, the top 30 remote homologs were searched in the whole GPR with the method described in Subsection 2.1.
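The window scan described here can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sliding_segments`, its parameters, and the placeholder sequence are hypothetical, and the real hPrP sequence from PDB entry 1QM2 is not reproduced. The sketch assumes each segment is indexed by its central residue, which matches the later statement that identified nodes are assigned to the central residue of the query segment.

```python
def sliding_segments(sequence, first_residue, window=15):
    """Yield (central_residue_number, segment) for every full window.

    sequence      -- one-letter residue string (placeholder here)
    first_residue -- residue number of sequence[0] (121 for hPrP 121-230)
    window        -- segment length; 15 in the text
    """
    half = window // 2
    for start in range(len(sequence) - window + 1):
        center = first_residue + start + half
        yield center, sequence[start:start + window]


# Placeholder of the right length (residues 121-230 span 110 positions);
# it is NOT the real hPrP sequence.
placeholder = "A" * 110
windows = list(sliding_segments(placeholder, first_residue=121))
print(len(windows), windows[0][0], windows[-1][0])  # 96 128 223
```

Under these assumptions, residues 121-230 yield 96 windows whose centres run from 128 to 223, so sites near either end of the construct have no window centred on them.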
T198 27091-27624 Sentence denotes By definition, these remote homologs are highly similar to the query hPrP segment in sequence and structure, i.e., they represent agents of the query segment. [Table footnote: Count_NON, number of no-orphan nodes in a subnetwork; Coverage-NON, coverage given by no-orphan nodes in a subnetwork; Coverage-SRNON, coverage given by self and related nodes of the no-orphan nodes in a subnetwork (if B ∈ {R_A} or A ∈ {R_B}, A and B are related); Coverage-SCNON, coverage given by self and directly connected nodes of the no-orphan nodes in a subnetwork.]
T199 27625-27693 Sentence denotes Nodes of the GPR directly connected to these agents were identified.
T200 27694-27884 Sentence denotes States (HH, HC, etc.) of these collected nodes indicate the probability of whether an agent belongs to a helix-donut zone (corresponding to HH + HC + CH) or a strand-arc zone (EE + EC + CE).
T201 27885-27972 Sentence denotes We assigned identified nodes to sites of the central residue of the query hPrP segment.
T202 27973-28055 Sentence denotes The frequencies of the types of nodes identified are shown in Fig. 4 site by site.
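The per-site tally described above can be sketched as below. The grouping of states into zones (HH, HC, CH for helix-donut; EE, EC, CE for strand-arc) comes from the text; everything else is an assumption made for illustration, including the function name `zone_frequencies`, the input layout (a mapping from site number to the list of states of the collected nodes), and the choice to normalise counts by the number of nodes at each site. The example states are invented and are not data from Fig. 4.

```python
from collections import Counter

HELIX_DONUT = {"HH", "HC", "CH"}   # zone grouping as given in the text
STRAND_ARC = {"EE", "EC", "CE"}


def zone_frequencies(states_by_site):
    """Return {site: (helix_donut_fraction, strand_arc_fraction)}.

    States outside both groups count toward the total only
    (an assumption; the text does not say how they are handled).
    """
    result = {}
    for site, states in states_by_site.items():
        counts = Counter(states)
        total = sum(counts.values())
        if total == 0:
            result[site] = (0.0, 0.0)
            continue
        helix = sum(counts[s] for s in HELIX_DONUT)
        strand = sum(counts[s] for s in STRAND_ARC)
        result[site] = (helix / total, strand / total)
    return result


# Invented toy input, for illustration only.
toy = {130: ["EE", "EE", "CE", "HH"], 195: ["HH", "HC", "EE", "EC"]}
print(zone_frequencies(toy))  # {130: (0.25, 0.75), 195: (0.5, 0.5)}
```

A site whose native state is helical but which still shows a sizeable strand-arc fraction is the kind of candidate the subsequent discussion treats as a possible switch site.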
T203 28056-28310 Sentence denotes Aggregation-prone regions are usually blocked in the native state of globular proteins, either because their side chains are hidden in the inner hydrophobic core or because the cellular environment does not permit the conditions required for aggregation (Dobson, 1999).
T204 28311-28382 Sentence denotes Pathogenic misfolding starts in a region where the unlocking begins.
T205 28383-28436 Sentence denotes Such a region acts like a switch that exposes sensitive regions.
T206 28437-28609 Sentence denotes Once fully or partly exposed (Claudio, 2001), the aggregation-prone regions have a further chance to form amyloid along the subsequent folding pathway.
T207 28610-28757 Sentence denotes Here we attempted to predict such switch sites in hPrP using the GPR feature of sparse connections between the helix-donut zone and the strand-arc zone.
T208 28758-28868 Sentence denotes We assumed that conformational conversion is due to a transition between two regions of polypeptide phase space.
T209 28869-28976 Sentence denotes If polypeptides change their structures near native conformations, there will be no conformational disease.
T210 28977-29092 Sentence denotes We should therefore focus on segments that are prone to fold into structures other than their native conformations.
T211 29093-29219 Sentence denotes In Fig. 4, apart from the sites of the two native strands at approximately 130 and 160, there is a peak for EE + EC + CE at site 195.
T212 29220-29359 Sentence denotes With a high frequency for EE + EC + CE and a low frequency for HH + HC + CH, the two native strands can easily extend according to the GPR.
T213 29360-29461 Sentence denotes Due to thermal motion, a protein molecule can moderately change its conformation at room temperature.
T214 29462-29639 Sentence denotes Such facile extension of the native strands must be tolerated by PrP C; if it could cause disease, the corresponding life-form would have been lost during evolution.
T215 29640-29746 Sentence denotes Therefore, it is likely that these sites are not responsible for conformational changes related to disease.
T216 29747-29806 Sentence denotes On the other hand, sites around position 195 are different.
T217 29807-29886 Sentence denotes As shown in Fig. 4, the native conformation of this region is in HH + HC + CH.
T218 29887-30027 Sentence denotes As these sites have a high probability of being in their native helix-donut state, it is normally difficult for them to change to a strand-arc state.
T219 30028-30190 Sentence denotes In this special case, however, there is a reasonable probability that the polypeptide will transform to the strand-arc region, i.e., induce a conformational conversion.
T220 30191-30313 Sentence denotes Consequently, residues around position 195 (approximately 188-202) should be responsible for the disease-related conformational change.
T221 30314-30592 Sentence denotes This conclusion contrasts with earlier, largely theoretical models, and is consistent with the experimental observations of Kuwata et al. (2007) that intercalation of the anti-prion compound GN8 at residues N159, V189, T192, K194, and E196 hampers the pathogenic conversion process.
T222 30593-30750 Sentence denotes Moreover, as non-redundant polypeptides were used throughout our approach and analysis, this conclusion should be the same for all members of the PrP family.
T223 30751-30899 Sentence denotes Here we identified a simple feature of the evolution of protein molecules, and presented a general picture of the non-membrane polypeptide universe.
T224 30900-31032 Sentence denotes In the GPR there are few shortcuts across the diameter of the donut and few 'bridges' between the helix-donut zone and the strand-arc zone.
T225 31033-31077 Sentence denotes Such crossing nodes are of low connectivity.
T226 31078-31150 Sentence denotes This indicates that homologous relationships generally evolve gradually.
T227 31151-31481 Sentence denotes Most polypeptides evolved strictly along a helix-donut or a strand-arc track, with very few samples exhibiting a drastic shift in biological properties during evolution. For example, as shown in Fig. 3, such evolution induces a gradual change in the distributions of buried and hydrophobic residues and thus in biochemical properties.
T228 31482-31560 Sentence denotes Interestingly, this evolution can finally hook up and form a ring.
T229 31561-31780 Sentence denotes Since the present work focuses on divergent evolution, it suggests that divergent evolution can result in convergent evolution at a sub-domain level, but in a gradual way that induces a donut-shaped topological feature.
T230 31781-31861 Sentence denotes It is worth considering the formation of the donut ring further.
T231 31862-31957 Sentence denotes The shift in the distributions of buried and hydrophobic residues correlates strongly with the donut.
T232 31958-32087 Sentence denotes As polypeptide segments evolve, however, there should be opportunities to form different groups of polypeptide structures.
T233 32088-32238 Sentence denotes Each group would have its own donut-shaped fingerprint with a shift in buried/hydrophobic residues, but a different pattern of change in three-dimensional structure.
T234 32239-32284 Sentence denotes This would result in several connected rings.
T235 32285-32308 Sentence denotes However, this does not happen.
T236 32309-32465 Sentence denotes As the GPR is a network with moderate structural deviation, a two-step evolution may cause a structural change as large as that between different groups.
T237 32466-32607 Sentence denotes This can be illustrated by the insertions in Fig. 3, where there is no obvious pattern in the distribution of protein secondary structure.
T238 32608-32700 Sentence denotes Consequently, the candidate donut rings finally join together, and only one ring results.
T239 32701-32912 Sentence denotes On this view, the graph for longer polypeptide segments will correspond only to a further shift in the distribution of buried/hydrophobic residues, and the single-ring donut-shaped fingerprint will recur.
T240 32913-33017 Sentence denotes Indeed, we have drawn the graphs for segments of length 17 and 19 and found similar fingerprints.
T241 33018-33192 Sentence denotes A marked difference between this study and others is that the picture obtained not only provides details of topological features, but also has direct and important applications.
T242 33193-33333 Sentence denotes According to the GPR, sparse connection between the helix-donut zone and the strand-arc zone is an indicator of conformational changes related to disease.
T243 33334-33506 Sentence denotes As shown by the analysis of PrP, we can use conformational information on one state to deduce the switch sites for structural conversion related to pathological conditions.
T244 33507-33740 Sentence denotes This study can be extended to other conformational diseases, such as sickle cell anemia, antithrombin deficiency thromboembolic disease, and familial amyloid neuropathy (see supporting information; details to be published elsewhere).
T245 33741-33877 Sentence denotes Identification of the site of origin of such conformational conversions is extremely important in designing suitable therapeutic approaches.
T246 33878-34033 Sentence denotes By revealing switch sites for structural conversion, we can design drugs to hamper this pathogenic conversion process, or even upgrade species by mutation.
T247 34034-34243 Sentence denotes Such switch sites are usually determined from cases in which the switch role is evident, such as disease-related point mutations reported in the clinic or experiments that hamper the pathogenic conversion process.
T248 34244-34286 Sentence denotes The cost of such research is considerable.
T249 34287-34452 Sentence denotes More significantly, the disease conformation was believed to be a prerequisite in previous research, which limited the number of proteins that could be investigated.
T250 34453-34559 Sentence denotes A systematic study of conformational diseases in organisms was thus beyond the scope of previous approaches.
T251 34560-34783 Sentence denotes The knowledge provided by the GPR can be used to remove the need for disease structures, to predict target sites for clinical treatment, and to investigate suitable therapy schemes based on normal proteins.
T252 34784-34865 Sentence denotes This new approach could lead to great progress in curing conformational diseases.
T253 34866-35131 Sentence denotes Moreover, as demonstrated by the example described here, the GPR considers both structural information and sequence identity, and thus represents a suitable strategy for meeting challenges in the design of conformational protein switches (Ambroggio and Kuhlman, 2006).
T254 35132-35168 Sentence denotes Connections in the GPR are highly accurate.
T255 35169-35279 Sentence denotes As shown in Fig. 2F, none of the 109,045 edges of the subgraph makes a false connection crossing the diameter.
T256 35280-35503 Sentence denotes As our aim was to provide a general picture of the whole universe of polypeptide segments, nodes and connections should be both representative and properly weighted so that the resulting feature is universal but not biased.
T257 35504-35590 Sentence denotes Consequently, remote homologous relationships were selected as a feasible representation.
T258 35591-35749 Sentence denotes In this representation, however, if the criterion for structural similarity is too strict, there will be a drastic decrease in the number of suitable candidates.
T259 35750-35812 Sentence denotes Thus we set the cut-off as drms < 4 Å, which is a moderate level.
T260 35813-35864 Sentence denotes This provides the opportunity for false connections.
T261 35865-35920 Sentence denotes However, such false connections are ultimately controlled.
T262 35921-35959 Sentence denotes This owes much to the HFnet algorithm.
T263 35960-36173 Sentence denotes In 2008 we suggested that the family-representative intramolecular hydrophobic force network makes a crucial contribution to the biological properties conserved throughout protein evolution (Liu et al., 2008a).
T264 36174-36231 Sentence denotes This idea captures an important aspect of protein evolution.
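A structural cut-off of this kind can be illustrated with the standard distance root-mean-square deviation, which compares the internal pairwise distances of two equally long segments and needs no superposition. Whether the authors' drms is computed exactly this way, and over which atoms, is not stated here, so the sketch below is an assumption: the function names `drms` and `is_structurally_similar` are hypothetical, and coordinates are taken to be one (x, y, z) point per residue in Å.

```python
import math
from itertools import combinations


def drms(coords_a, coords_b):
    """Distance RMS between two equally long lists of (x, y, z) points.

    sqrt(mean over residue pairs i<j of (d_ij(A) - d_ij(B))**2)
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("segments must have the same number of points")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        raise ValueError("need at least two points per segment")
    total = 0.0
    for i, j in pairs:
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        total += (d_a - d_b) ** 2
    return math.sqrt(total / len(pairs))


def is_structurally_similar(coords_a, coords_b, cutoff=4.0):
    """Apply the cut-off from the text (drms < 4 Å)."""
    return drms(coords_a, coords_b) < cutoff


# Toy two-point segments: internal distances of 3 Å and 5 Å.
a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(drms(a, b), is_structurally_similar(a, b))  # 2.0 True
```

Because only internal distances are compared, the measure is unchanged by rotating or translating either segment. It cannot, however, distinguish a structure from its mirror image.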
T265 36232-36378 Sentence denotes Based on this theory, we have developed a model called HFnet to evaluate the significance of each sequence in a given multiple sequence alignment.
T266 36379-36461 Sentence denotes The power of HFnet has been demonstrated not only in silico, but also in wet-lab experiments.
T267 36462-36563 Sentence denotes Based on the HFnet algorithm, we previously designed five artificial remote proteins of the WW domain.
T268 36564-36802 Sentence denotes As all of them have low pairwise sequence identity (<30%) with each other and with each protein in the learning set, it is usually difficult to design such sequences, let alone a family sharing specific biological properties.
T269 36803-36896 Sentence denotes However, in biological experiments, four of them exhibited detectable ligand-binding affinity.
T270 36897-37069 Sentence denotes These experimental data demonstrate that our theory and the HFnet algorithm are robust, and that they capture not only protein structure but also biological properties.
T271 37070-37187 Sentence denotes In the present case, the HFnet algorithm contributed an increase of at least 50% in the accuracy of remote homolog identification.
T272 37188-37320 Sentence denotes However, as only two letters are used in HFnet, such a simple algorithm may miss signals for segments that are too short.
T273 37321-37389 Sentence denotes This was another consideration when selecting the 15-residue window.
T274 37390-37514 Sentence denotes Fortunately, as structural information was also considered in this work, a 15-residue polypeptide was long enough for HFnet.
T275 37515-37642 Sentence denotes With a decrease in residue-residue correlation (Liu et al., 2003) , a greater window length would cover more secondary factors.
T276 37643-37806 Sentence denotes However, as there are only two major conformations, a helix and a strand, in protein molecules, the donut-arc topological feature should not be remarkably modified.
T277 37807-38009 Sentence denotes As we have minimized the false signals during network construction, the vertices and connections in the GPR can be used as a framework that reliably represents the universe of polypeptide relationships.
T278 38010-38102 Sentence denotes The biological properties of a protein can be credibly predicted from such a representation.
T279 38103-38182 Sentence denotes This will facilitate studies of complex proteins and allow noise-free analysis.
T280 38183-38277 Sentence denotes Further improvements and applications of this representation are currently being investigated.