PMC:1867812 / 2030-9994
Annotations
2_test
{"project":"2_test","denotations":[{"id":"17407611-12422224-1692912","span":{"begin":692,"end":693},"obj":"12422224"},{"id":"17407611-15759612-1692913","span":{"begin":1368,"end":1369},"obj":"15759612"},{"id":"17407611-16328945-1692914","span":{"begin":1776,"end":1777},"obj":"16328945"}],"text":"1 Background\nThe three dimensional (3D) native structures of proteins have important implications in proteomics. Understanding such structures enables us to explore the function of a protein, explain substrate and ligand binding, perform realistic drug design and potentially cure diseases caused by protein misfolding. The protein folding problem is therefore one of the most fundamental yet unsolved problems in computational molecular biology. One major challenge in simulating the protein folding process is its complexity. Snow et al. state that performing a Molecular Dynamics (MD) simulation on a mini-protein for just 10 μs would require decades of computation time on a typical CPU [1]. Researchers in the Folding@home project recently proposed a World Wide Web-based computing model to simulate the protein folding process [2].\nAs the volume of folding trajectories produced from high-throughput simulation tools increases drastically, there is an urgent need to compare, analyze, and manage such data. Previously, researchers have examined several summary statistics (e.g. radius of gyration, root mean square deviation (RMSD)) to identify similar 3D conformations in folding trajectories. Although summary statistics are commonly used for comparison, they can only capture biased and limited global properties of the conformation. Recently, Russel et al. [3] suggested using geometric spanners for mapping a simulation to a more discrete combinatorial representation. 
They apply geometric spanners to discover the proximity between different segments of a protein across a range of scales, and track the changes of such proximity over time.\nTo overcome the difficulties in managing and analyzing the large amount of protein folding simulation data, Berrar et al. [4] proposed using a data warehouse system. They embed the warehouse in a grid computing environment to enable data sharing. They also propose implementing a set of data mining algorithms to facilitate commonly needed data analysis tasks.\nIn this article, we propose a spatio-temporal mining approach to analyze folding trajectories. We extend the spatio-temporal data mining framework that we have developed earlier to analyze and manage such data [5]. This framework is designed to analyze spatio-temporal data produced in several scientific domains. Previously, we have applied this framework to analyze 8732 proteins taken from the Protein Data Bank to identify structural fingerprints for different protein classes (e.g., α-proteins) [6]. Each protein is associated with a set of objects that are extracted from its contact map. We then realize the notion of Spatial Object Association Pattern (SOAP) to effectively capture spatial relationships among such objects. Furthermore, by associating SOAPs with proteins in different protein classes, we have identified multiple types of SOAPs that can potentially function as the structural fingerprints for different protein classes. In this article, we extend such strategies to a new application domain: analyzing and characterizing the folding process of a protein.\nClearly, protein folding trajectories consist of both spatial and temporal components. Each protein in an MD simulation is composed of a number of residues spatially located in the 3D space that move over time. Each frame (or snapshot) of the trajectory can be represented as a 2D contact map, which captures the pair-wise 3D distances between residues. 
We extract non-local bit-patterns from these contact maps. We then use an entropy-based clustering algorithm to cluster such bit-patterns into groups. These bit-patterns are further associated to form spatial object association patterns (SOAPs). By using SOAPs, we are able to effectively summarize and analyze folding trajectories produced by MD simulations. A major advantage of this representation is its appropriateness for cross-comparison across different simulations, as discussed in later sections.\nCompared to our previous work on protein structural analysis [6,7], we have made the following contributions:\n• Propose a contact map-based approach to analyze protein folding trajectories: Our previous work focused on identifying structural signatures in native conformation of proteins in different classes or folds. Thus, there is no temporal component involved. In contrast, a folding trajectory has both spatial and temporal components. In addition, bit-patterns in a folding trajectory will interact with each other and evolve over time. Moreover, the proposed approach also effectively integrates 3D structural information in the overall analysis. This is critical in understanding the protein folding mechanism.\n• Map 2D bit-patterns in contact maps with 3D structural motifs: To better understand and explain the biological meaning of the bit-patterns in contact maps, we have made an effort to establish a mapping between such bit-patterns and well-known structural motifs (e.g., α-helices and β-turns) in 3D conformations. Currently, this task is carried out manually. We are in the process of automating this mapping. Such a 2D-3D mapping is essential to folding data analysis due to the following reasons: First, to gain insight into the folding process, it is critical to identify the formation of important local 3D motifs such as β-turns. 
Second, our previous studies show that by associating multiple bit-patterns in contact maps, one can construct effective structural signatures for different protein classes or folds [6]. This leads us to hypothesize that a mapping might exist between 2D bit-patterns in contact maps and 3D local motifs of a protein. In this work, we validate this hypothesis and report the mapping result later. Finally, such a mapping not only enables one to take advantage of the simplicity of working in the 2D space of contact maps, but also allows one to relate to the 3D space of protein conformations. This is important in understanding the protein folding process.\n• Indirectly capture interactions among structural motifs in 3D space: In our previous work, two bit-patterns are considered spatially proximate if they are located in the same vicinity within a 2D contact map. This is problematic in the context of protein folding, as two bit-patterns can be spatially proximate in a contact map even though their corresponding motifs are distant in the 3D conformation. (See Section 3 for more details.) We address this issue by considering the 3D distance between two bit-patterns.\n• Propose novel strategies to analyze protein folding trajectories: We propose several novel strategies to analyze protein folding trajectories based on spatial and spatio-temporal association patterns.\nIn summary, one can benefit from our mining approach in two main aspects:\n• Effective, informative and scalable representation of folding simulations: We represent each frame by a set of SOAPs, where each SOAP in turn characterizes the spatial relationship (or interactions in the folding case) among multiple bit-patterns. 
SOAPs are not only easily obtainable but also, as we will show, able to capture folding events along a folding trajectory.\n• Cross-analysis of trajectories to reveal a consensus partial folding pathway: By representing each frame as a set of SOAPs, one can carry out analysis across different trajectories. Such analysis includes detecting critical events and identifying consensus partial folding pathways across trajectories.\nThe remainder of the article is organized as follows. In Section 2, we describe the two proteins, BBA5 and GSGS, and their trajectories produced by computational simulation. We also identify two main goals for analyzing such trajectories. In Section 3, we present a step-by-step description of our analysis approach. We next report the empirical results on analyzing the trajectories of the two proteins in Section 4. We focus on the protein BBA5. Finally, we conclude and report several ongoing research directions in Section 5."}