3 Algorithm In this section, we describe in detail the proposed approach for analyzing protein folding trajectories. As shown in Figure 4, such an approach consists of three main phases: (I) Data preprocessing, (II) Spatio-temporal object association pattern mining, and (III) Trajectory analysis. We next discuss each phase in further details. Figure 4 Algorithm. Main steps of summarizing and analyzing protein folding trajectories. 3.1 Data Preprocessing Same as in our previous studies on protein structural analysis [6,7], we represent 3D protein conformations by contact maps. In order for this algorithm to be self-contained, we next briefly go over these preprocessing steps. We also explain the rationale of such steps in the context of protein folding. Contact Map Generation When generating contact maps, we consider the Euclidean distances between α-carbons (Cα) of each amino acid. Two α-carbons are considered to be in contact if their distance is within 8.5 Å. Thus, for a protein of N residues, its contact map is an N × N binary matrix, where the cell at (i, j) is 1 if the ith and jth α-carbons are in contact, 0 otherwise. Since contact maps are symmetric across the diagonal, we only consider the bits below the diagonal. Furthermore, we also ignore the pairs of Cα atoms whose distance in the primary sequence is ≤ 2, as they are sure to be in contact. This step transforms the two BBA5 trajectories into two series of contact maps, with each contact map of size 23 × 23. By the same token, the 5 GSGS trajectories are transformed into 5 sequences of contact maps. Identifying Maximally Connected Bit-patterns Every bit in a contact map has eight neighbor bits. For an edge position, we assume its out-of-boundary positions contain 0. In a contact map, a connected bit-pattern is a collection of bit-1 positions, where for each 1, at least one of its neighbors is 1. Correspondingly, we define a maximally-connected bit-pattern (also referred to as a bit-pattern in this article) to be a connected pattern p where every neighbor bit not in p is 0. We apply a simple region growth algorithm to identify all the maximally-connected patterns in each contact map within the two series of contact maps, corresponding to the two folding trajectories of BBA5. Altogether, we identified 352 maximally-connected bit-patterns in such contact maps. For the GSGS folding data, a total of 50,572 unique bit-patterns are constructed. We then represent each identified bit-pattern as a 6-tuple feature vector consisting of the following attributes: • Height: the number of rows contained in the pattern's Minimum Bounding Rectangle (MBR). • Width: the number of columns in the pattern's MBR. • NumOnes: the number of 1s in the pattern. • Slope: the general linear distribution trend of all the 1s in the pattern within its MBR. To compute the angle of a connected pattern we use the least-squares method to estimate the slope of a linear regression line. For a pattern containing n 1s, we denote the positions of the 1s as: (x1, y1)...(xn, yn). The least-squares method then estimates the slope β1 as: β1=∑i=1n((xi−x¯)∗(yi−y¯))/∑i=1n((xi−x¯)2) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFYoGydaWgaaWcbaGaeGymaedabeaakiabg2da9maaqadabaGaeiikaGIaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGHsislcuWG4baEgaqeaiabcMcaPiabgEHiQiabcIcaOiabdMha5naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmyEaKNbaebacqGGPaqkcqGGPaqkcqGGVaWldaaeWaqaaiabcIcaOiabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IafmiEaGNbaebacqGGPaqkdaahaaWcbeqaaiabikdaYaaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOBa4ganiabggHiLdaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGUbGBa0GaeyyeIuoaaaa@5A04@ • xStdDev: the standard deviation of all the 1s' x-coordinates (this quantifies how the 1s spread along the x dimension). • yStdDev: the standard deviation of all the 1s' y-coordinates. Note that this feature vector captures the main geometric properties of a bit-pattern. As discussed in the literature [18-21], non-local patterns (where bit-patterns are one type of non-local patterns,) in contact maps can effectively capture the secondary structure of proteins. Our previous work [6,7] demonstrated that by characterizing the spatial relationship among the above described bit-patterns, one can construct structural signatures for proteins of different classes or folds. In the context of protein folding, we have observed that the above-defined bit-patterns are also capable of capturing a wide range of local 3D structural motifs. They can even approximately measure the strength of secondary structure propensity in a conformation. For instance, we have identified bit-patterns that correspond to "premature" α-helices and native-like α-helices respectively. Henceforth, we refer to the 3D structure formed by all the participating residues of a bit-pattern as the 3D motif of the bit-pattern. The relationship between bit-patterns and 3D motifs will be further discussed in the next section. Clustering Bit-patterns into Approximately Equivalent Groups In this step, we partition the extracted bit-patterns into approximately equivalent groups, each of which consists of bit-patterns that exhibit similar geometric properties (e.g., shape and size). To construct such equivalent groups, we run the k-means based clustering algorithm [22] over the bit-patterns' corresponding feature vectors, where k is the number of clusters (or equivalent groups) that will be produced. To determine an optimal value of k, we take the following three steps. First, we run the clustering algorithm on different k values. This produces different clustering schemes for the same set of bit-patterns. Second, for each clustering scheme, we compute its entropy. Let c1, ..., cl be the l clusters after clustering the set of N bit-patterns. Furthermore, each cluster ci (1 ≤ i ≤ l) has an individual entropy Hi and contains Ni elements, then the total entropy of this clustering is given by the following formula: H=∑i=1kHi∗(Ni/N) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGibascqGH9aqpdaaeWaqaaiabdIeainaaBaaaleaacqWGPbqAaeqaaOGaey4fIOIaeiikaGIaemOta40aaSbaaSqaaiabdMgaPbqabaGccqGGVaWlcqWGobGtcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHris5aaaa@3F89@ The entropy of each individual cluster, i.e., Hi , is computed by summing up the entropy of each of the six bit-pattern attributes such as its height and width. For an attribute, we compute its entropy in a cluster according to the procedure explained by Shannon [23]. In the third and final step, we plot the entropy against the number of clusters, i.e., k, and choose a value k where the entropy plot begins to show a linear trend. For the BBA5 folding data, this clustering step groups the 352 bit-patterns into 10 clusters (or types). As for the GSGS data, 12 clusters are identified. Intuitively, the 3D motifs of the bit-patterns in a cluster will also have similar 3D geometric properties. This is verified based on our manual analysis on the BBA5 trajectories. Figure 5 illustrates the representative 3D motifs.corresponding to the 9 of 10 types of bit-patterns identified in BBA5 trajectories. We omit type 0, as bit-patterns of this type, unlike the others, correspond to a wide variety of 3D motifs. Figure 5 Mapping between 2D bit-patterns and 3D sub-structures. The figure visualizes the representative 3D sub-structures corresponding to the 10 classes of bit-patterns identified in the contact maps along BBA5's two folding trajectories. The bit-patterns shown here are randomly selected from their respective group for illustration purpose. We also observed a similar scenario for the 12 types of bit-patterns identified in the GSGS trajectories. For instance, the typical 3D motifs of type 0 bit-patterns resemble the native conformation of GSGS (See Figure 2(a)); whereas those of type 6 identify with α-helices. Upon a closer look at this 2D-3D mapping illustrated in Figure 5, one can observe the following interesting aspects. First, multiple types of bit-patterns can be associated with a single type of 3D motif. For instance, there are 3 types of bit-patterns are mapped to an α-helical motif. Second, contrary to a commonly accepted belief that β-turns or β-sheets cannot be captured by maximally connected bit-patterns as defined earlier, our analysis shows that this belief does not stand. To illustrate this point, we take two examples. The first example, illustrated in Figure 6, corresponds to the β-turn structure. As shown in Figure 6(b), the β-turn formed by the first 10 Cα atoms of BBA5 can be captured by the maximally connected bit-pattern shown in Figure 6(a). The second example, shown in Figure 7, illustrates that a two turn β-sheet (Figure 7(b)) can also be captured by a bit-pattern (Figure 7(a)). Finally, not every type of bit-patterns can be mapped to a typical 3D motif. This might be attributed to our entropy-based criteria for selecting an "optimal" value of the parameter k in the clustering task. Figure 6 β-turns vs. maximally connected bit-patterns: an example. (a) A type 8 bit-pattern is identified in the 166th frame of the BBA5 T23 trajectory. This bit-pattern corresponds to the the connected 1s in the table, where a '1' indicates two corresponding Cα atoms are in contact,'-' otherwise. This pattern consists of the first 10 Cα atoms. (b) The 3D conformation of this frame, where the first 10 Cα atoms resembles a β-turn. Figure 7 β-sheets vs. maximally connected bit-patterns: an example. (a) A type 0 bit-pattern is identified in the 24201th frame of the GSGS T1 trajectory. This bit-pattern corresponds to the the connected 'x'-es in the table, where an 'x' indicates two corresponding Cα atoms are in contact,'-' otherwise. It consists of Cα atoms from 5 through 20. (b) The 3D conformation of this frame, where the 5–20 Cα atoms resembles a β-sheet of two turns. This demonstrates, to a certain extent, the advantage of using 2D contact maps to analyze 3D protein conformations. Undoubtedly, using contact maps greatly reduces the computational complexity, though at the cost of loss in structural information. However, some of this information loss is re-compensated by mapping bit-patterns to structural motifs in 3D conformations. More importantly, by exploiting different features in contact maps (bit-patterns in this work), we are able to connect 2D features with features in 3D space. In the BBA5 case, by identifying 10 types of bit-patterns in contact maps, we indirectly recognize 10 different 3D structural motifs in the folding conformations. Re-labeling Bit-patterns with The Corresponding Cluster Label In this step, we re-label all the previously identified bit-patterns with their corresponding cluster label. Let p be a labeled bit-pattern. It can be represented as follows: p = (trajID, frameID, listCα, label). Here, trajID identifies a folding trajectory, and frameID indicates the frame where p occurs, listCα consists of all participating α-carbons of p, identified by their position in the primary sequence. Finally, label is the cluster label of p. For BBA5, label ∈ {g0, g1, ⋯, g9}, corresponding to the 10 approximately equivalent groups (or types). 3.2 Mining Spatio-temporal Object Association Patterns The preprocessing steps transform a 3D protein conformation into a set of labeled 2D bit-patterns, that indirectly capture the local 3D structural characteristics of the conformation. For the two BBA5 trajectories, each conformation contains an average of 6 bit-patterns. As for the five GSGS trajectories, the average number of bit-patterns in each conformation is 4. As BBA5 and GSGS fold, the dynamics among their residues is constantly changing until it reaches an equilibrium. This means that two residues previously in contact may become out of contact later. As a result, bit-patterns present in one conformation may be absent in the next. The evolving nature of contacting residues and in turn bit-patterns, is essentially the consequence of a variety of weak interactions among amino acids at different levels. Such weak interactions include hydrogen bonds, electrostatic interactions, van der Waal's packing and hydrophobic interactions [24]. To capture these (potential) interactions, a simple yet effective method is to consider how close two amino acids are located from each other in 3D. We also adopt this method here. Specifically, we consider interactions between local 3D motifs captured by labeled bit-patterns. We denote such interactions as "interactions among bit-patterns". Let pi and pj be two bit-patterns in a protein conformation, and pi.listCα and pj.listCα be the list of α-carbons involved in pi and pj, respectively. We define piand pj as interacting bit-patterns if at least one pair of α-carbons, each from pi.listCα and pj.listCα are located within a short distance δ. Note that the value of δ should be greater than the distance that is being used to identify contacting α-carbons when generating contact maps. In our analysis, we set δ = 10 Å. It is noteworthy that the above notion of interacting bit-patterns is new compared to our previous work, where two bit-patterns are associated if their distance in the 2D contact map space is below a certain threshold. This can be misleading in the context of protein folding analysis. As demonstrated in Figure 8, the two bit-patterns-BP #1 and BP #2-are only 2 amino acids away in the 2D contact map. However, they can be relatively far apart in 3D. On the other hand, although the bit-patterns BP #2 and BP #3 are relatively far apart from each other in the 2D contact map, they are close to each other in 3D. Therefore, measuring the distance between bit-patterns in the actual 3D conformation is more robust with respect to capturing potential interaction among local motifs. Figure 8 Discrepancy between distances in 2D and 3D spaces. Bit-patterns that are close to each other in the 2D contact map space, for instance, BP#1 and BP#2, can be distant from each other in 3D. Similarly, bit-patterns that are distant in 2D space, for instance, BP#1 and BP#3, can be close to each other in 3D. So far, we have discussed our approach of using bit-patterns in contact maps to characterize local 3D motifs and further represent a protein conformation during folding. We also define the notion of interacting bit-patterns in the folding context. We are now ready to present our method of summarizing folding trajectories to fulfill the two objectives described in Section 2.3. The main idea is that we can summarize a folding trajectory by characterizing the evolutionary behavior of interactions among different types of bit-patterns and in turn, the interactions among local 3D motifs. Definition of (minLink = 1) SOAP As proposed in our previous work [5,25], such interactions can be modeled and captured by discovering different types of spatial object association patterns (SOAPs). Essentially, SOAPs characterize the specific way that objects, bit-patterns in this case, are interacting with each other at a given time. Among the proposed SOAP types, after a careful evaluation, we empirically select (minLink = 1) SOAPs to model the interacting bit-patterns in the folding process. Let p = (g1, g2, ⋯, gk) be a (minLink = 1) SOAP of size k, where gi is one of the 10 types of bit-patterns described above. In the context of folding trajectories, p prescribes that there exists k bit-patterns b1, b2, ..., bk in a conformation, where bi.label = gi (1 ≤ i ≤ k). Furthermore, for each bi, it interacts with at least one of the remaining (k - 1) bit-patterns. Note that the k labels in p are not mutually exclusive. For instance, one can have SOAPs such as (7 9 9), which involves one type 7 bit-pattern and two type 9 bit-patterns. We further restrict ourselves to SOAPs that occur frequently during the folding process (frequent SOAPs). However, we are not ruling out rarely-occurring SOAPS in our future studies. A SOAP is said to be frequent if it appears in no fewer than minSupp frames in a trajectory. In our studies, we set minSupp = 5 for BBA5 and 10 from GSGS. SOAP Episodes The next step is to capture the evolutionary nature of the folding process. We do this by identifying the evolutionary nature of SOAPs. As mentioned earlier, small proteins like BBA5 and GSGS often fold hierarchically and begin with local folded structures. As they fold, new SOAPs can be created and existing one can dissipate. To capture such evolutionary behavior, we proposed the concept of SOAP episodes, which provide an effective approach to model the evolution of interactions among spatial objects over time [5]. To reiterate, a SOAP episode E is defined as follows: E = (p, Fbeg, Fend), where p is a SOAP composed of one or more bit-patterns, p was created in frame Fbeg and persisted till frame Fend. Note that for a given p, it can be created more than once during protein folding, and thus can have more than one episode. To discover frequent (minLink = 1) SOAPs and their episodes in the trajectories of BBA5 and GSGS, we apply our SOAP mining algorithm as explained in our previous work [5]. In summary, this mining phase produces the following results: (i) A list of (minLink = 1) SOAPs of bit-patterns that appeared in at least 5 conformations in each folding trajectories for the protein BBA5 and 10 for GSGS; and (ii) A list of episodes, ordered by beginning frame Fbeg, associated with each of these SOAPs. 3.3 Folding Trajectory Analysis In this section, we describe our strategy on utilizing SOAPs to summarize a folding trajectory and address the two folding analysis issues described in Section 2.3. SOAP-based Trajectory Summarization The previous mining phase discovers a collection of frequent (minLink = 1) SOAPs and the associated episodes in each trajectory. Therefore, it identifies all the conformations in the trajectories that contain at least one frequent (minLink = 1) SOAPs. For instance, the last conformation in trajectory T23 (Figure 1(c)) has two SOAPs of size 2:(5 8) (i.e., association of a type 5 and a type 8 bit-pattern) and (7 8), and three SOAPs of size 1: (5), (7), and (8), while the last conformation in trajectory T24 has three SOAPs: (7 8), (7) and (8). This leads to our SOAP-based approach for folding trajectory summarization. To summarize a folding trajectory, we perform the following three steps. First, for each conformation, we identify all the frequent SOAPs that appear in it and use these SOAPs to represent this conformation. Note that not every conformation contains frequent SOAPs, especially when minSupp is set high. Second, for each SOAP-representable conformation, we carry out two tasks on its associated SOAPs. We next use the folding trajectories of BBA5 to explain how these two tasks are carried out. In the first task, for each SOAP, we mark the relative location of each involved bit-pattern in the primary sequence of BBA5. This is done by identifying the segment of BBA5 where the majority of a bit-pattern's α-carbons are located. The segment can be one of the following as described in Section 2.2: F1, residues 1 – 10; F2 , residues 11 – 23; F3, residues 6–17; and F4: residues 1–5 and 18–23. Let us again take the last conformation in T24 as an example. It can be summarized by three SOAPs: (7 8), (7) and (8). When we look at the list of α-carbons involved in these bit-patterns, we find out that 7 is mainly located in F2 and 8 in F1. Therefore, we mark the three SOAPs as follows: (8.1 7.2), (7.2) and (8.1). We re-arrange the bit-patterns in a SOAP by their relative locations in BBA5. This super-imposes BBA5-specific spatial information to a SOAP. In the second task, we prune away redundant SOAPs after marking each bit-pattern with its relative location in BBA5. A SOAP is redundant if it is embedded in another SOAP. For instance, in the previous example, we can prune away (8.1) and (7.2) as both are embedded in (7.2 8.1). After pruning, most conformations in such a small protein can often be represented by a single SOAP. We can even take this summarization a step further, where we replace a bit-pattern with its corresponding 3D motif, as illustrated in Figure 5. For instance, SOAP (7.2 8.1) will be transformed into (β.1 α.2). We refer to such SOAPs as generalized SOAPs, and the corresponding trajectory as a generalized trajectory. Note that in a generalized trajectory, multiple types of bit-patterns can be mapped into a single type of 3D motif. For instance, the α-motif corresponds to three types of bit-patterns 4, 7, and 9 (Figure 5). Figure 9 shows a segment in each summarized BBA5 folding trajectory before and after being generalized with 3D motifs. Figure 9 SOAP-based folding trajectory summarization. An sample segment in each of the two BBA5 folding trajectories is presented, (a) After superimposing the relative location of each bit-pattern and pruning away redundant SOAPs. (b) After further generalizing each bit-pattern by corresponding 3D motif. Detecting Folding Events and Recognizing Ordering Among Events Once each folding trajectory is summarized into generalized SOAPs, it is fairly straightforward to detect folding events such as the formation of α-helix or β-turn like local structures. This can be done by simply locating the frames that contain the local motif(s) of interest. We can also easily identify native-like conformations, by finding those that contain the generalized SOAP (β.1 α.2). Finally, based on the summarization, one can quickly identify the ordering of folding events in a trajectory. For instance, to check which secondary structure forms more rapidly, α-helix or β-hairpin, one can simply compare the first occurrence of these structures in the summarized trajectory (Figure 9(b)). Identifying the Consensus Partial Folding Pathway Across Trajectories To do this, we simply compute the longest common sub-sequence (LCS) [17] between two summarized trajectories. One can utilize the summarization either before the 3D motif generalization (Figure 9(a)) or after (Figure 9(b)). We use the latter in our analysis. Based on the LCS of generalized SOAPs, we construct the consensus folding pathway by identifying pairs of conformations, one from each trajectory, along the LCS of two summarized trajectories. In other words, the resulting consensus pathway consists of a sequence of conformation-pairs of similar 3D structures. Notice here that the comparison between 3D protein conformations (as described in Section 2.2) is done by using bit-patterns to model local structural motifs, and associations of bit-patterns (SOAPs) to characterize the global structure. This forms a hierarchical comparison and is in accordance with the hierarchical folding process of small proteins.