Clustering Bit-patterns into Approximately Equivalent Groups In this step, we partition the extracted bit-patterns into approximately equivalent groups, each of which consists of bit-patterns that exhibit similar geometric properties (e.g., shape and size). To construct such equivalent groups, we run the k-means based clustering algorithm [22] over the bit-patterns' corresponding feature vectors, where k is the number of clusters (or equivalent groups) that will be produced. To determine an optimal value of k, we take the following three steps. First, we run the clustering algorithm on different k values. This produces different clustering schemes for the same set of bit-patterns. Second, for each clustering scheme, we compute its entropy. Let c1, ..., cl be the l clusters after clustering the set of N bit-patterns. Furthermore, each cluster ci (1 ≤ i ≤ l) has an individual entropy Hi and contains Ni elements, then the total entropy of this clustering is given by the following formula: H=∑i=1kHi∗(Ni/N) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGibascqGH9aqpdaaeWaqaaiabdIeainaaBaaaleaacqWGPbqAaeqaaOGaey4fIOIaeiikaGIaemOta40aaSbaaSqaaiabdMgaPbqabaGccqGGVaWlcqWGobGtcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHris5aaaa@3F89@ The entropy of each individual cluster, i.e., Hi , is computed by summing up the entropy of each of the six bit-pattern attributes such as its height and width. For an attribute, we compute its entropy in a cluster according to the procedure explained by Shannon [23]. In the third and final step, we plot the entropy against the number of clusters, i.e., k, and choose a value k where the entropy plot begins to show a linear trend. For the BBA5 folding data, this clustering step groups the 352 bit-patterns into 10 clusters (or types). As for the GSGS data, 12 clusters are identified. Intuitively, the 3D motifs of the bit-patterns in a cluster will also have similar 3D geometric properties. This is verified based on our manual analysis on the BBA5 trajectories. Figure 5 illustrates the representative 3D motifs.corresponding to the 9 of 10 types of bit-patterns identified in BBA5 trajectories. We omit type 0, as bit-patterns of this type, unlike the others, correspond to a wide variety of 3D motifs. Figure 5 Mapping between 2D bit-patterns and 3D sub-structures. The figure visualizes the representative 3D sub-structures corresponding to the 10 classes of bit-patterns identified in the contact maps along BBA5's two folding trajectories. The bit-patterns shown here are randomly selected from their respective group for illustration purpose. We also observed a similar scenario for the 12 types of bit-patterns identified in the GSGS trajectories. For instance, the typical 3D motifs of type 0 bit-patterns resemble the native conformation of GSGS (See Figure 2(a)); whereas those of type 6 identify with α-helices. Upon a closer look at this 2D-3D mapping illustrated in Figure 5, one can observe the following interesting aspects. First, multiple types of bit-patterns can be associated with a single type of 3D motif. For instance, there are 3 types of bit-patterns are mapped to an α-helical motif. Second, contrary to a commonly accepted belief that β-turns or β-sheets cannot be captured by maximally connected bit-patterns as defined earlier, our analysis shows that this belief does not stand. To illustrate this point, we take two examples. The first example, illustrated in Figure 6, corresponds to the β-turn structure. As shown in Figure 6(b), the β-turn formed by the first 10 Cα atoms of BBA5 can be captured by the maximally connected bit-pattern shown in Figure 6(a). The second example, shown in Figure 7, illustrates that a two turn β-sheet (Figure 7(b)) can also be captured by a bit-pattern (Figure 7(a)). Finally, not every type of bit-patterns can be mapped to a typical 3D motif. This might be attributed to our entropy-based criteria for selecting an "optimal" value of the parameter k in the clustering task. Figure 6 β-turns vs. maximally connected bit-patterns: an example. (a) A type 8 bit-pattern is identified in the 166th frame of the BBA5 T23 trajectory. This bit-pattern corresponds to the the connected 1s in the table, where a '1' indicates two corresponding Cα atoms are in contact,'-' otherwise. This pattern consists of the first 10 Cα atoms. (b) The 3D conformation of this frame, where the first 10 Cα atoms resembles a β-turn. Figure 7 β-sheets vs. maximally connected bit-patterns: an example. (a) A type 0 bit-pattern is identified in the 24201th frame of the GSGS T1 trajectory. This bit-pattern corresponds to the the connected 'x'-es in the table, where an 'x' indicates two corresponding Cα atoms are in contact,'-' otherwise. It consists of Cα atoms from 5 through 20. (b) The 3D conformation of this frame, where the 5–20 Cα atoms resembles a β-sheet of two turns. This demonstrates, to a certain extent, the advantage of using 2D contact maps to analyze 3D protein conformations. Undoubtedly, using contact maps greatly reduces the computational complexity, though at the cost of loss in structural information. However, some of this information loss is re-compensated by mapping bit-patterns to structural motifs in 3D conformations. More importantly, by exploiting different features in contact maps (bit-patterns in this work), we are able to connect 2D features with features in 3D space. In the BBA5 case, by identifying 10 types of bit-patterns in contact maps, we indirectly recognize 10 different 3D structural motifs in the folding conformations.