PMC:5610392 JSONTXT

    2_test

    {"project":"2_test","denotations":[{"id":"28632401-16845028-58538551","span":{"begin":1299,"end":1303},"obj":"16845028"},{"id":"28632401-18437229-58538552","span":{"begin":1327,"end":1331},"obj":"18437229"},{"id":"28632401-19740934-58538553","span":{"begin":1360,"end":1364},"obj":"19740934"},{"id":"28632401-19797408-58538554","span":{"begin":1393,"end":1397},"obj":"19797408"},{"id":"28632401-21183585-58538555","span":{"begin":1438,"end":1442},"obj":"21183585"},{"id":"28632401-21543442-58538556","span":{"begin":1460,"end":1464},"obj":"21543442"},{"id":"28632401-21486936-58538557","span":{"begin":1500,"end":1504},"obj":"21486936"},{"id":"28632401-22156162-58538558","span":{"begin":1549,"end":1553},"obj":"22156162"},{"id":"28632401-23748563-58538559","span":{"begin":1588,"end":1592},"obj":"23748563"},{"id":"28632401-24555784-58538560","span":{"begin":1835,"end":1839},"obj":"24555784"},{"id":"28632401-24555784-58538561","span":{"begin":2021,"end":2025},"obj":"24555784"},{"id":"28632401-26156781-58538562","span":{"begin":2223,"end":2227},"obj":"26156781"},{"id":"28632401-26156781-58538563","span":{"begin":2471,"end":2475},"obj":"26156781"},{"id":"28632401-26156781-58538564","span":{"begin":3045,"end":3049},"obj":"26156781"},{"id":"28632401-15173120-58538565","span":{"begin":5071,"end":5075},"obj":"15173120"},{"id":"28632401-12073331-58538566","span":{"begin":5249,"end":5253},"obj":"12073331"},{"id":"28632401-26156781-58538567","span":{"begin":7603,"end":7607},"obj":"26156781"},{"id":"28632401-26531826-58538568","span":{"begin":7950,"end":7954},"obj":"26531826"},{"id":"28632401-12520026-58538569","span":{"begin":7994,"end":7998},"obj":"12520026"},{"id":"28632401-18842628-58538570","span":{"begin":8036,"end":8040},"obj":"18842628"},{"id":"28632401-26156781-58538571","span":{"begin":8145,"end":8149},"obj":"26156781"},{"id":"28632401-26156781-58538572","span":{"begin":8418,"end":8422},"obj":"26156781"},{"id":"28632401-26156781-58538573","span":{"begin":8867,"end":8871},"
obj":"26156781"},{"id":"28632401-26156781-58538574","span":{"begin":8952,"end":8956},"obj":"26156781"},{"id":"28632401-26156781-58538575","span":{"begin":9806,"end":9810},"obj":"26156781"},{"id":"28632401-12073331-58538576","span":{"begin":9974,"end":9978},"obj":"12073331"},{"id":"28632401-26156781-58538577","span":{"begin":11556,"end":11560},"obj":"26156781"},{"id":"28632401-24555784-58538578","span":{"begin":12814,"end":12818},"obj":"24555784"},{"id":"28632401-24555784-58538579","span":{"begin":13709,"end":13713},"obj":"24555784"},{"id":"28632401-24555784-58538580","span":{"begin":13822,"end":13826},"obj":"24555784"},{"id":"28632401-26156781-58538581","span":{"begin":17101,"end":17105},"obj":"26156781"}],"text":"MOTIFSIM 2.1: An Enhanced Software Platform for Detecting Similarity in Multiple DNA Motif Data Sets \n\nAbstract\nAbstract\nFinding binding site motifs plays an important role in bioinformatics as it reveals the transcription factors that control the gene expression. The development for motif finders has flourished in the past years with many tools have been introduced to the research community. Although these tools possess exceptional features for detecting motifs, they report different results for an identical data set. Hence, using multiple tools is recommended because motifs reported by several tools are likely biologically significant. However, the results from multiple tools need to be compared for obtaining common significant motifs. MOTIFSIM web tool and command-line tool were developed for this purpose. In this work, we present several technical improvements as well as additional features to further support the motif analysis in our new release MOTIFSIM 2.1. \n \n\n1. Introduction\nMotifs are often short sequences of a similar pattern found in sequences of DNA or protein. Binding site motifs play an important role in revealing the transcription factor that controls the gene expression. 
Many motif finding tools have been developed in the past years such as MEME (Bailey et al., 2006), GLAM2 (Frith et al., 2008), CisFinder (Sharov and Ko, 2009), W-ChIPMotifs (Jin et al., 2009), CompleteMOTIFs (Kuttippurathu et al., 2011), DREME (Bailey, 2011), MEME-ChIP (Machanick and Bailey, 2011), RSAT peak-motifs (Thomas-Chollier et al., 2012), and PScanChIP (Zambelli et al., 2013) among many others. Each tool possesses its unique features for discovering motifs that are undetectable by others. Previous study showed that the results produced by different motif finders for the same data set are diverse (Tran and Huang, 2014). Therefore, using multiple tools for finding motifs is suggested because motifs reported by several different tools are more likely to be biologically significant (Tran and Huang, 2014). However, the results from multiple tools for the same data set require comparing against each other for finding common motifs and those generated by some tools but not by others (Tran and Huang, 2015).\nPrevious study showed the difficulty of comparing multiple motif data sets, and hence it motivated to develop MOTIFSIM (MOTIF SIMilarity Detection Tool) for automatically detecting similarity in multiple DNA motif data sets (Tran and Huang, 2015). The initial releases of MOTIFSIM provided researchers with a command-line tool for comparing motifs locally and a user-friendly web tool for comparing motifs on-line. The web tool provides convenience for users to save the data sets and experimental results on-line for retrieval. MOTIFSIM web tool and command-line tool accept various input formats and produce multiple results for further analysis. The results include the global significant motifs, the global and local significant motifs, as well as best matches for each motif in every data set (Tran and Huang, 2015). The new version MOTIFSIM 2.1 further supports users with several technical improvements as well as additional features.\n\n2. 
Technical Improvements\nWe present numerous technical improvements for the web tool and the command-line tool as follows.\n• Automatically recognize motif input formats. The new version can automatically detect motif's format. In addition, motifs in different formats can be mixed and matched in the same input file and the tools can automatically recognize their formats. • Insert motif input on the browser. In addition to upload and use existing files, the new version allows inserting as many as 20 motif files on the browser for running the web tool. • Increase number of motif data sets for comparison. The initial release of the web tool allowed comparing up to 10 motif data sets simultaneously. The new version allows comparing up to 20 motif data sets concurrently. • Option for number of top significant motifs, output file type, and output file format. MOTIFSIM 2.1 provides more flexibility for users to select the input and output parameters. We added an option for number of top significant motifs. This is a cutoff for the number of top significant motifs to be generated in the results for the global significant motifs as well as the global and local significant motifs. This option also allows users to select as many as 50 top significant motifs. In addition, users can select the output file type and output file format for the results. The output file type option allows selecting the global significant motifs or selects everything otherwise. The output file format option allows selecting a desire output format. • HTML and PDF formats with sequence logos. We added HTML and PDF format options for generating the results. The conversion of HTML to PDF is supported by Prince software package (Prince, 2002). We also added sequence logos for each motif and its reverse complement for these formats. The sequence logos are created by WebLogo software package (Crooks et al., 2004). • Combined motifs list. 
We added a combined motifs list containing motifs from all data sets in the results. The motifs are in position-specific probability matrices (Li, 2002) and they are in the order of the data sets entered by the user. • Additional global significant motifs list. We include an additional global significant motifs list in the results for further analysis. The list can be generated in HTML, PDF, Text, or in all three formats. • Consensus sequences and motif alignment in IUPAC format. We include the consensus sequences for each motif and its reverse complement in the results. The motif alignment in IUPAC format is also added in the results for better observation. • Job submission history. We added a job submission history to the web tool for users to view and access their submitted jobs. Private jobs can only be accessed by the job's owner. Public jobs are accessible to everyone. • Job search. Unregistered users can keep submitted jobs and the results private. The results can be retrieved through the Search Job ID page. • Email notification. Registered users of the web tool now receive an email notification when a submitted job is completed and available for download and viewing. • Other improvement. We added the Input and Results sections to the result page of the web tool. Users can view the combined motifs and the results when a job is completed without leaving the page. Supplementary Figures S1–S4 in the Supplementary Materials demonstrate the improvements already described.\n\n3. Additional Features\nIn addition to the technical improvements already described, MOTIFSIM 2.1 provides additional features for further analyzing similar motifs. The global significant motifs as well as every motif in the combined list can be compared with motifs in a database for obtaining similar motifs. In addition, it is often desired to combine similar motifs into new motifs to reduce the number of redundant motifs. 
MOTIFSIM 2.1 provides such option for combining similar motifs reported for the global significant motifs, the global and local significant motifs, as well as the best matches for each motif. Besides, users can observe the relationship between motifs through the phylogenetic trees. These features are described in the following section.\n\n3.1. Matching motifs with motif database\nTo match the global significant motifs as well as every motif in the combined list with motifs in a database, we implemented a slightly modified version of our novel algorithm (Tran and Huang, 2015). Instead of comparing motifs with each other in the combined list as in the original algorithm, we compare the global significant motifs and every motif in the combined list with each motif in a database using the same technique as described in the original algorithm. Currently, MOTIFSIM 2.1 supports Jaspar version 2016 (Mathelier et al., 2016), Transfac free version (Matys et al., 2003), and UniPROBE (Newburger and Bulyk, 2009) databases.\n\n3.2. Merge similar motifs\nTo merge similar motifs reported in the results (Tran and Huang, 2015), we merge the motif and its best matches iteratively into new motifs in a pair-wise manner. First, the motif and its most similar motif in the best matches list are merged into the new motif from their best alignment calculated by a similarity score (Tran and Huang, 2015). To merge two motifs from their best alignment, we take the average of the overlapping portion between them and carry over the hanging portions from the left, right, or both sides from the alignment into the new motif. Figure 1 illustrates this process. To ensure the new motif is still within the similarity threshold with its parents, we compare the new motif back with each of its parent by using the similarity percentage (Tran and Huang, 2015). If one of the similarity percentages is out of the threshold (Tran and Huang, 2015), the process stops. 
Otherwise, the new motif is then merged with the next similar motif in the best matches list. This process goes on until the list is exhausted or the similarity percentage falls outside the threshold. Figure 2 shows an example of merging motif GTCGCG and its five best matches from highest to lowest. The process starts by merging motif GTCGCG with its first best match from their best alignment. The merged motif GBCGCGCGGC is subsequently merged with the next best match in the list. The process goes on until the list is exhausted and it results in the final merged motif SSGCGCSGCGGCSS. All merged motifs fall within the similarity percentage with their parents.\nFIG. 1. Pair-wise merging of two similar motifs. (A) Alignment of two similar motifs CCGCCGCC and SSSCGSSGCSSS by using similarity percentage (Tran and Huang, 2015). The merged motif is SSCCGCSGCCSS. Motifs are in IUPAC format. (B) Details for merging two motifs in (A). Motifs are in position-specific probability matrix (Li, 2002). Motif CCGCCGCC (left) aligns with motif SSSCGSSGCSSS (middle). Merged motif SSCCGCSGCCSS is in the right. The rectangle box shows the overlapping portion between two motifs. The average of corresponding elements between two motifs in rectangle box is equivalent to bold element in the merge motif SSCCGCSGCCSS. The elements which are not in bold in the merged motifs are carried over from motif SSSCGSSGCSSS. They are in two rows on the top and in two rows at the bottom of the merged motif SSCCGCSGCCSS. (C) Motif logos for the alignment and merged motif in (A).\nFIG. 2. Merging of a motif and its best matches. Motifs are in IUPAC format. (A) Motif GTCGCG and its five best matches from highest to lowest. (B) Pair-wise merging of motif GTCGCG and its best matches. Merging starts with motif GTCGCG and its first best match CGGCYBCGCG. The merged motif GBCGCGCGGC is subsequently merged with the second best match in the list. 
The process goes on until the list is exhausted and it results in the final merged motif SSGCGCSGCGGCSS. All merged motifs lie within the similarity percentage with their parents. Pair-wise matching details are also included.\n\n3.3. Phylogenetic trees\nMOTIFSIM 2.1 provides an option for generating the phylogenetic tree for observing the relationship between motifs. The phylogenetic tree is built by using hclust function in R (R Core Team, 2016). This function implements the hierarchical clustering algorithm. The distance matrix, which is used to feed into hclust for building the tree, contains the best similarity scores (Tran and Huang, 2015) between motifs. To generate the phylogenetic tree for all motifs in the combined list, MOTIFSIM 2.1 builds the distance matrix containing the best similarity scores between motifs and then feeds it into hclust for generating the tree. The phylogenetic tree for the global significant motifs and their best matches is generated by using a subset of this distance matrix, which contains only the best similarity scores between the global significant motifs and their best matches.\n\n3.4. Using MOTISIM 2.1\nMOTIFSIM 2.1 web tool and command-line tool were designed for simple use. Detailed examples for running both tools can be found in the Supplementary Materials. Further instructions can be found in the user manual on the tool's website.\n\n3.5. Case studies\nWe present the application of MOTIFSIM 2.1 in two case studies in the following sections. The data sets used in the case studies were produced by several motif finders, including CisFinder, DREME, MEME-ChIP, PScanChIP, and RSAT peak motifs for the same peak data. The peak data sets were generated from ChIP-Seq data sets produced by ChIP-Seq experiments on mouse liver tissue for two marks, H3 lysine 27 acetylation (H3K27ac) and histone H3 lysine 4 monomethylation (H3K4me1) (Tran and Huang, 2014). 
The ChIP-Seq data sets are given in Table 1 and the motif data sets are in Table 2. Since different motif finders implement different algorithms and possess unique features for detecting motifs, the results reported by them vary for the same ChIP-Seq data set. In particular, four motif finders given in Table 2 report different number of motifs for the same ChIP-Seq data set DM01. These numbers differ significantly from one tool to another. Thus, it is useful to identify common motifs reported by these tools because these motifs are more significant. MOTIFSIM can identify such motifs and reported them as the global significant motifs. It can also identify the global and local significant motifs as well as best matches for every motif in each data set.\nTable 1. ChIP-Seq Data Sets The data sets were generated from ChIP-Seq experiments on mouse liver tissue (Tran and Huang, 2014).\nTable 2. Motif Data Sets Used in Case Studies The data sets came from experiments in Tran and Huang (2014). Case study 1 presents the use of MOTIFSIM 2.1 for identifying similar motifs in a single data set, whereas Case study 2 identifies similar motifs in multiple data sets. In both case studies, we used the same input parameters for comparing motifs. These parameters include top 10 significant motifs, 5 best matches for each motif, and 75% or greater for similarity cutoff. The option All was used for selecting both output file type and output file format. In addition, the results were further compared with motifs in the UniPROBE database for mouse. We also generated the phylogenetic trees to observe the relationship between motifs as well as combined similar motifs reported in the results.\n\n3.5.1. Case study 1: ChIP-Seq data set DM721 for H3K27ac (H3 lysine 27 acetylation)\nIn this case study, we identified similar motifs in a single data set produced by CisFinder given in Table 2. This tool reported 153 cluster motifs. 
We ran MOTIFSIM 2.1 on this data set using the input parameters already described. Table 3 shows the top 10 significant motifs reported by the tool. The five best matches for the first and fifth significant motifs are given in Table 4. These best matches are not only similar to their top significant motif but they are also similar to each other. In particular, motif C125 and motif C070 share the same motif Sox7_secondary in UniPROBE database for mouse as the first best match for motif C125 and the second best match for motif C017. Likewise, motif C021 and motif C070 share the same motif Sox12_secondary in UniPROBE database for mouse as the first best match for motif C012 and as the second best match for motif C070. In addition, motif C053 and motif C071 share an identical motif Gli1_v016060_primary in UniPROBE database for mouse as the first best match for motif C053 and the third best match for motif C071. Thus, by analyzing these similar motifs, it is useful for determining whether they are redundant motifs. MOTIFSIM 2.1 also provides the option for combining similar motifs. Tables 5 and 6 show the merging for motif C108 and its five best matches as well as for motif C023 and its best matches, respectively. The detailed merging results can be found in the user manual on the tool's website. We further matched each motif in the data set with motifs in UniPROBE database for mouse. Table 7 shows the first best match in the database for each top 10 significant motifs. The detailed matching results for each motif with the database can be found in the user manual on the tool's website. To observe the relationship between motifs, we generated a phylogenetic tree shown in Figure 3 for all motifs in the data set. In this figure, the most similar pair of motifs by similarity score is placed in one cluster. The cluster is joined with the next similar motif. Similar clusters are joined until they form a complete phylogenetic tree. 
The motif is labeled by concatenating its ID with its name for easy differentiation, as the same motif may appear multiple times in the combined list because it is reported by multiple motif finders.\nFIG. 3. A phylogenetic tree for all cluster motifs in the data set. The tree was created by using a distance matrix consisting of best similarity scores between motifs (Tran and Huang, 2015). Each tree label concatenates the motif ID with the motif name.\nTable 3. Top 10 Global and Local Significant Motifs in Case Study 1. Motifs are listed by ID, name, and logo.\nTable 4. Five Best Matches for the First and Fifth Significant Motifs in Table 3. The best-matched motifs are listed in order of similarity from highest to lowest.\nTable 5. Merging Motif C108 and Its Five Best Matches in Table 4. Merging begins with motif C108 and its first best match, C125. The combined motif is subsequently merged with the second best match, C021. The process stops with the final merged motif VDSTSTSTBTSTCBSTVTSW. All merged motifs fall within the similarity threshold with their parents. The pair-wise alignment of motifs, matching format of each motif, matching direction, matching position, and number of overlaps are included.\nTable 6. Merging Motif C023 and Its Five Best Matches in Table 4. Merging starts with motif C023 and ends with the merged motif GSCBCVSSCCSSCCCCCCCCSSCCCCSSSC. All merged motifs fall within the similarity threshold with their parents. Pair-wise matching information is included.\nTable 7. Matching Top 10 Significant Motifs with Motifs in UniPROBE Database for Mouse. The first best match in the database for each significant motif is shown, with motif ID, motif name, motif logo, and motif format.\n\n3.5.2. 
Case study 2: ChIP-Seq data set DM01 for H3K4me1 (histone H3 lysine 4 monomethylation)\nThis case study demonstrates the use of MOTIFSIM 2.1 for identifying similar motifs in multiple data sets generated by four different tools (DREME, MEME-ChIP, PScanChIP, and RSAT peak-motifs) for the same ChIP-Seq data set DM01 given in Table 2. These motif finders report different numbers of motifs. Thus, it is useful to identify the common motifs reported by them. MOTIFSIM 2.1 identifies these common motifs as the global significant motifs. It also identifies the global and local significant motifs, as well as the best matches for each motif in the combined motif list. The top 10 global significant motifs are given in Supplementary Table S1. Each global significant motif and its best matches were reported by at least two motif finders. The top 10 global and local significant motifs are given in Supplementary Table S2. In this table, the ninth global and local significant motif, Motif 25, and its five best matches were reported by all four tools. However, the fifth global and local significant motif, ssCkGGYCCCsg, and its five best matches were reported by only one tool, RSAT peak-motifs. The motif ssCkGGYCCCsg and its best matches are given in Supplementary Table S3. This observation allows users to determine whether these similar motifs are redundant. The analysis can be carried out further for any motif and its best matches.\nSimilar motifs reported in the results for the global significant motifs, the global and local significant motifs, as well as for each motif in this case study were combined into new motifs. The detailed merging results can be found in the user manual on the tool's website. In addition, we compared the global significant motifs, the global and local significant motifs, and each motif in the combined motif list with motifs in the UniPROBE database for mouse to obtain similar motifs. 
Supplementary Tables S4 and S5 show the first best match in the database for each global significant motif and for each global and local significant motif, respectively. The detailed matching results with the database can be found in the user manual on the tool's website. In addition, the relationships among the global significant motifs and their best matches, as well as among all motifs in the combined list, can be observed through the phylogenetic trees in Supplementary Figures S11 and S12.\n\n4. Conclusion\nThe MOTIFSIM 2.1 web tool and command-line tool contain several technical improvements as well as additional features to further support motif analysis. The new version allows combining similar motifs. It also supports comparing the global significant motifs, as well as every motif, with motifs in a database. In addition, the relationship between motifs can be observed through phylogenetic trees. The MOTIFSIM 2.1 web tool and command-line tool, including user manuals, test data sets, and test results, are freely available at http://motifsim.org\n\nSupplementary Material\nSupplemental data "}
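
The pair-wise merge described in Section 3.2 of the article text above (average the overlapping portion of the two aligned motifs, carry over the hanging portions on either side) can be sketched as follows. This is a minimal illustration under stated assumptions, not the MOTIFSIM implementation: the function name `merge_aligned` and the `offset` parameter are hypothetical, the column order A, C, G, T is assumed, and the best alignment is taken as given rather than computed from a similarity score.

```python
from typing import List

# One matrix column: probabilities for A, C, G, T (column order is an assumption).
Column = List[float]


def merge_aligned(motif_a: List[Column], motif_b: List[Column], offset: int) -> List[Column]:
    """Merge two position-specific probability matrices at a given alignment.

    `offset` (hypothetical parameter) is the position of motif_a's first
    column relative to motif_b's first column; it may be negative.
    Overlapping columns are averaged; hanging columns are copied unchanged.
    """
    if offset >= len(motif_b) or offset + len(motif_a) <= 0:
        raise ValueError("motifs do not overlap at this offset")

    start = min(0, offset)
    end = max(len(motif_b), offset + len(motif_a))

    merged: List[Column] = []
    for pos in range(start, end):
        idx_a = pos - offset
        in_a = 0 <= idx_a < len(motif_a)
        in_b = 0 <= pos < len(motif_b)
        if in_a and in_b:
            # Overlapping portion: element-wise average of the two columns.
            merged.append([(x + y) / 2 for x, y in zip(motif_a[idx_a], motif_b[pos])])
        elif in_a:
            # Hanging portion contributed by motif_a only.
            merged.append(list(motif_a[idx_a]))
        else:
            # Hanging portion contributed by motif_b only.
            merged.append(list(motif_b[pos]))
    return merged


if __name__ == "__main__":
    a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    b = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    for column in merge_aligned(a, b, offset=1):
        print(column)
```

In the article's procedure the merged motif is then compared back against each parent, and merging stops once a similarity percentage falls outside the threshold; that check is omitted here because the similarity measure is defined in the cited work rather than in this text.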