Background
Motif discovery is the problem of finding approximately repeated patterns in unaligned sequence data. It is important in uncovering transcriptional networks, as short common subsequences in genomic data may correspond to a regulatory protein's binding sites, and in protein function identification, where short blocks of conserved amino acids code for important structural or functional elements.
The biological problems addressed by motif finding are complex and varied, and no single existing method solves them completely (see, e.g., [1,2]). For DNA sequences, motif finding is often applied to sets of sequences from a single genome that have been identified as possessing a common motif, through DNA microarray studies [3], ChIP-chip experiments [4], or protein binding microarrays [5]. An orthogonal approach, which attempts to identify regulatory sites among a set of orthologous genes across genomes of varying phylogenetic distance, is adopted by [6-10]. For protein sequences, and especially in the case of divergent sequence motifs, it is particularly useful to incorporate amino acid substitution matrices [11,12]. Often, motif finding methods are either tailor-made to a specific variant of the motif finding problem, or perform very differently when presented with a diverse set of instances.
Numerous approaches to motif finding have been suggested (e.g., [13-24], and those referenced in [1]). These methods differ mainly in the choice of the motif representation, the objective function used for assessing the quality of a motif, and the search procedure used for finding an optimal (or sub-optimal but reasonable) solution. Two broad categories of motif finding algorithms can be identified [25]: stochastic-search methods based on the position-specific scoring matrix (PSSM) representation and combinatorial approaches based on variants of the consensus sequence representation. Both categories come with their own sets of scoring functions (e.g., see [25,26]), and most variants of the motif finding problem are NP-hard, including those optimizing either the average information content score or the sum-of-pairs score [27]. The performance of these two broad groups of methods seems to be complementary in many cases, with a slight performance advantage demonstrated by representative methods of the combinatorial class (e.g., Weeder [24]), as reported in a recent comprehensive study [1]. However, many combinatorial methods enumerate every possible pattern, and are thus limited in the length of the motifs they can search for. While this may be less of an issue in eukaryotic genomes, where transcriptional regulation is mediated combinatorially by a large number of transcription factors with relatively short binding sites, substantially longer motifs are found when considering either DNA binding sites in prokaryotic genomes (e.g., for helix-turn-helix binding domains of transcription factors) or protein motifs [28,29].
Here, we introduce a combinatorial optimization framework for motif finding that is flexible enough to model several variants of the problem and is not limited by the motif length. Underlying our approach, we consider motif discovery as the problem of finding the best gapless local multiple sequence alignment using the sum-of-pairs (SP) scoring scheme. The SP-score is one of many reasonable schemes for assessing motif conservation [30,31]. In the case of motif search, where the goal is to use a set of known motif instances and uncover additional instances, the SP-score has been shown empirically to be comparable to PSSM-based methods [32]. Additionally, unlike the PSSM models, which typically assume independence of motif positions, the SP-score can address the problem of nucleotide or amino acid dependencies in a natural way. This is an important consideration; for example in the case of nucleotides, it has been shown that there are interdependent effects between bases [33,34]. Moreover, modeling these dependencies using the SP-score leads to improved performance in representing and searching for binding sites; a similar statistically significant improvement is not observed when extending PSSMs to incorporate pairwise dependencies [32]. The SP-score was most recently utilized in the context of motif finding in MaMF [13].
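To make the objective concrete, the following minimal sketch computes the SP-score of a gapless local alignment, that is, one window of fixed length per sequence. The function names, the toy sequences, and the match/mismatch scoring (1 for identical residues, 0 otherwise) are illustrative assumptions; for protein sequences a substitution matrix would take the place of the identity comparison.

```python
from itertools import combinations

def pair_score(a, b):
    """Score two equal-length windows: 1 per identical position, 0 otherwise.
    (Assumed toy scoring; a substitution matrix lookup could replace it.)"""
    return sum(x == y for x, y in zip(a, b))

def sp_score(seqs, starts, length):
    """Sum-of-pairs score of the gapless alignment that takes the window
    seqs[i][starts[i] : starts[i] + length] from every sequence."""
    windows = [s[p:p + length] for s, p in zip(seqs, starts)]
    return sum(pair_score(u, v) for u, v in combinations(windows, 2))

# Hypothetical toy input: each sequence contains one copy of "ACGT".
seqs = ["TTACGTAA", "GACGTCCC", "CCCACGTT"]
print(sp_score(seqs, [2, 1, 3], 4))  # windows are all "ACGT": 3 pairs x 4 matches = 12
print(sp_score(seqs, [0, 0, 0], 4))  # windows TTAC, GACG, CCCA share one match in total
```

Because the score is a sum over pairs of windows, pair-specific weights (for example, weights derived from a phylogenetic tree) could be introduced by multiplying each pairwise term by a weight, without changing the structure of the computation.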
In this paper, we use the SP-score and recast the motif finding problem as that of finding a maximum (or minimum) weight clique in a multi-partite graph, and introduce a two-pronged approach, based on graph pruning and mathematical programming, to solve it. In particular, we first formulate the problem as an integer linear program (ILP) and then consider its linear programming relaxation. In practice, the linear programs (LPs) arising from motif finding applications can be prohibitively large, numbering in the millions of variables. Thus, to reduce the size of the LPs, we develop a number of new pruning techniques, building upon the ideas of [35,36]. These fall into the broad category of dead-end elimination (DEE) algorithms (e.g., [37]), where sequence positions that are incompatible with the optimal alignment are discarded. In practice, such methods are very effective in reducing problem size; to handle the rare cases where the DEE techniques do not sufficiently prune the problem instance, we also develop a heuristic iterative scheme to eliminate sequence positions. The reduced linear programs are then solved by the ILOG CPLEX LP solver, and in cases where fractional solutions are found, an ILP solver is applied.
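The graph view can be illustrated with a small sketch. Each sequence contributes one part of the graph, with one node per candidate window; edges join nodes in different parts and carry the pairwise score; a clique with one node per part is a gapless alignment, and its weight is the SP-score. The code below solves a toy instance by exhaustive enumeration rather than by linear programming, and its pruning rule is a deliberately simple optimality-preserving bound in the spirit of dead-end elimination, not the specific criteria developed in this paper. All names and the 0/1 match scoring are assumptions made for illustration.

```python
from itertools import combinations, product

def build_parts(seqs, length):
    """One part per sequence; each node is (start position, window)."""
    return [[(p, s[p:p + length]) for p in range(len(s) - length + 1)]
            for s in seqs]

def weight(u, v):
    """Edge weight between two nodes: number of identical positions (assumed scoring)."""
    return sum(x == y for x, y in zip(u[1], v[1]))

def best_clique(parts):
    """Exhaustively choose one node per part to maximise the total edge weight.
    Exponential in the number of parts; only suitable for toy instances."""
    best_score, best_starts = -1, None
    for nodes in product(*parts):
        score = sum(weight(u, v) for u, v in combinations(nodes, 2))
        if score > best_score:
            best_score, best_starts = score, tuple(u[0] for u in nodes)
    return best_score, best_starts

def prune(parts, lower_bound):
    """Drop every node whose optimistic bound falls below lower_bound.
    lower_bound must be the score of some feasible alignment; then no node
    of an optimal clique is ever removed."""
    n = len(parts)
    # Best edge weight achievable between each pair of parts.
    best_pair = {(j, k): max(weight(u, v) for u in parts[j] for v in parts[k])
                 for j, k in combinations(range(n), 2)}
    pruned = []
    for i in range(n):
        # Optimistic contribution of all pairs of parts not involving part i.
        rest = sum(w for (j, k), w in best_pair.items() if i not in (j, k))
        pruned.append([
            u for u in parts[i]
            # Optimistic contribution of u's own edges, plus the rest.
            if rest + sum(max(weight(u, v) for v in parts[j])
                          for j in range(n) if j != i) >= lower_bound
        ])
    return pruned

# Hypothetical toy instance.
parts = build_parts(["TTACGTAA", "GACGTCCC", "CCCACGTT"], 4)
score, starts = best_clique(parts)
smaller = prune(parts, lower_bound=10)
print(score, starts, [len(p) for p in parts], [len(p) for p in smaller])
```

The bound is valid because any clique containing a node u scores at most the sum of u's best edge into every other part plus the best edge between every remaining pair of parts. If that optimistic total is below the score of a known alignment, u cannot belong to an optimal solution and can be discarded before any LP is built.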
Given a motif discovered by any method, it is important to be able to assess its statistical significance, as even optimal solutions for their respective objective functions may result in very poor motifs. We demonstrate how to test the statistical significance of the motifs discovered via the graph pruning/mathematical programming approach by using the background frequencies for each base or amino acid and computing the motif scores' probability distribution [38]. We then assess the number of motifs of the same or better quality that are expected to occur in the data at random. In the few cases where the heuristic DEE procedure is applied, we are able to give a lower bound on the significance value of the optimal solution; this allows us to evaluate how much better an alternate undiscovered motif might be.
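The flavor of this computation can be sketched under strong simplifying assumptions: residues are drawn independently from the background frequencies, matches score 1 and mismatches 0, and the number of sequences is small enough that all possible columns can be enumerated. The sketch below is not the procedure of [38]; it only illustrates how a per-column score distribution can be convolved across motif positions and turned into an expected count. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import prod

def column_distribution(n_seqs, freqs):
    """Exact distribution of the SP-score of one random alignment column."""
    dist = Counter()
    for col in product(freqs, repeat=n_seqs):
        p = prod(freqs[b] for b in col)
        score = sum(a == b for a, b in combinations(col, 2))
        dist[score] += p
    return dist

def convolve(d1, d2):
    """Distribution of the sum of two independent scores."""
    out = Counter()
    for s1, p1 in d1.items():
        for s2, p2 in d2.items():
            out[s1 + s2] += p1 * p2
    return out

def motif_score_distribution(n_seqs, length, freqs):
    """Distribution of the SP-score of a random motif with independent columns."""
    dist = Counter({0: 1.0})
    col = column_distribution(n_seqs, freqs)
    for _ in range(length):
        dist = convolve(dist, col)
    return dist

def expected_count(score, seq_lengths, length, freqs):
    """Expected number of random alignments scoring at least `score`:
    (number of candidate alignments) x (tail probability)."""
    dist = motif_score_distribution(len(seq_lengths), length, freqs)
    tail = sum(p for s, p in dist.items() if s >= score)
    n_alignments = prod(L - length + 1 for L in seq_lengths)
    return n_alignments * tail

# Hypothetical example: uniform background, three sequences of length 8, motif length 4.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_count(12, [8, 8, 8], 4, uniform))  # 125 alignments x (1/16)**4
```

An expected count far below 1 indicates that a motif of this quality is unlikely to arise by chance in data of this size, whereas a value near or above 1 suggests that even an optimal solution may be a poor motif.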
We test our coupled mathematical programming and pruning approach, LP/DEE, in diverse settings. First, we consider the problem of finding shared motifs in protein sequences. Unlike commonly-used PSSM-based methods for motif finding (e.g., [15,18]), our combinatorial formulation naturally incorporates amino acid substitution matrices. For all tested datasets, we find the actual protein motifs exactly, and these motifs correspond to optimal solutions according to the SP scoring scheme. Second, we consider sets of genes known to be regulated by the same E. coli transcription factor, and apply our approach to find the corresponding binding sites in genomic sequence data. We compare our results with those of three popular methods [18,22,39], and show that our method is often able to better locate the actual binding sites. Using the same dataset, we also embed E. coli binding sites within sequences of increasingly biased composition, and show that our scoring scheme and motif finding procedure are effective in this scenario as well. Third, we consider the phylogenetic footprinting problem [9], and find shared motifs upstream of orthologous genes. The difficulty of this problem is that the sequences may not have had enough evolutionary time to diverge and may share sequence-level similarity beyond the functionally important site; incorporation of additional information, in the form of weightings obtained from a phylogenetic tree relating the species, proves useful in this context. Finally, we demonstrate in the context of phylogenetic footprinting that our formulation can be used to find multiple solutions, corresponding to several distinct motifs. In all scenarios, we test the uncovered motifs for statistical significance. We show that our method works well in practice, typically recovering statistically significant motifs that correspond to either known motifs or other motifs of high conservation.
Interestingly, the vast majority of motif finding instances considered are not only effectively pruned by the optimality-preserving DEE methods, but also lead to linear programs whose optimal solutions are integral. These two conditions together guarantee optimality of the final solution for the original SP-based motif finding problem. This is notable, since the motif finding formulation is known to be NP-hard [27], yet our approach runs in polynomial time for many practical instances of the problem. Overall, the ability of our LP/DEE method to find optimal solutions to large problems demonstrates the power of the computational search procedures, and its performance in uncovering known motifs illustrates its utility for novel sequence motif discovery.