Real data inputs
The collection of orthologous intergenic regions, the division of species into clades, the multiple alignments, the phylogenetic trees, and the motif models needed as input to PhyloScan (or other similar algorithms) can be difficult to construct, and are specific to an individual's research interests and applications. We describe our approaches below. The flowchart in Figure 6 depicts a high-level view of the intergenic sequence database generation and the application of PhyloScan to these data. We believe that PhyloScan and similar algorithms (e.g., MONKEY) are fairly robust to typical levels of error in these inputs, though further exploration is required to substantiate this claim.
Figure 6  Data Processing Flow Chart for PhyloScan. An overview of the steps taken to locate Crp and PurR transcription factor binding sites in E. coli intergenic regions. The species examined were Escherichia coli (EC), Salmonella enterica serovar Typhi (S. typhi) (ST), Yersinia pestis (YP), Haemophilus influenzae (HI), Vibrio cholerae (VC), Shewanella oneidensis (SO), and Pseudomonas aeruginosa (PA).

Locating orthologous sequences
Genome sequence data and annotations were downloaded from the NCBI RefSeq database [42]: Escherichia coli K12 (NC_000913.1), Salmonella enterica serovar Typhi (S. typhi) (NC_003198), Yersinia pestis CO92 (NC_003143), Haemophilus influenzae Rd (NC_000907), Vibrio cholerae El Tor (NC_002505 and NC_002506), Shewanella oneidensis MR-1 (NC_004347 and NC_004349), and Pseudomonas aeruginosa PAO1 (NC_002516). Orthologs of each annotated E. coli gene were identified in each of the remaining six species using INPARANOID v.1.35 [43]. This program uses BLAST [44] to compare the complete set of predicted protein sequences from one genome with that of another and identifies the reciprocal best hits. We set the parameters to use the BLOSUM62 matrix and a minimum bit score of 30, and we required that the alignment cover at least 50% of both proteins.
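The reciprocal-best-hit criterion with the two cutoffs stated above can be illustrated with a minimal Python sketch. This is not INPARANOID's implementation (which performs additional clustering steps); the names Hit, best_hits, and reciprocal_best_hits are hypothetical, and the sketch assumes that each BLAST hit has already been parsed into a bit score and an aligned length that applies to both proteins.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed BLAST hit (hypothetical record layout)."""
    query: str
    subject: str
    bit_score: float
    aligned_len: int  # aligned residues, assumed equal on both proteins


def best_hits(hits, query_lengths, subject_lengths, min_bits=30.0, min_cov=0.5):
    """Return {query: subject} for the top-scoring hit that passes both cutoffs."""
    best = {}
    for h in hits:
        cov_query = h.aligned_len / query_lengths[h.query]
        cov_subject = h.aligned_len / subject_lengths[h.subject]
        # Discard hits below the bit-score cutoff or covering < 50% of either protein.
        if h.bit_score < min_bits or cov_query < min_cov or cov_subject < min_cov:
            continue
        if h.query not in best or h.bit_score > best[h.query].bit_score:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, lengths_a, lengths_b):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b, lengths_a, lengths_b)
    reverse = best_hits(b_vs_a, lengths_b, lengths_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}
```

In this sketch a pair is reported only if it survives the filters in both search directions, so a hit that covers half of a short protein but less than half of a long one is dropped.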
In the examples presented in this study, E. coli was the primary species of interest; we therefore defined a set of E. coli promoter-containing sequences by selecting each E. coli protein-coding gene (excluding 111 genes encoded on transposons or prophage elements) that has at least 20 bp of upstream intergenic sequence. By these criteria, there are 2379 E. coli intergenic regions of interest. Orthologous upstream intergenic-sequence data files were then generated for this set of 2379 E. coli regions, using the results from INPARANOID to identify orthologs and the seven genome annotations to define intergenic boundaries. A table of these data [see Additional file 2] and a caption for the table [see Additional file 1] are provided in the Supplementary Materials.
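The 20 bp selection rule can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: the Gene record and the upstream_intergenic function are hypothetical, and the sketch assumes a linear replicon with non-overlapping genes and does not model the exclusion of transposon or prophage genes.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Annotated gene (hypothetical record layout)."""
    name: str
    start: int   # 1-based, inclusive, leftmost coordinate
    end: int     # 1-based, inclusive, rightmost coordinate
    strand: str  # "+" or "-"


def upstream_intergenic(genes, min_len=20):
    """Map gene name -> (start, end) of its upstream intergenic interval.

    Only genes whose upstream interval is at least min_len bp are returned.
    Assumes non-overlapping genes on a linear sequence.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    regions = {}
    for i, g in enumerate(ordered):
        if g.strand == "+":
            if i == 0:
                continue  # no left neighbor to bound the region
            lo, hi = ordered[i - 1].end + 1, g.start - 1
        else:
            if i == len(ordered) - 1:
                continue  # no right neighbor to bound the region
            lo, hi = g.end + 1, ordered[i + 1].start - 1
        if hi - lo + 1 >= min_len:
            regions[g.name] = (lo, hi)
    return regions
```

Under these assumptions, two divergently transcribed neighbors (a "-" gene followed by a "+" gene) map to the same interval, so the number of distinct regions can be smaller than the number of qualifying genes.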

Designating clades
Among the species included in this study, only E. coli and S. typhi exhibit extensive homology (70% identity on average) in the promoter regions [26]. The phylogenetic distance of two sequences that share this level of homology is 0.384, assuming the nucleotide substitution model of Jukes & Cantor [45] (and the value would be similar under a variety of more current models); thus, we assumed this phylogenetic distance between E. coli and S. typhi, and data from these two species are taken to form one clade for PhyloScan. Each of the remaining species formed a separate clade of unaligned sequence data, since these species do not exhibit sequence identity with E. coli or with each other [26].
Generally, we would combine sequences into a single clade if their pairwise phylogenetic distances were comparable to, or smaller than, that between E. coli and S. typhi.
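As a worked check of the distance quoted above, the Jukes & Cantor correction converts a fraction p of differing sites into a distance d = -(3/4) ln(1 - (4/3) p). The short Python sketch below evaluates it; the function names and the use of 0.384 as a default clade cutoff are illustrative assumptions based on the text, not part of PhyloScan.

```python
import math


def jukes_cantor_distance(identity):
    """Jukes-Cantor distance (substitutions per site) from fractional identity."""
    p = 1.0 - identity  # observed fraction of differing sites
    if not 0.0 <= p < 0.75:
        raise ValueError("identity must be in (0.25, 1.0] for a finite distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def same_clade(distance, threshold=0.384):
    """Illustrative rule: group sequences whose distance does not exceed the threshold."""
    return distance <= threshold


d = jukes_cantor_distance(0.70)
print(round(d, 3))
```

For 70% identity this evaluates to roughly 0.38, in line with the value of 0.384 given above; the small difference would be expected if the average identity is not exactly 70%.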

Constructing multiple alignments
With only two closely related species in our set, we chose the Smith-Waterman [46] pairwise, gapped local alignment algorithm (implemented as BestFit in the Wisconsin Package Version 10.3, Accelrys Inc., San Diego, CA) to align their orthologous intergenic regions, using default parameters (match = 10.000; mismatch = -9.000; gap creation penalty = 50; gap extension penalty = 3). The alignment of E. coli and S. typhi orthologous upstream intergenic sequences resulted in 1662 unique aligned sequence pairs. The upstream intergenic sequences for an additional 836 E. coli genes that did not have orthologs in S. typhi remained. The combination of these two datasets (1662 + 836 = 2498) does not equal the above number of E. coli intergenic regions of interest (2379 sequences) because of divergently transcribed genes. Specifically, we observed that for some divergently transcribed genes in E. coli, the orthologous genes in S. typhi are not syntenic; thus, S. typhi provided two separate intergenic regions for alignment to a single intergenic region of E. coli.
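For readers unfamiliar with gapped local alignment, the following Python sketch computes a Smith-Waterman score with affine gap penalties, using the parameter values listed above as defaults. It is an illustration only and is not the BestFit implementation. In particular, it assumes that a gap of length k costs the creation penalty plus k times the extension penalty, and it returns only the best score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=10, mismatch=-9, gap_open=50, gap_extend=3):
    """Best local alignment score of a and b with affine gaps (score only).

    Assumed gap cost for a gap of length k: gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score ending at (i, j); E/F: best score ending in a gap in a/b.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open - gap_extend, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open - gap_extend, F[i - 1][j] - gap_extend)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With the default parameters a single-base gap costs 53 under this convention, more than five matches, so gaps appear only between well-conserved flanking segments.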
To perform the real-data tests, three databases representing the reference species clade were generated for scanning: (1) a database containing the 2379 E. coli intergenic regions of interest, (2) a database containing only E. coli data ("E. coli reduced"), where 1662 E. coli intergenic regions have been reduced in sequence space by alignment with S. typhi orthologous data plus an additional 836 E. coli sequences for which there was no orthologous S. typhi data, and (3) a database containing 1662 E. coli-S. typhi aligned orthologous intergenic regions plus an additional 836 E. coli sequences for which there was no orthologous S. typhi data.

Producing a phylogenetic tree
We constructed the phylogenetic tree for the more complicated, synthetic sequence data set using 16S rRNA gene data via MUSCLE [30] and PHYLIP [31], scaling tree branch lengths up by a factor of 13.5, as described above (see Synthetic Sequence Data in the Results section). A tree constructed in this manner is not definitive but should be sufficient for use with PhyloScan.

Obtaining binding site motif models
E. coli Crp and PurR binding sites that have been experimentally identified by DNase I footprinting were extracted from the literature and from two available databases, RegulonDB [47] and DPInteract [48]. The 87 Crp sites (from 65 E. coli intergenic regions) and 22 PurR sites (from 20 E. coli intergenic regions) were aligned using the Gibbs Recursive Sampler [49], specifying palindromic models (total width of 16–24 bp), to generate a PurR motif (Figure 7) and a Crp motif (Figure 1). These figures show both the nucleotide equilibrium and the information content for each position of the motif [9].
Figure 7  PurR Binding Site Motif. Shown is the PurR motif used to scan for PurR binding sites. The binding site equilibria were calculated from sequence data aligned by the Gibbs Recursive Sampler [49], and were plotted using publicly available software [27].
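To make the quantities plotted in the motif figures concrete, the sketch below derives per-position nucleotide frequencies and information content from a set of already aligned sites. It does not reproduce the Gibbs Recursive Sampler, which also finds the alignment; the function names are hypothetical, and the information content is computed relative to a uniform background (2 bits minus the positional entropy), which is an assumption and may differ from the definition in [9].

```python
import math
from collections import Counter


def position_frequencies(sites, pseudocount=0.0):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    total = len(sites) + 4 * pseudocount
    freqs = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        freqs.append({b: (counts.get(b, 0) + pseudocount) / total for b in "ACGT"})
    return freqs


def information_content(freqs):
    """Bits per position, assuming a uniform background: 2 + sum(f * log2 f)."""
    return [
        2.0 + sum(f * math.log2(f) for f in column.values() if f > 0)
        for column in freqs
    ]
```

A fully conserved position scores 2 bits and a position with all four nucleotides equally frequent scores 0 bits under this convention.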