Coding sequence (CDS) prediction and gene identification

ORFs likely to encode proteins (CDSs) were identified by GLIMMER [32]. Identified CDSs were annotated by manual curation of the outputs of a variety of similarity searches. Searches of the predicted coding regions were performed with BLASTP, as previously described [33]. The protein-protein matches were aligned with blast_extend_repraze, a modified Smith-Waterman [34] algorithm that maximally extends regions of similarity across frameshifts. Gene identification was facilitated by searching against a database of nonredundant bacterial proteins (nraa) developed at TIGR and curated from the public archives GenBank, Genpept, PIR, and SwissProt. For each search that matched an entry in nraa, the corresponding role, gene common name, percent identity and similarity of the match, pairwise sequence alignment, and taxonomy were assigned to the predicted coding region and stored in the database. CDSs were also analyzed with two sets of Hidden Markov Models (HMMs) constructed for a number of conserved protein families from PFAM [35] and TIGRFAM [36]. Regions of the genome without CDSs and CDSs without a database match were reevaluated by using BLASTX as the initial search, and CDSs were extrapolated from regions of alignment. Finally, each putatively identified gene was assigned to one of 113 role categories adapted from Riley [37].
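
To illustrate the alignment step, the following is a minimal sketch of the unmodified Smith-Waterman local alignment on which blast_extend_repraze is based, together with a percent-identity calculation of the kind stored with each match. It is not the blast_extend_repraze implementation and does not extend across frameshifts. The function names, the match/mismatch scores, and the linear gap penalty are assumptions chosen for illustration; a real protein search would use a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Plain Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not those used in the original pipeline.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding identical residues."""
    if not aligned_a:
        return 0.0
    same = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical toy sequences, used only to exercise the functions.
    score, x, y = smith_waterman("MKTAYIAK", "XXKTAYIXX")
    print(score, x, y, percent_identity(x, y))
```

Because every cell is floored at zero, the alignment can start and end anywhere in either sequence, which is what makes the method local rather than global.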