Methods

Genome sequencing and data collation
We determined the genome sequence of Staphylococcus simiae type strain CCM 7213T (= LMG 22723T), isolated from the faeces of a South American squirrel monkey [11]. Roche/454 pyrosequencing, involving a single full run of the GS-20 sequencer, was used to determine the sequence of the Staphylococcus simiae genome. The sequences were assembled (De novo assembly with Newbler Software) into 565 contigs. Genome annotation for the strain was done by the NCBI Prokaryotic Genomes Automatic Annotation Pipeline. The S. simiae whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AEUN00000000. The version described in this paper is the first version, AEUN01000000. For comparative analysis genome sequences of bacteria in GenBank format [12] were retrieved from the National Center for Biotechnology Information (NCBI) site ftp://ftp.ncbi.nlm.nih.gov/. We analyzed sequences of 28 Staphylococcus strains belonging to 12 different species, and an outgroup Macrococcus caseolyticus JCSCS5402 [13] (Table 1 and Additional file 1, Table S1). The 16 Staphylococcus aureus strains included COL [14], ED133 [15], ED98 [16], JH1, JH9, MRSA252 [17], MSSA476 [17], Mu3 [18], Mu50 [19], MW2 [20], N315 [19], NCTC_8325, Newman [21], RF122/ET3-1 [22], USA300_FPR3757 [23], and USA300_TCH1516 [24]. The remaining 12 Staphylococcus strains included Staphylococcus capitis SK14 [25], Staphylococcus caprae C87, Staphylococcus carnosus TM300 [26], Staphylococcus epidermidis ATCC 12228 [27], Staphylococcus epidermidis RP62a [14], Staphylococcus haemolyticus JCSC1435 [28], Staphylococcus hominis SK119, Staphylococcus lugdunensis HKU09-01 [29], Staphylococcus pseudintermedius HKU10-03 [30], Staphylococcus saprophyticus ATCC_15305 [31], Staphylococcus simiae [11], and Staphylococcus warneri L37603. Genome sequence analyses were implemented using Bioperl version 1.6.1 [32] and G-language Genome Analysis Environment version 1.8.12 [33-35]. Statistical tests and graphics were implemented using R, version 2.11.1 [36].
Table 1  Genomic features of Macrococcus caseolyticus and 28 Staphylococcus strains.  %G + C = 100 × (G + C)/(A + T + G + C).
S = Selected codon usage bias.
No.CDS = Number of protein-coding sequences.
No.MCL = Number of protein families built by BLAST and Markov clustering.

Gene content analysis
Protein-coding sequences were retrieved from chromosomes and plasmids of the 29 strains of bacteria (Table 1 and Additional file 1 Table S1). A group of homologous proteins (protein family) was built by all-against-all protein sequence comparison of the 29 strains' proteomes using BLASTP [37], followed by Markov clustering (MCL) with an inflation factor of 1.2 [38]. Homologous proteins were identified by BLASTP on the criteria of an E-value cutoff of 1e-5, and minimum aligned sequence length coverage of 50% of a query sequence. This approach yielded 5014 protein families containing 74122 individual proteins from the 29 strains (see Additional file 1, Table S2). We assigned functions to each protein family by using multiple databases: the Clusters of Orthologous Groups (COG) [39,40], JCVI [41], KEGG [42], SEED [43], Virulence Factors Database (VFDB) [44], MvirDB [45], Pfam [46], and Gene Ontology (GO) [47] database. We searched protein sequences against the Pfam library of hidden Markov models (HMMs) using HMMER http://hmmer.janelia.org/, and converted Pfam accession numbers to GO terms using the 'pfam2go' mapping http://www.geneontology.org/external2go/pfam2go. We performed TBLASTN searches (on the criteria of an E-value cutoff of 1e-5, and minimum aligned sequence length coverage of 50% of a query sequence) of each of the 29 strains' proteomes against whole nucleotide sequences of all the other strains to avoid artefacts caused by differences in protein-coding sequence prediction [8,48]. The resulting gene content (binary data for presence or absence of each protein family) is shown in Additional file 1, Table S2.
Hierarchical clustering (UPGMA) of the 29 strains was performed using a distance between two genomes based on gene content (binary data for presence or absence of each protein family) measured by one minus the Jaccard coefficient (Jaccard distance). To identify taxon-specific genes, we calculated Cramer's V to screen protein families showing biased distributions between comparative groups. Cramer's V is a measure of the degree of correlation in contingency tables. Cramer's V values close to 0 indicate weak associations between variables, while those close to 1 indicate strong associations. We used the most stringent threshold (i.e. Cramer's V of 1) to identify S. aureus and S. simiae unique proteins or protein families. To examine over- or underrepresented functional categories in the 16 S. aureus strains relative to the single S. simiae strain, a 2 × 2 contingency table was constructed for each functional category from the COG, JCVI, KEGG, SEED, VFDB, and GO databases: (a) the number of S. aureus protein families in this category; (b) the number of S. aureus protein families not in this category; (c) the number of S. simiae protein families in this category; and (d) the number of S. simiae protein families not in this category. The odds ratio (= ad/bc) was used to rank the relative over-representation (> 1) or under-representation (< 1) of each of the functional categories.

Phylogenetic analysis
Of the 5014 protein families, 497 were shared by all the 29 strains and contained only a single copy from each strain (did not contain paralogs). This set of 497 single-copy core genes were identified as putative orthologous genes. The sequences were first aligned at the amino acid level using Probalign [49], then backtranslated to DNA. Alignment columns with a posterior probability < 0.6 were removed, and alignments with > 50% of the sites removed were discarded from the analysis. Multiple alignments with Probalign retained 491 reliably aligned genes from a set of the 497 orthologous genes. Gene trees were reconstructed using PhyML (Phylogenetic estimation using Maximum Likelihood) [50,51] with the General Time Reversible plus Gamma (GTR + G) substitution model of DNA evolution, and the Subtree Pruning-Regrafting (SPR) branch-swapping method. Each gene tree search was bootstrapped (500 pseudoreplicates) using PhyML with the Nearest-Neighbor Interchange (NNI) branch-swapping method to detect genes that support or conflict with various bipartitions. A majority rule consensus of the gene trees was constructed using the consense program of PHYLIP 3.69 [52]. All the alignments were also concatenated, and a tree search was performed using PhyML with the same settings as for the gene trees. Macrococcus caseolyticus JCSCS5402 was used as an outgroup to root the trees. We used DendroPy [53] to annotate the nodes of the estimated consensus and concatenated gene trees with the percentage of gene trees in which the node was found. Resulting phylogenetic trees were drawn using the R package APE (Analysis of Phylogenetics and Evolution) [54].