Nucleic_Acids

PMC:3505973

Massive gene acquisitions in Mycobacterium indicus pranii provide a perspective on mycobacterial evolution

Abstract

Understanding the evolutionary and genomic mechanisms responsible for turning the soil-derived saprophytic mycobacteria into lethal intracellular pathogens is a critical step towards the development of strategies for the control of mycobacterial diseases. In this context, Mycobacterium indicus pranii (MIP) is of specific interest because of its unique immunological and evolutionary significance. Evolutionarily, it is the progenitor of opportunistic pathogens belonging to the M. avium complex and is endowed with features that place it between saprophytic and pathogenic species. Herein, we have sequenced the complete MIP genome to understand its unique life style, basis of immunomodulation and habitat diversification in mycobacteria. As a case of massive gene acquisitions, 50.5% of MIP open reading frames (ORFs) are laterally acquired. We show, for the first time for Mycobacterium, that the MIP genome has a mosaic architecture. These gene acquisitions have led to the enrichment of selected gene families critical to MIP physiology. Comparative genomic analysis indicates a higher antigenic potential of MIP, imparting it a unique ability for immunomodulation. Besides, it also suggests an important role of genomic fluidity in habitat diversification within mycobacteria and provides a unique view of evolutionary divergence and putative bottlenecks that might have eventually led to intracellular survival and pathogenic attributes in mycobacteria.

INTRODUCTION

Mycobacterium indicus pranii (MIP) is a saprophytic mycobacterial species that is known for its immunomodulatory properties (1–11). In the late 1970s, this bacterium, initially coded as Mycobacterium ‘w’, was selected from a panel of atypical mycobacteria for its ability to evoke cell-mediated immune responses against M. leprae in leprosy patients (2,9). MIP, which shares antigens with both M. leprae and M. tuberculosis, provides protection against M. tuberculosis infection in mice (3,10,12,13) and accelerates sputum conversion in both type I and type II categories of tuberculosis (TB) patients when used as an adjunct to chemotherapy (14,15). In HIV/TB co-infections, a single dose of MIP converted tuberculin-negative patients to tuberculin-positive in >95% of the cases (16). This attribute is unique to MIP because similar application of other saprophytic mycobacteria such as M. vaccae does not provide commensurate protection (17). Based on its demonstrated immunomodulatory action in various human diseases, MIP is the focus of several clinical trials (Table 1) and successful completion of one such trial has led to its use as an immunotherapeutic vaccine ‘Immuvac’ against leprosy (18). However, very little information is available about MIP’s molecular, biochemical, genetic and phylogenomic features.

Table 1. Ongoing clinical trials of MIP in a diverse set of diseases

Recently, in a molecular phylogenetic study using candidate marker genes and a FAFLP (fluorescent amplified fragment length polymorphism) fingerprinting assay, we showed that MIP belongs to a group of opportunistic mycobacteria and is a predecessor of M. avium complex (MAC) (19). A comprehensive analysis of cellular and biochemical features of MIP along with chemotaxonomic markers such as FAME (fatty acid methyl ester) analysis and comparison with other mycobacterial species established that MIP is endowed with specific attributes (4). It has a growth rate (time of colony appearance ∼6–8 days) that is faster than the typical slow growers such as M. tuberculosis (∼3 weeks) and slower than typical fast growers such as M. smegmatis (∼3 days), thus placing MIP somewhere in between the slow and fast grower mycobacterial species (4). In Mycobacterium, fast growers usually represent non-pathogenic organisms whereas slow growers are usually specialized pathogens.
MIP does not cause any infection in mice, guinea pigs and monkeys, the animal models in which it has been tested (6). Biochemical analysis also showed that MIP shares several features that are exclusive to either slow growers or fast growers (4). Even the FAME profiling of MIP, a key test for appropriate taxonomic placement of microbes, and its comparison with the fatty acid complement from other mycobacterial species corroborated the placement of this saprophyte in between fast and slow growers (4). Thus, MIP represents an organism placed at an evolutionarily transitory position with respect to a fast grower and a slow grower or a saprophyte and a seasoned pathogen.

It is known that mycobacterial species represent one of the most dramatic examples of host tropism and habitat diversification. Mycobacterium has more than 125 notified species including saprophytes such as M. smegmatis, immunomodulators such as M. habana, M. vaccae and MIP, the opportunist M. avium and strict intracellular pathogens like M. tuberculosis and M. leprae. This unmatched competence of mycobacterial organisms and their diverse physiological characteristics can be attributed to genome dynamics, including genome organization, gene content, coordinated gene expression and the ability to interact with the host machinery. An important unanswered question in this context is how the soil-living saprophytic mycobacterial species turned into one of the most notorious intracellular pathogens. Thus, understanding the genomic basis of habitat diversification could be crucial in evolving effective control measures against mycobacterial infections. Unfortunately, despite the publication of several mycobacterial genomes (20–23), the details of the advent of parasitism within mycobacterial lineages remain obscure (especially within MAC), although the evolution of niche-adapted parasitic forms by genomic downsizing is an accepted norm in M. tuberculosis complex (21,23).
In fact, formal genetic studies on species differences and divergences in mycobacteria have been severely limited by the unavailability of a related organism that represents the border of optimization between saprophytic and pathogenic mycobacterial species. In prokaryotic evolution, a few species such as Shigella flexneri and Yersinia pestis have been identified, which represent an early stage of host-restricted adaptation by means of genome shedding (24). MIP, because of its unique phylogenetic placement and associated biochemical features, seems to be the first case of a mycobacterial species caught in transition just before it resorted to pathogenic adaptations. Thus, it provides a unique opportunity to understand the evolutionary divergence and putative bottlenecks responsible for the advent of the intracellular mode of survival and pathogenic attributes in mycobacteria. We have sequenced the complete MIP genome to gain an insight into its unique life style and molecular basis of immunomodulation. In addition, we have employed comparative genomics to understand habitat diversification and the functional genetic correlates responsible for the evolution of pathogenicity in ancestral mycobacterial lineages.

MATERIALS AND METHODS

Sequencing of MIP genome

The genome sequence of MIP was determined by Sanger sequencing using a hybrid strategy of sequencing shotgun libraries (2 and 5 kb) and partial sequencing of some clones of a large insert sized (>125 kb) BAC (bacterial artificial chromosome) library. Briefly, genomic DNA was isolated and whole genome shotgun libraries with average insert sizes of 2–3 kb and 4–5 kb were prepared by hydroshearing. Fragments of the required size were gel-eluted, blunt-ended and cloned in plasmid vector pUC19. Clones were randomly picked from libraries having more than 90% insert and sequenced by Sanger’s di-deoxy terminator chemistry on ABI 3700 machines.
A high quality BAC library was also prepared (www.mwg-biotech.com (15 August 2012, date last accessed)) and end sequenced by Sanger’s method to create a physical map of the MIP genome that assisted in gap filling and in resolving ambiguities in the genome assembly. Gap closing and the re-sequencing of low-quality regions were performed by sequencing the PCR products and the appropriate plasmid clones. These data were assembled by using the PHRED-PHRAP-CONSED software package on a four-processor SunFire V400 series server. Identification of open reading frames (ORFs) was carried out with the help of GLIMMER gene prediction software (25). Protein localization analysis was carried out with the help of PSORTB (26).

Comparative proteome analysis of MIP with other species

Functional annotation was carried out on the basis of sequence alignment with the known mycobacterial proteins as well as the COG (clusters of orthologous groups of proteins) (27) database with the help of the BLAST (28) package. Several Perl scripts were developed in-house for data analysis. To understand the effect of gene variations on habitat diversification in mycobacterial species with respect to MIP, we performed BLAST analysis of the MIP proteome against the proteomes of 18 other mycobacterial species used in this study. They were assigned to specific lineages of pathogenic and saprophytic mycobacteria based on their characteristic features, habitat and available literature. The members of M. tuberculosis complex, including M. marinum and M. ulcerans, and those belonging to M. avium complex were categorized as the pathogenic group, whereas the rest were grouped as environmental mycobacteria. The positive hits against MIP proteins were filtered out and the remaining genes (unique with respect to the species under investigation) were analysed for their function based on COG classification and were quantified.
This dataset was obtained for each species of both groups and was viewed as variation in unique gene content in each function category with respect to MIP.

Analysis of rate of natural selection (Ka/Ks analysis)

To understand the role of selection on speciation in MAC, the orthologous groups of genes between MIP and M. avium subsp. hominissuis (MAH, human strain) and between MIP and M. avium paratuberculosis (MAP, animal strain) were identified by using the InParanoid program (29). This method bypasses multiple alignments and phylogenetic tree-based conventional approaches to detect orthology and thus minimizes any bias arising from the alignment or phylogeny method in the identification of orthologs. First, all possible pairwise similarity scores that scored higher than a cutoff value (bit score ≥ 50, overlap ≥ 70%, e ≤ 10^-10) were detected from all-against-all BLAST comparisons and then the reciprocal genome-specific best hits were marked as orthologs. The orthologs were subsequently classified based on functional categories as per the similarity searches against the COG database. The orthologs were aligned by using ClustalW (30) and each alignment was manually inspected for its correctness. Pairwise estimates of the non-synonymous (Ka) and synonymous (Ks) substitution rates were obtained with the KaKs_Calculator program by using a maximum likelihood method based on the HKY85 model (31).

Analysis of lateral gene acquisitions in MIP

A combination of parametric methods, comparative genomics and phylogenetic approaches was employed to predict laterally acquired genes in MIP. First of all, we employed the three most popular parametric approaches, namely Alien Hunter (32), genomic signature analysis (33) and analysis of the atypical GC content of each ORF. Alien Hunter implements an interpolated variable order motifs theory to predict compositionally deviating regions with the highest recall value.
The genome sequence of MIP was scanned and the fine tuning of the co-ordinates of alien regions was carried out by using the advanced optimization algorithm available in Alien Hunter. Besides, each MIP ORF was analysed for its length and nucleotide composition with respect to total and positional G + C contents (G + C [T], G + C [1], G + C [2] and G + C [3]). The genes were considered as extraneous on the basis of G + C content if their total G + C [T] content deviated by >1.5 σ from the mean value of their genome or if the deviations of G + C [1] and G + C [3] were of the same sign and at least one of them was >1.5 σ (34). The genes shorter than 300 bp and the ribosomal genes were excluded from this analysis to avoid spurious results. We further augmented our analysis of the MIP genome by using the genomic signature based method previously used for mycobacteria (33,35). The genes that were scored by more than one method in these analyses were considered as laterally acquired. The genes confirmed by both genomic signature and GC content based methods were referred to as recently acquired and these signatures were used to ascertain their likely source of acquisition.

Further, we used the power of comparative genomics by analysing MIP genes for their presence/absence across available mycobacterial species (E-value ≤ e−10). MIP regions having a non-uniform gene distribution across various mycobacteria, which are not scored by Alien Hunter, were annotated as RRD (regions of restricted distribution of genes). An RRD has been defined as a region in the MIP genome that harbors genes absent in a minimum of 33% of the species investigated in this study and is represented by at least three contiguous genes or a region of >3 kb. The genes that are absent in more than 50% of the species investigated were then referred to as laterally acquired, in RRDs and elsewhere in the MIP genome.
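The G + C criterion described above can be illustrated with a minimal Python sketch. It is not the in-house pipeline used in the study: the function names are hypothetical, the mean and standard deviation are computed over the supplied ORFs (an assumption standing in for "the mean value of their genome"), and the exclusion of ribosomal genes is left to the caller.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_profile(orf):
    """Total G + C and G + C at codon positions 1 and 3 for one in-frame ORF."""
    return {"T": gc_fraction(orf), "1": gc_fraction(orf[0::3]), "3": gc_fraction(orf[2::3])}


def atypical_gc_orfs(orfs, threshold=1.5, min_length=300):
    """Return ids of ORFs flagged by the >1.5 sigma G + C rules.

    orfs: dict mapping gene id -> coding sequence (hypothetical input format).
    ORFs shorter than min_length bp are skipped, as in the text.
    """
    profiles = {g: gc_profile(s) for g, s in orfs.items() if len(s) >= min_length}
    if len(profiles) < 2:
        return set()
    # Mean and standard deviation of each G + C measure (assumption: over the ORF set).
    stats = {}
    for key in ("T", "1", "3"):
        values = [p[key] for p in profiles.values()]
        stats[key] = (mean(values), pstdev(values))
    flagged = set()
    for gene, profile in profiles.items():
        dev = {}
        for key, (mu, sigma) in stats.items():
            dev[key] = (profile[key] - mu) / sigma if sigma > 0 else 0.0
        # Rule 1: total G + C deviates by more than the threshold.
        rule_total = abs(dev["T"]) > threshold
        # Rule 2: positions 1 and 3 deviate in the same direction, one beyond the threshold.
        same_sign = dev["1"] * dev["3"] > 0
        rule_positional = same_sign and max(abs(dev["1"]), abs(dev["3"])) > threshold
        if rule_total or rule_positional:
            flagged.add(gene)
    return flagged
```

Called on a dictionary of ORF sequences, `atypical_gc_orfs` returns the subset of gene ids whose composition is atypical under either rule; the threshold and minimum length default to the values quoted in the text.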
All the genes identified as possible lateral acquisitions in MIP were probed against the COG database to analyse the functional role of gene acquisitions. Besides, laterally acquired genes were analysed by the BLASTP algorithm against ACLAME (36), a database dedicated to the classification of mobile genetic elements (MGEs).

Other in silico analysis and stress experiments

CRISPR analysis was performed by using CRISPRFinder (37). Annotation of transporter genes was carried out by TransAAP (38). Pathogenic islands were inferred from PAIDB (39), the pathogenic island database. MIP was analysed by using the Virulence factor database (VFDB) to ascertain the status of genes associated with virulence (40). In silico prediction of antigenicity was carried out with VAXIJEN (41). PFAM (http://pfam.sanger.ac.uk (15 August 2012, date last accessed)) was used to analyse and draw protein domains in a scaled manner. The Motif scan tool at the MyHits web server (http://myhits.isb-sib.ch (15 August 2012, date last accessed)) was used for further analysis of proteins and motifs (42). Phylogenetic analysis was performed by using the maximum likelihood method available on the Phylogeny.fr server (43). The influence of nutritional stress on MIP was evaluated on the basis of viable cell count at different time points (44).

Statistical analyses

Variations in gene distribution across different lineages were analysed by two-way ANOVA followed by Bonferroni posttests. P < 0.05 was considered as statistically significant. For studying natural selection, Fisher’s exact test for small samples (built into the KaKs_Calculator program) was applied to justify the validity of Ka and Ks calculated in this study. Only the ortholog pairs with P < 0.05 were considered for further analysis to infer the rate of natural selection. A paired t test was performed to ascertain the significance of the difference in the rate of selection between different organisms (P < 0.05).
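The filtering of ortholog pairs by the Fisher's exact test P-value and the per-category summary of Ka/Ks can be sketched as follows. This is an illustrative Python example only; the record layout (keys `cog`, `ka`, `ks`, `p`) and the function name are assumptions, not the output format of KaKs_Calculator.

```python
from collections import defaultdict


def mean_kaks_by_category(pairs, alpha=0.05):
    """Average Ka/Ks per COG category over ortholog pairs that pass the P-value filter.

    pairs: iterable of dicts with hypothetical keys 'cog', 'ka', 'ks' and 'p'.
    Pairs with p >= alpha, or with Ks of zero (ratio undefined), are skipped.
    """
    ratios = defaultdict(list)
    for pair in pairs:
        if pair["p"] >= alpha or pair["ks"] <= 0:
            continue
        ratios[pair["cog"]].append(pair["ka"] / pair["ks"])
    return {cog: sum(values) / len(values) for cog, values in ratios.items()}
```

A mean ratio well below 1 for a category would be read as purifying selection, and a ratio above 1 as positive selection, in line with the interpretation used in this study.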
Total number of mycobacterial species analysed in this study (n = 18)

The genome sequences along with annotation for the following organisms were downloaded from NCBI genome databanks and used in this study: M. marinum, M. ulcerans, M. tuberculosis H37Rv, M. tuberculosis H37Ra, M. tuberculosis CDC1551, M. tuberculosis F11, M. bovis, M. leprae, M. bovis BCG, M. avium subsp. paratuberculosis, M. avium 104, M. smegmatis, M. gilvum, M. abscessus, M. vanbaalenii, M. sp. JLS, M. sp. KMS and M. sp. MCS.

RESULTS AND DISCUSSION

Genome sequencing and general features of MIP genome

Sequencing of the MIP (DSM 45239T) genome was carried out by a whole genome shotgun (WGS) approach. A total of 109 792 paired end reads, providing more than 10× coverage of the MIP genome, were generated from randomly picked shotgun clones from both ∼2 and ∼5 kb shotgun libraries, followed by gap filling and sequence improvement. Sequence assembly with PHRAP resulted in the assembly of 93 592 shotgun sequences leading to a single circular MIP chromosome of 5 589 007 bp (Figure 1). This was subsequently validated by a BAC end sequence based physical map of the MIP genome. Mycobacterial genomes range from 3.5 to 7 Mb and MIP, with a size of ∼5.6 Mb, represents a moderate genome size, which is larger than all known organisms of MAC. The genome contains 5270 predicted ORFs (at a density of ∼1 gene/kb), a single rRNA operon and 45 tRNA genes; these ORFs account for ∼91% of the genome (Table 2). The mean G + C content of the MIP genome is 68%. However, the cumulative nucleotide skew analysis revealed several regions with a G + C content clearly divergent from this mean value, which cover a considerable area of the MIP genome and constitute potential sites to investigate for laterally acquired genes (Figure 1). The putative ‘ori’ in the MIP genome was identified by a relatively AT rich region with characteristic DnaA boxes and a typical gene order of ‘rpnP-dnaA-dnaN’.
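As a rough illustration of how compositionally divergent regions can be located along a chromosome, the following Python sketch computes per-window G + C content together with the cumulative GC skew, (G − C)/(G + C) summed over successive windows. The window size and function name are assumptions chosen for the example; the study does not state the parameters it used.

```python
def gc_scan(genome, window=1000):
    """Return (gc_content, cumulative_skew), one value per non-overlapping window.

    gc_content[i]      : fraction of G + C bases in window i
    cumulative_skew[i] : running sum of (G - C) / (G + C) up to and including window i
    """
    genome = genome.upper()
    gc_content, cumulative_skew = [], []
    running = 0.0
    for start in range(0, len(genome), window):
        chunk = genome[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content.append((g + c) / len(chunk))
        # Windows without any G or C contribute nothing to the skew.
        running += (g - c) / (g + c) if g + c else 0.0
        cumulative_skew.append(running)
    return gc_content, cumulative_skew
```

Windows whose G + C content departs markedly from the genome mean (68% for MIP) would be candidates for closer inspection, while turning points of the cumulative skew curve are commonly used as a guide to the origin and terminus of replication.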
‘ATG’ was found to be the most frequent start codon (56.5%) followed by ‘GTG’ (37.5%) and ‘TTG’ (5.9%). Like M. tuberculosis, MIP has an even distribution of ORFs on both strands with respect to the direction of replication (2656 ORFs on the leading strand and 2614 on the lagging strand) (4). PSORTB analysis indicated that 55.5% of MIP proteins are cytoplasmic in nature, 13.5% are localized in the cytoplasmic membrane and only 3.5% are extra-cellular in nature (26). However, the precise localization of 27.5% of the proteins could not be ascertained.

Figure 1. Circular representation of MIP genome. Whole genome sequencing of MIP revealed that it harbors a single circular chromosome of 5 589 007 bp. The accuracy of genome data assembly is ensured by a BAC end sequence based physical map of the MIP genome. The size of the MIP genome is much larger than the genome of any member of M. avium complex and thus is in agreement with the progenitor status of MIP (19). The red and blue tracks represent ORFs predicted in the sense and anti-sense orientation in relation to the ori (origin of replication). The innermost track represents the GC skew, wherein sharp peaks of violet and yellow represent regions of AT and GC richness, respectively, and constitute potential targets for lateral gene analysis.

Table 2. General genomic features of MIP

BLAST-based comparative analysis of MIP ORFs (at a cut off value of ≥70% amino acid identity) revealed their maximum similarity with MAC organisms, which are evolutionarily close to MIP (Supplementary Figure S1). This is followed by M. marinum, with which MIP shares over 51% of its coding sequences (CDS) (Supplementary Table S1). This observation is consistent with the status of MIP as the progenitor of MAC and supports the idea of a shared aquatic past between saprophytic and pathogenic mycobacteria (19,45). With M. tuberculosis, MIP shares only ∼40% of its proteins.
However, the number of MIP ORFs (∼68%) shared by closely related MAC species strikingly differs in comparison with other related mycobacteria, which usually share over 90% of coding sequences even at identity >95% (22). This divergence could be a critical component for the elicitation of a robust yet unique immune response upon vaccination with MIP.

Functional classification of MIP proteins

To facilitate functional studies, MIP proteins were subjected to BLAST analysis against the COG database, which serves as a platform for functional annotation of newly sequenced genomes and for studies on genome evolution (27). On the basis of similarity with COG proteins, it was possible to assign functions to ∼80% of MIP proteins but ∼20% of the proteins still remain un-annotated. More significantly, ∼7.5% of proteins are unique to MIP and show no significant homology with other proteins present in mycobacterial proteomes. Several of these candidate orthologs are present in gene clusters, which are absent from most of the other mycobacteria, thus indicating the modular nature of gene acquisitions or deletions in mycobacteria. Our analysis shows that 41.5% of MIP proteins belong to the ‘Metabolism’ category, 11.5% to ‘ISP’ (information storage and processing) and 9.5% to ‘CPS’ (cellular processes and signaling), whereas 16.7% are ‘poorly’ categorized proteins (Figure 2). Within the ‘Metabolism’ category, the genes pertaining to lipid transport and metabolism (I) were over-represented (22.5%), closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, the majority of the proteins were related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%) and translation, ribosomal structure and biogenesis (J) (24.5%).
In the case of ‘CPS’, the major representation comes from cell wall/membrane/envelope biogenesis (M) (27.6%) followed by posttranslational modifications (O) and signal transduction mechanisms (T) at 23 and 21.4%, respectively (Figure 2).

Figure 2. Functional classification of MIP proteins. (A) Representation of the MIP proteome based on the similarity of its proteins with the COG database (27). (B) represents the distribution in the cell processing and signaling category (CPS), (C) denotes the distribution of poorly characterized proteins in MIP, while (D) and (E) stand for information storage and processing (ISP) and ‘metabolism’ related genes, respectively. It is evident that ∼42% of total MIP genes are involved in basic metabolic functions and ∼21% do not have any homology in the COG database. Within the ‘metabolism’ category, the genes involved in lipid transport and metabolism (I) are over-represented (22.5%), closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, the majority of the proteins are related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%).

Comparative proteome analysis of MIP with other species reveals the role of genomic fluidity in habitat diversification in Mycobacterium

COG-based comparative analysis of gene distribution across mycobacterial proteomes highlights the presence of distinct genome fluidity. ‘ISP’ and ‘Metabolism’ proteins vary considerably, with the maximum flexibility being observed in replication, recombination and repair (L) for ‘ISP’ and in lipid transport and metabolism (I) and secondary metabolites biosynthesis and transport (Q) for ‘Metabolism’ (Figure 3). The minimum variations are observed in ‘CPS’, with nearly all sub-categories exhibiting a consistent representation.
In ‘ISP’, the distribution of genes across all mycobacterial proteomes is almost consistent for translation, ribosomal structure and biogenesis (J) and chromatin structure and RNA processing (B), while a clear genomic fluidity is exhibited by the genes belonging to replication, recombination and repair (L). This category is least represented in MIP (3%) and maximally in M. ulcerans (10%). Similarly, the genes belonging to category K (transcription) are least represented in CDC1551 (5%) and maximally in M. smegmatis (9.4%), which is consistent with its saprophytic habitat.

Figure 3. Comparative analysis of distribution of different mycobacterial proteomes under various COG functional categories. Different mycobacterial proteomes were downloaded from NCBI and subjected to COG-based BLAST analysis. The contribution of each functional category was calculated to observe the pattern of relative gene distribution across different mycobacterial species and plotted on this graph. (A) Distribution across ‘Metabolism’ category and various sub categories, (B) cell processing and signaling (CPS) and (C) information storage and processing (ISP). ‘X’ and ‘Y’ axis represent mycobacterial species and the number of mycobacterial proteins (in percentage), respectively. Our comparative analysis clearly highlights the presence of distinct genome fluidity in mycobacterial species across different functional groupings of genes. This genomic fluidity within different functional groups of proteins may contribute to the habitat diversification observed in mycobacterial species.

In ‘Metabolism’, while the genes related to nucleotide transport and co-enzyme transport show a consistent distribution, the genes belonging to secondary metabolite biosynthesis and transport (Q), amino acid transport (E) and lipid transport and metabolism (I) show major quantitative variations.
‘I’ has the maximum representation in MAC organisms such as MAH (∼11%) and MAP (10%), followed by MIP (9.5%), while ‘E’ and ‘Q’ are best represented in M. smegmatis (9.5%) and MAC organisms (9–10%), respectively (Figure 3). In the case of carbohydrate transport and metabolism (G), all mycobacterial species have almost an equal representation except M. smegmatis, which harbors almost twice (7%) the percentage of genes dedicated to this function in other mycobacterial species. In most COG categories, M. leprae seems to have a distinctly biased distribution of proteins, probably indicative of the extensive gene loss that the organism has undergone during evolution (21). Of all the mycobacteria, MIP has the least representation in the L (3%) and E (amino acid transport) (4.3%) categories of genes.

Although the distribution of genes is a species-specific attribute, variations in gene distribution across different lineages could provide an idea about the role of genomic fluidity in shaping the behavior of mycobacteria as saprophytes or host-adapted pathogens. Hence, to get a comprehensive picture of habitat transformation, mycobacterial species were classified in two groups according to their known attributes: pathogenic (PGN), comprising M. tuberculosis complex (including M. marinum and M. ulcerans) and M. avium complex, and saprophytic or environmental (ENV) mycobacteria, comprising M. smegmatis, M. vanbaalenii, M. gilvum and others. MIP was placed in between saprophytic and pathogenic mycobacterial species because of its unique intermediate position and these two groups were investigated for the effect of gene variations in different COG classes with MIP as a common background (4). A two-way ANOVA analysis was performed to ascertain the statistical significance of the analysis. While the transition from ENV–MIP was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function) [P < 0.001, 0.01, 0.001, 0.01, 0.001 and 0.001, respectively], ENV–PGN transitions involved extensive gene variations (Figure 4). In addition to the gene reduction observed in the earlier mentioned classes, the ENV–PGN transition showed a reduction in genes related to energy metabolism (C, G, I and Q [P < 0.05, 0.05, 0.001 and 0.05, respectively]) and a significant increase in L (replication, recombination and repair, P < 0.01) and N (cell motility and secretion, P < 0.01) related genes. Noticeably, the habitat change from MIP to PGN lineages was primarily due to the loss of genes involved in I (lipid transport and metabolism, P < 0.001) and Q (secondary metabolite biosynthesis and transport, P < 0.001) and gain of genes in L, E and S [P < 0.001, 0.001 and 0.05, respectively] (Figure 4). This observation is consistent with a reduced habitat diversity of pathogenic mycobacteria and points toward a role of genomic fluidity within selected gene functions in habitat specification. An increase in the representation of ‘L’ with the advent of pathogenicity offers an interesting paradigm, which warrants further studies in the model organisms.

Figure 4. Quantitative analysis of gene variations involved in habitat transformation in mycobacteria. This cartoon depicts variations across major functional gene groupings as mycobacterial species adapted to a pathogenic lifestyle from free-living environmental mycobacteria. Red lines denote loss of genes while the green ones denote gene gain with a change of habitat. Although the transition from ENV–MIP was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function), ENV–PGN transitions involved extensive gene variations and are consistent with the intermediate evolutionary position of MIP. Significantly, only two major gene categories reported gain of genes associated with the advent of pathogenicity: L (DNA replication, recombination and repair) and E (amino acid transport and metabolism). But transition from a purely saprophytic lineage to a pathogenic habitat is associated with genes categorized into ‘L’ only, which also contains transposon elements. Indeed, we observe that a saprophytic mycobacterium like MIP has only 38 transposon-like elements compared with 302 found in the similar-sized pathogenic mycobacterial species M. ulcerans. A two-way ANOVA analysis was used to ascertain statistical significance.

Role of natural selection in speciation in Mycobacterium

Measurement of the rate of non-synonymous (leading to change in amino acid) and synonymous (silent) nucleotide substitutions in protein-coding DNA sequences is the most referred criterion for detecting natural selection in molecular evolutionary analysis (46). Significantly higher non-synonymous nucleotide substitutions (Ka) over the synonymous (Ks) ones are interpreted as evidence of positive natural selection. Hence, to understand the contribution of selection in speciation, we have used closely related and phylogenetically independent species of M. avium complex, of which MIP is a predecessor (19). Orthologs were identified using the InParanoid tool (29) and a dataset of ∼2600 gene pairs representing >80% of the orthologs shared among different species of MAC was obtained to perform comparative analysis of the rate of selection for human-adapted (MIP–MAH) and animal-adapted (MIP–MAP) niches from a saprophytic MIP.
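The reciprocal-best-hit step with the cutoffs given in the methods (bit score ≥ 50, overlap ≥ 70%, e ≤ 10^-10) can be sketched in Python as below. This is a simplified illustration and not the InParanoid program itself, which additionally clusters in-paralogs; the tuple layout of the BLAST hits and both function names are assumptions made for the example.

```python
def best_hits(hits, min_bits=50.0, min_overlap=0.70, max_evalue=1e-10):
    """Map each query to its highest-scoring subject among hits passing the cutoffs.

    hits: iterable of (query, subject, bit_score, overlap_fraction, evalue) tuples
    (a hypothetical layout; real BLAST tabular output would need parsing first).
    """
    best = {}
    for query, subject, bits, overlap, evalue in hits:
        if bits < min_bits or overlap < min_overlap or evalue > max_evalue:
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **cutoffs):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, **cutoffs)
    backward = best_hits(b_vs_a, **cutoffs)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

Given the two directional hit lists for, say, MIP versus MAH, `reciprocal_best_hits` returns the gene pairs that would be carried forward to alignment and Ka/Ks estimation.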
The evaluation of rate of selection (Ka/Ks) revealed strong purifying selection (∼0.06) acting on both human and animal adapted lineages. However, further resolution of analysis based on protein function revealed a significant difference in the rate of selection only for the genes involved in energy production and conversion (C) (P < 0.03, unpaired t test) (Figure 5). Also, very few genes, mostly distributed in metabolic pathways, were found to have undergone strong positive selection (Supplementary Table S2) suggesting their relevance in undergoing niche-specific adaptations in Mycobacterium. A very strong positive selection (>50 times of average rate) was observed in ComEC (MIP2580) (47), the competence protein required for exogenous DNA uptake during natural transformation, which can critically influence the ability to acquire foreign DNA in microbial species. Incidentally, we found this gene to be pseudogenized in MAP, which usually results from an excessive positive selection. In a recent study (48) based on SNP analysis, it was argued that recombination may influence the rate of selection in extremely closely related species of M. tuberculosis complex (average nucleotide identity >98% across different species). Even though MIP is likely to have minimal homologous recombination events because of sequence heterogeneity with MAP and MAH, the likelihood of recombination and lateral gene transfer influencing the rate of selection cannot be completely discounted.

Figure 5. Role of natural selection in speciation in MAC. Analysis of average rate of natural selection (Ka/Ks) among MIP–MAP and MIP–MAH lineages revealed the presence of a similar purifying selection (46). This implies that both mycobacterial lineages have undergone an independent evolution into their respective host adapted forms from MIP.
However, a significant skew in selection rate (Ka/Ks) is observed in genes categorized into energy production and conversion (C), and thus establish the role of metabolism-related genes in the evolution of host tropism. Also, a strong positive selection (>50 times of average selection) was observed on ComEC gene that encodes a competence protein required for DNA uptake and natural transformation (47). Such a strong selection on this gene indicates that ComEC has played an important role in modulating the efficiency of DNA uptake during mycobacterial evolution. In fact, in the case of MAP, this gene is found to be pseudogenized, which usually results from an excessive positive selection. Identification of laterally transferred genes reveals massive gene acquisitions and mosaic architecture of MIP genome Identification of laterally acquired genes is an important paradigm, which is cardinal to gain a deeper insight into microbial evolution (49). Hence, after analysing the role of natural selection in speciation, we were keen to analyse the contribution of lateral gene acquisitions in MIP. The precise and accurate prediction of lateral gene transfer (LGT) events in an organism is challenging. First, detection of LGT may be influenced not only by source, size and quantity of lateral transfer but also by the genetic features associated with the recipient or host genome (50). Besides, LGT takes place by a variety of means and different tools may be required for better detection of LGT based on specific mechanisms of gene transfer (51). It is also known that different surrogate methods detect lateral acquisitions of different antiquities (52). Hence, all LGT are not amenable to detection by a single parametric method and the application of a combination of different methods is recommended to improve sensitivity of detection in different possible situations (50). 
However, while the simple addition of predictions from individual methods may increase false-positive rates, the consideration of strictly overlapping predictions as the inclusion criteria for LGT predictions is counterproductive because of the limited overlap of genes observed between different approaches (52,53). Nonetheless, it has been argued that even if the errors inherent to these individual methods are added, the overall benefit is worthy (50). Hence, to predict laterally acquired genes in MIP, we used three different parametric methods based on anomalous GC content of each ORF, genomic signature analysis and Alien Hunter predictions to score for likely LGT candidates. The genes were scored as laterally acquired only if they were predicted more than once. This would not only provide sensitivity of detection but also reduce the number of false-positive predictions associated with individual methods. We further augmented our analysis by using information on phylogenetic approaches and phyletic distribution of MIP genes in other mycobacterial species as an additional stand alone criterion to score LGT genes (54). Analysis of atypical GC content of ORFs (34) identified ∼28.5% (1503/5270) of MIP genes as putative candidates for LGT. A similar analysis with M. tuberculosis (MTB), M. avium paratuberculosis (MAP) and M. avium subsp. hominissuis (MAH) could only identify 4.3, 6.5 and 11.3% genes, respectively (55). Genomic signature approach could identify ∼33% of MIP genes as candidate LGT’s as compared to MTB, MAP and MAH wherein this approach yielded only 6%, 6.3 and 9.3% genes, respectively. Alien Hunter (32) predicted 85 probable laterally acquired regions (AL) comprising of 1298 (24.63%) ORFs (Supplementary Table S3). The regions around ‘ori’ (15 kb on both sides, upstream as well as downstream) and one harboring ribosomal genes were excluded from the evaluation to remove any possible bias. 
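The 'predicted more than once' inclusion rule can be sketched as a simple vote count across methods. This is only an illustration under stated assumptions: the ORF identifiers and the per-method prediction sets below are hypothetical, and the function name `majority_lgt` is not from the original analysis.

```python
from collections import Counter
from typing import Dict, Set


def majority_lgt(predictions: Dict[str, Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep ORFs flagged by at least `min_votes` of the prediction methods."""
    votes = Counter(orf for orfs in predictions.values() for orf in orfs)
    return {orf for orf, n in votes.items() if n >= min_votes}


# Hypothetical outputs of the three parametric methods.
predictions = {
    "gc_content": {"orf1", "orf2", "orf3"},
    "genomic_signature": {"orf2", "orf3", "orf4"},
    "alien_hunter": {"orf3", "orf5"},
}

lgt = majority_lgt(predictions)                   # predicted by >= 2 methods
union = set().union(*predictions.values())        # simple addition of all methods
strict = set.intersection(*predictions.values())  # strictly overlapping predictions
```

In this toy example the union contains five ORFs and the strict intersection only one, while the majority rule keeps two, which mirrors the trade-off between sensitivity and false positives discussed above.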
By using similar criteria with Alien Hunter, however, we could identify putative laterally acquired genes in MTB (21.5%), MAP (15.35%) and MAH (24.2%). More than 42% of MIP genes predicted by Alien Hunter are also shared by genomic signature analysis. A similar analysis with MTB, MAP and MAH showed an overlap of 17% (148/865), 31% (215/678) and 33.7% (421/1247), respectively, between Alien Hunter–predicted genes and genomic signature–based predictions. After applying our ‘majority’ based inclusion criteria, ∼6.2% of the genes emerged as laterally acquired in MTB, while MAP and MAH have 8.3 and 10.2% genes as LGT, respectively. By using this approach, ∼34% of MIP genes emerged as laterally acquired, which is significantly higher than in other mycobacterial species analysed in this study. A comparative analysis of MIP ORFs based on their restricted distribution within the other mycobacterial genomes identified additional 939 ORFs as plausible lateral acquisitions. This included 362 ORFs harbored by 93 defined RRDs (regions of restricted distribution of genes) in MIP (Supplementary Table S4) and 261 ORFs present in alien regions. The incongruence observed in the phylogenetic analysis of some of these genes substantiated their laterally acquired nature. Overall, 50.5% (2664/5270) of MIP ORFs appear to be laterally acquired highlighting thereby the scale of evolutionary novelties undergone by this microbe (Figure 6). This study represents the first report of such massive gene acquisitions in mycobacteria and suggests mosaic architecture of MIP genome. Figure 6. Depiction of lateral gene acquisitions in MIP. Each column depicts one MIP gene and each row depicts one mycobacterial genome (total 18 genomes comprising of M. tuberculosis complex, M. avium complex and saprophytic mycobacterial species–see Materials and Methods and Supplementary Table S1 for species list); green and red denote presence and absence, respectively, of the MIP gene in other genomes. 
Pink denotes the regions predicted by Alien Hunter (32). Dark yellow represents RRDs, while black columns denote recently acquired genes identified by using atypical GC content and gene signatures (34, 35). Orange arrows denote the positions of tRNAs, blue denotes genes with homologs in genomes other than mycobacteria, while brown denotes absence in the COG database. A very good overlap is observed between genes identified by using different methods. Most of the alien regions and RRDs overlap with red, substantiating the effectiveness and accuracy of our approach. It is noteworthy that over 50% of the MIP genome has emerged as laterally acquired, the highest reported so far for any of the mycobacterial species. The figure is scaled to approximation with each figure row denoting 1 Mb of genome and every tick mark denoting 100 kb along the lane. Analysis of laterally acquired genes by using COG functional classification revealed the maximum gain in the lipid transport and metabolism category (I) followed by transcription (K)-related genes, which are usually under-represented among laterally acquired genes in prokaryotes (56). This was followed by the genes affiliated to secondary metabolites biosynthesis, transport and catabolism (Q) and energy production and conversion (C); these four categories together constitute ∼35% of the total lateral acquisitions. In addition, the 1478 LGT predictions (28.05% of ORFs) based on atypical GC content of ORFs (34) and further validated by genomic signatures (Figure 7A) appear to retain their native genomic imprints, which are yet to be masked by natural selection. This points toward their relatively recent acquisition and hence, their likely source could be ascertained (33). Analysis based on genomic signatures revealed that the majority of these recently acquired genes (∼85%) are most likely derived from actinobacterial species (Figure 7B) like Streptomyces (∼25%), Amycolatopsis (∼15%), Rhodococcus (7.5%) and Frankia (6.5%). 
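A genomic signature comparison of this kind can be sketched as follows, assuming that a signature is the normalized frequency profile of all 256 tetranucleotides. The Euclidean distance and the toy sequences are assumptions made for illustration; they are not the scoring scheme of the tool used in this study (33).

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Return normalized tetranucleotide frequencies of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # ignores windows with non-ACGT bases
    if total == 0:
        raise ValueError("sequence has no valid tetranucleotides")
    return [counts[k] / total for k in KMERS]


def signature_distance(seq_a, seq_b):
    """Euclidean distance between two signatures (an illustrative choice)."""
    fa, fb = tetra_freqs(seq_a), tetra_freqs(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


# Toy sequences: a GC-rich "genome", a compositionally similar gene
# and an AT-rich gene standing in for a recent acquisition.
genome = "GCGCGGCC" * 50
native_gene = "GCGCGGCC" * 10
alien_gene = "ATATTAAT" * 10

d_native = signature_distance(genome, native_gene)
d_alien = signature_distance(genome, alien_gene)
```

A gene whose signature lies far from that of the host genome, like `alien_gene` here, is a candidate for recent acquisition; comparing its signature against reference genomes in the same way would then suggest a likely donor.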
These gene acquisitions might have been mediated by physical proximity and close interactions among different actinobacteria. Figure 7. Identification of recent lateral gene acquisitions in MIP and their analysis. (A) Individual gene signatures of recently acquired MIP genes along with the whole genome signature of MIP establish the alien nature of respective genes (33). These gene signatures are based on the frequency of distribution of tetranucleotide pattern across the whole genome and individual genes of MIP, which are color coded to generate a visual impression. (B) Distribution of recently acquired genes with respect to their most likely source of acquisition. Based on the genomic signatures, our analysis revealed that majority of these recently acquired genes (∼85%) are possibly derived from actinobacterial species. Mobile elements based gene acquisition in MIP are dominated by plasmid-mediated lateral gene transfers LGT events are usually mediated by mobile genetic elements like phages, transposons and plasmids. BLAST analysis of laterally acquired genes against ACLAME (36), a database of mobile elements comprising all known phage genomes, plasmids and transposons, indicated mobile elements as likely source to 27.4% of these putative laterally acquired ORFs (<e−20). Majority of these genes exhibit similarity with plasmids and extremely small fraction with phages (2%) and IS elements (∼1.2%). The relative paucity of phage and IS elements mediated gene acquisitions and abundance of plasmid-acquired genes in case of MIP is surprising. In comparison with other mycobacterial species of similar sized genomes such as M. ulcerans (chromosome size ∼5.6 Mb), which has 302 IS elements/transposons (23), MIP has merely 38 genes harboring sequences consistent with IS signatures. 
This is consistent with our earlier analysis based on genomic fluidity across different mycobacterial lineages where variation in the number of transposable elements, which are classified in category L, was found to be associated with habitat diversification. It is tempting to envisage that MIP may harbor specific genomic determinants that either provide immunity from phages and transposons, or else predispose MIP toward plasmid-based gene acquisitions. MIP has a higher number of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat) elements than other mycobacterial genomes (7 compared with 1–2) (37). These CRISPR elements not only provide immunity against invasion by phages and other viruses (57) but also limit the mobility of IS elements in the genome and help in their excision from the genome (58). MIP also lacks the RD1 region, the loss of which facilitates efficient conjugation with plasmids and other chromosomes to promote rapid acquisition of genes (59). In addition, we found that MIP is particularly enriched in transporters of the septal DNA translocator family (6 as against 1–2 present in other mycobacteria) (Table 3) that are known to bring about rapid acquisition of genes by mediating cell-to-cell DNA transfer during plasmid conjugation (60). The abundance of these genomic determinants coupled with the absence of the RD1 locus may contribute to the propensity of MIP towards plasmid-mediated gene acquisitions. Table 3. Comparative transporter analysis of MIP with other mycobacterial species MSMEG, M. smegmatis; MAP, M. avium subsp. paratuberculosis; MTB, M. tuberculosis. 
Effect of lateral gene acquisitions on different gene families in MIP Two distinct observations have emerged from this analysis: (i) a large number of lateral gene acquisitions in MIP have been mediated through mobile elements with only a small contribution through phages and (ii) gene distribution among laterally acquired regions in MIP follows a skewed pattern with respect to function as indicated by over-representation of genes belonging to certain categories. To know the influence of LGT events on distribution of genes across MIP gene families, we performed a comprehensive analysis that showed that CYP450 is the largest gene family in MIP with 66 members and a gene density of ∼12/Mb. This is remarkably high in comparison with other mycobacterial species such as M. tuberculosis (4.5/Mb), M. smegmatis (5.6/Mb) and M. marinum (7.1/Mb). The analysis of Cytochrome P450 database (http://drnelson.uthsc.edu/CytochromeP450.html (15 August 2012, date last accessed)) revealed that MIP harbors the highest number of genes from CYP450 family among prokaryotes sequenced so far and ∼ 46% (30/66) of these genes are laterally acquired. Approximately 27% of these genes are recent acquisitions, suggesting a recent expansion of CYP450 family. Approximately 11% of CYP450 genes were identified as unique by International CYP450 nomenclature commission and have been classified into three new families and two new sub-families of CYP450 (Table 4). The context-based analysis based on gene neighborhood suggested the role of these genes in the utilization of unusual carbon sources, a key to adaptability and survival of MIP at its most likely habitat at soil–water interface (19). Table 4. List of CYP450 ORFs unique to MIP aNomenclature as per International Committee of CYP450 nomenclature. ‘A1’ refers to the first member of a new CYP450 family, whereas ‘B1’ refers to the first member of a new CYP450 sub-family. 
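The gene density quoted above is simply the number of family members divided by genome size. The short sketch below reproduces the arithmetic for the CYP450 family, assuming a genome size of about 5.6 Mb; the function name and the rounded genome size are illustrative assumptions.

```python
def gene_density(n_genes, genome_size_mb):
    """Return genes per megabase for a gene family."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return n_genes / genome_size_mb


GENOME_SIZE_MB = 5.6  # approximate MIP genome size, assumed for illustration

# 66 CYP450 members over ~5.6 Mb gives a density close to 12 genes/Mb.
cyp450_density = gene_density(66, GENOME_SIZE_MB)
```

The same calculation applies to any other gene family, which makes densities comparable across genomes of different sizes.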
MIP has 66 genes of PE–PPE family (16 PE and 50 PPE), which encompass complete repertoire of PPE genes present in various MAC species. These PPE genes appear to be selectively inherited by various species of MAC as evident from their distribution in MAP (36 genes), MAH (38 genes), M. avium subsp. avium (36 genes) and M. intracellulare (38 genes), respectively (61). This observation substantiates that evolutionarily MIP is a predecessor of MAC and endorses our earlier findings based on the rate of natural selection that speciation and habitat diversification has taken place independently from MIP. Comparative analysis of PE–PPE genes in MIP (with Nr database at NCBI) highlighted that five genes of PPE–SVP sub-family are unique to MIP. In addition, several of its PE genes are evolutionarily closer to those belonging to M. tuberculosis complex, which is in agreement with the shared evolutionary history of MIP and predecessors of M. tuberculosis complex (19). The presence of PPE gene family is unique to mycobacteria among prokaryotes; however, its origin remains unknown. A large number of these genes are laterally acquired in MIP prompting us to speculate that PE–PPE genes might have been introduced into mycobacteria through mobile elements. Majority of PE–PPE gene clusters in MIP harbor genes related to mobile function activity such as phages, tRNA or 13e12 repeats in their vicinity. Besides, several of these PE–PPE genes exhibited the presence of Ig-like motifs often present in the proteins of tailed double stranded DNA bacteriophage particles. However, the most clinching evidence about the origin of these PE–PPE genes in mycobacteria emerged from the presence of a PPE protein containing intact prophage of ∼40 kb that we observed in MAH genome during ACLAME analysis (36), a direct evidence for a phage-mediated acquisition of PE–PPE family members. 
Hemerythrins: a versatile gene family laterally acquired in MIP The most surprising finding of MIP gene analysis, however, is the unusual presence of ORFs belonging to the hemerythrin (Hr) protein family, the oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids. MIP has 10 ORFs having significant similarity to hemerythrin (Hr) genes (Figure 8) in comparison with one or two copies in most prokaryotes (62). The prevalence of ‘Hr’ proteins strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at the oxic–anoxic interface (OAI) and in the anoxic regions of the marine habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with an abundance of hemerythrins (62). Figure 8. Distribution of hemerythrins in MIP. Domain mapping and BLAST searches indicated the presence of 10 ORFs belonging to hemerythrin genes in MIP. Hemerythrin proteins (Hr) are oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids, and most prokaryotes have only one or two copies (62). The abundance of these genes strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at the oxic–anoxic interface and in the anoxic regions of the habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with an abundance of hemerythrins (62,63). The ability of hemerythrins to reversibly bind oxygen at higher oxygen concentrations and release it in anoxic conditions could also provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, has been shown to grow at 0% oxygen, to reach a plateau in 3 days and die thereafter (64). MIP2918 and MIP2747 did not harbor a definite ‘Hr’ domain and are identified by BLAST searches against the NCBI ‘Nr’ database. 
Domain mapping and comparative genomics with available mycobacterial genomes indicated the presence of two Hr domains in several MIP ORFs such as MIP2750, MIP2759, MIP5034 and MIP6380, a trait usually restricted only to proteobacteria (62). We could also identify putative homologs of ‘hr’ genes in mycobacteria such as M. marinum and M. ulcerans (1 each), M. tuberculosis (3), M. gilvum and M. smegmatis (4 each), M. avium complex (5), other environmental mycobacteria like Mycobacterium sp. KMS, Mycobacterium sp. JLS and M. vanbaalenii (3 each) and none in the case of M. leprae. Considering the variations in the number of ‘hr’ genes in various species of mycobacteria and their significant sequence heterogeneity, it appears that acquisition of hemerythrins could have been a selective and independent event facilitating mycobacterial evolution. Incidentally, 90% of these ORFs in MIP are laterally acquired with over one-half of them being recent acquisitions. It should be noted that the efficiency of hemerythrins as oxygen storage proteins is directly dependent on the oxygen concentration in the surrounding environment (63). The selective enrichment of MIP with Hr proteins and the ability of hemerythrins to reversibly bind oxygen at higher oxygen concentrations and release it in anoxic conditions could provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, can manage to grow at 0% oxygen, reach a plateau in 3 days and die thereafter (64). Membrane transporters in MIP Transport systems play a critical role in life-sustaining processes such as metabolism, metal homeostasis and secondary metabolite production, thereby affecting the physiology and lifestyle of the organism. In MIP, a total of 222 genes were annotated as membrane transporters comprising ∼4.2% of the total gene content and a transporter density of 39.73/Mb (Table 3). 
This is an apparent reflection on the unique evolutionary position of MIP as its transporter density is significantly lower than that of saprophytic M. smegmatis (60.43/Mb) and higher than M. tuberculosis (33.64/Mb), MAP (35.2/Mb) and M. leprae (17.2/Mb). It is likely that M. smegmatis and MIP, being saprophytic in nature need extensive transport machinery to support their life style, whereas intracellular organisms owing to their relatively stable environment have a reduced transporter requirement. Comparative analysis revealed a selective abundance of transporters belonging to septal DNA Translocator family (S-DNA-T) with distinct homology with FtsK/SpoIIIE proteins, which could primarily be responsible for high propensity of MIP toward plasmid conjugation and gene acquisitions (60). Another unique seven gene cluster of CPA3 (Proton Antiporter-3) family present in RRD79 (Supplementary Table S4), the largest region with restricted distribution, is responsible for species defining ability of MIP to grow on 5% NaCl (4). The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila (Figure 9) that can grow in high salt concentration (65). Along with Mn2+ transporters, MIP also possesses an abundance of catalases (4) and superoxide dismutases (5), some of them being laterally acquired, which mitigate oxidative stress and reflect not only upon the primitive origin of MIP but also equip it for intracellular adaptations. MIP can withstand the carbon starvation as evident from our experiments on nutritional stress. After 5 days of growth in PBS without any media or nutritional supplement, MIP exhibited no significant reduction in log CFU, reflecting upon its potential to undergo longer period of starvation. Thus, MIP appears to have fine-tuned its specific transport abilities by lateral gene acquisitions to gain physiological attributes required for its unique habitat. Figure 9. Phylogenetic analysis of CPA3 family cluster. 
This cluster is unique to MIP among mycobacterial species and each ORF of this complex encodes different subunits of a unique Na+/H+ antiporter. All genes have been laterally acquired as a unit, hence, a representative single gene is used to perform phylogenetic analysis by using maximum likelihood method available in Phylogeny Fr. Server (43). The numbers along the branches denote bootstrap values. The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila that can grow under salt conditions (3% NaCl w/v) (65). Genome-enabled ecophysiological and metabolic attributes of MIP and influence of LGT events The presence of a very high number of alternate sigma factors (24 in comparison with 13 in M. tuberculosis and MAP) endows MIP with a complex transcriptional flexibility necessary for it to respond to its unique life style (66). The interface life style (soil/water) of MIP, as substantiated by its genetic features, prompted us to look for the bio-degradative capabilities of MIP. In addition to the abundance of CYP450 genes, MIP has also laterally acquired homologs of 3-octaprenyl-4-hydroxybenzoate carboxy-lyase that are involved in anaerobic metabolism of phenol during degradation of plant substrates (67). Besides, as enlisted in Supplementary Table S5, it also possesses complete cyanide and thiocyanate biodegradation machinery including the complete enzyme complex (MIP3820-22) of thiocyanate hydrolase (alpha, beta and gamma subunits). This complex degrades thiocyanate and produces CO2 and NH3 that can be used by MIP as a nitrogen source (68). Notably, thiocyanate gene cluster is absent from all the pathogenic mycobacteria analyzed in this study (including M. abscessus and the opportunists of MAC). 
Although the ability of MIP to degrade different compounds and utilize diverse sources of carbon is evident, the presence of an intact hydrogenase enzyme complex (Table 5) provides evidence for its chemolithotrophic nature even though further research is required to establish the functionality of this complex. The loss of hydrogenases concurs well with the advent of pathogenicity in mycobacteria across different lineages, an observation corroborated by previous studies on mycobacterial hydrogenases (69). Loss of accessory protein coding genes, which are required for the maturation and assembly of the hydrogenase complex as well as integration of different metal ions, renders this complex non-functional in immediate descendants of MIP (i.e. MAC species). Table 5. List of MIP ORFs encoding hydrogenase gene cluster In addition to these unique metabolic characteristics of MIP, fundamental differences were observed in the organization of lipid metabolic machinery, which is cardinal to the physiology and behavior of mycobacterial species (70). Although the genetic machinery required for synthesis and modification of mycolic acids is present in MIP (70), a major reshuffle is observed in the methoxy mycolic acid synthase gene operon. Sequence analysis further suggested the absence of papA5, a gene encoding polyketide-associated protein (Pap) required for the synthesis of virulence associated phthiocerol dimycocerosate (PDIM) (71). These traits are in agreement with the observation that members of MAC do not synthesize PDIMs. Further analysis demonstrated the presence of a glycopeptidolipid (GPL) biosynthesis locus, which is a hallmark of antigenic diversity in MAC and appears to be laterally acquired in MIP. This gene cluster in MIP harbors an ORF (MIP4595) sharing significant similarity with the ‘gsc’ gene of MAP. This gene constitutes a pathogenic island in pathogenic mycobacteria including M. tuberculosis (72). 
However, a comparative analysis of this locus with other MAC sequevars revealed the interruption of this locus by a six-gene cluster exclusive to MIP with four of these six genes being transposable elements. Thus, this GPL locus acts as a hotspot for transposon integration and is likely to play an important role in MIP’s unique biological attributes by influencing GPL biosynthesis. Non-pathogenic attributes of MIP and immunome analysis MIP as discussed before is non-infectious in mouse, guinea pig and monkey models (2,6,17,73). However, investigation of MIP against PAIDB (74), the pathogenic islands database identified the presence of three regions in MIP with genomic attributes similar to PAGI islands of Pseudomonas aeruginosa. These included a gene cluster (MIP227–MIP247) similar to PAGI 1 pathogenic island of P. aeruginosa isolated from a patient with a urinary tract infection (39); a PAG3 like genomic island (MIP272–MIP283) and another region homologous to PAI (MIP333–MIP343), a region similar to tcd (toxin complex D) island of Photorhabdus luminescens (75). A comparative analysis of MIP proteins with virulence factors database (VFDB), a comprehensive compilation of all known virulence factors, also revealed the presence of most of the genes in MIP that are reportedly associated with virulence in other mycobacterial species (40). Pathogenesis is a multi-factorial phenomenon that requires pathogen to attach, infect, sustain, proliferate and eventually disseminate itself inside the host. Hence, the loss of a component responsible for any of these functions is likely to result in the attenuation of virulence or pathogenicity. Thus, despite having PE-PPE genes and mce1 operon, which enable mycobacteria to invade the host cell, MIP lacks both mce2 and mce3 operons, which are essential for causing macrophage infections by M. tuberculosis and M. avium (76–78). The mce3 as well as mce2 mutants of M. 
tuberculosis are attenuated in mice although the latter shows no growth defect in macrophages. The mce2 mutant of M. tuberculosis elicits an altered immune response and exhibits no lung pathology along with enhanced survival in mice (76–78). Likewise, MIP lacks phospholipase (plc) ABCD genes, which are responsible for acquiring host fatty acids for their use as a potential carbon source during persistent infections both in tuberculous and non-tuberculous mycobacterial infections (79). Another factor crucial for mycobacterial pathogenicity is associated with the presence of latency-related genes that confer on mycobacteria the ability to survive and grow in microaerophilic environment for prolonged period of time. The devS/devR two-component system, essential for maintenance of dormant state in low oxygen conditions, is conspicuous by its absence in MIP (79). In addition to RD1 locus and toxin–antitoxin system, in silico studies further identified MIP as a natural mutant of anthranilate phosphoribosyltransferase gene trpD, which is involved in tryptophan biosynthesis (81,82). The absence of these critical determinants may severely compromise MIP’s ability to survive inside the host as the infection with MIP has been found to be self-limiting and clears off within 6–7 weeks (17). The limited survival of MIP in low oxygen inside macrophages despite the absence of devS/devR two-component system can be attributed to the prevalence of ‘Hr’ proteins. In silico analysis revealed a much higher fraction of putative antigenic proteins in MIP in comparison with BCG (Figure 10), and a majority among them being contributed by lateral acquisitions emphasizing the importance of LGT events in augmenting its immune potential. Besides, the significant sequence heterogeneity observed between MIP and M. tuberculosis proteins (as mentioned earlier) would render MIP proteins acquiescent to generate novel T-cell epitopes resulting in an enhanced immune response. 
Our analysis revealed that of the 36 proteins shared by MIP and M. leprae, which were absent in M. bovis BCG, 29 were highly immunogenic in nature (Table 6). The most prominent putative antigenic proteins were MIP0340 and MIP5962, both belonging to Hsp20 family and share a close similarity with the 18 kDa small heat shock protein of M. leprae (83). This protein bears several T-cell epitopes and generates CD4+ T-cell mediated immune response, a hallmark of protection against tuberculosis. Similarly, MIP7697 is a homolog of M. leprae protein MLep2649 that encodes a protein with excellent T-cell stimulating properties, which responds to more than 60% of tuberculosis patients (84). The presence of such immunodominant and productive antigens in MIP may potentiate the expression of an antigenic profile better than BCG against M. tuberculosis infection. Figure 10. Comparative analysis of immunomes of MIP and BCG and contribution of LGT events. In silico immunome analysis of MIP and its comparison with BCG revealed the presence of a greater number of antigenic proteins in MIP (41). This may subscribe to the unique potential of MIP for immunomodulation against various types of infections. Noteworthily, a significant proportion of these immunogenic proteins appear to be laterally acquired in MIP. Table 6. List of MIP ORFs shared between MIP and M. leprae and absent from BCG aAs predicted by in silico analysis of MIP proteins by VAXIJEN software at default parameters (41). In summary, different analyses performed in this study establish that MIP represents an organism at a unique phylogenetic point as the immediate predecessor of opportunistic mycobacterial species of MAC. It is also evident that natural selection in MAC has acted in a preferential manner on specific categories of genes leading to reduced habitat diversity of pathogenic bacteria, and thus facilitating host tropism. 
The genome of MIP is ∼5.6 Mb in size and is shaped by a large number of lateral gene acquisitions thus revealing, for the first time, mosaic architecture of a mycobacterial genome. Thus, this study offers a paradigm shift in our understanding of evolutionary divergence, habitat diversification and advent of pathogenic attributes in mycobacteria. A scenario for mycobacterial evolution is envisaged wherein the earliest evolving soil derived mycobacterial species like MIP underwent massive gene acquisitions to attain a unique soil–water interface habitat before adapting to an aquatic and parasitic lifestyle. These lateral acquisition events were selective and possibly facilitated by the presence of specific genetic factors (i.e. ComEC) that induce competence to acquire large chunks of DNA to confer immediate survival advantage to the recipient organism. The genes, such as members of ‘Hr’ family, acquired to assist mycobacteria survive in fluctuating oxygen levels, would have been instrumental in the initial advent of pathogenicity in the aquatic opportunistic mycobacterial species. Subsequently, mycobacterial species tuned their genetic repertoires to respective host adapted forms with a high degree of genomic fluidity aided by selective lateral gene acquisitions and gene loss by deletion or pesudogenization (19). Importantly, a significant increase in transposon elements in the pathogenic mycobacteria as compared with MIP, for the first time, suggests their possible role toward mycobacterial virulence and would be interesting to explore. In addition, comparative genomic analysis revealed a higher antigenic potential of MIP subscribing to its unique ability for immunomodulation against various types of infections and presents a template to develop reverse genetics based approaches to design better strategies against mycobacterial infections. ACCESSION NUMBERS MIP genome has been submitted to the genome depository at NCBI (accession no. CP002275). 
SUPPLEMENTARY DATA Supplementary Data are available at NAR Online: Supplementary Tables 1–5 and Supplementary Figure 1. FUNDING MIP Genome sequencing program was funded by the Department of Biotechnology, Government of India. V.S. acknowledges the Council of Scientific and Industrial Research (CSIR), New Delhi, for the award of research fellowship. Akhilesh K. Tyagi, Anil Kumar Tyagi and S.E. Hasnain are thankful to Department of Science and Technology, Government of India for J.C. Bose National Fellowships. S.E.H. is a visiting professor, King Saud University, Riyadh, Kingdom of Saudi Arabia and J.P.K. is a Tata Innovations Fellow. Funding for open access charge: University of Delhi, India. Conflict of interest statement. None declared.


Title	Massive gene acquisitions in Mycobacterium indicus pranii provide a perspective on mycobacterial evolution
Abstract	Understanding the evolutionary and genomic mechanisms responsible for turning the soil-derived saprophytic mycobacteria into lethal intracellular pathogens is a critical step towards the development of strategies for the control of mycobacterial diseases. In this context, Mycobacterium indicus pranii (MIP) is of specific interest because of its unique immunological and evolutionary significance. Evolutionarily, it is the progenitor of opportunistic pathogens belonging to M. avium complex and is endowed with features that place it between saprophytic and pathogenic species. Herein, we have sequenced the complete MIP genome to understand its unique life style, basis of immunomodulation and habitat diversification in mycobacteria. As a case of massive gene acquisitions, 50.5% of MIP open reading frames (ORFs) are laterally acquired. We show, for the first time for Mycobacterium, that MIP genome has mosaic architecture. These gene acquisitions have led to the enrichment of selected gene families critical to MIP physiology. Comparative genomic analysis indicates a higher antigenic potential of MIP imparting it a unique ability for immunomodulation. Besides, it also suggests an important role of genomic fluidity in habitat diversification within mycobacteria and provides a unique view of evolutionary divergence and putative bottlenecks that might have eventually led to intracellular survival and pathogenic attributes in mycobacteria.
Body	INTRODUCTION Mycobacterium indicus pranii (MIP) is a saprophytic mycobacterial species that is known for its immunomodulatory properties (1–11). In late 70s, this bacterium, initially coded as Mycobacterium ‘w’, was selected from a panel of atypical mycobacteria for its ability to evoke cell mediated immune responses against M. leprae in leprosy patients (2,9). MIP, which shares antigens with both M. leprae and M. tuberculosis, provides protection against M. tuberculosis infection in mice (3,10,12,13) and accelerates sputum conversion in both type I and type II category of tuberculosis (TB) patients when used as an adjunct to chemotherapy (14,15). In HIV/TB co-infections, a single dose of MIP converted tuberculin −ve patients into tuberculin +ve in >95% of the cases (16). This attribute is unique to MIP because similar application of other saprophytic mycobacteria such as M. vaccae does not provide commensurate protection (17). Based on its demonstrated immunomodulatory action in various human diseases, MIP is the focus of several clinical trials (Table 1) and successful completion of one such trial has led to its use as an immunotherapeutic vaccine ‘Immuvac’ against leprosy (18). However, very little information is available about MIP’s molecular, biochemical, genetic and phylogenomic features. Table 1. Ongoing clinical trials of MIP in a diverse set of diseases Recently, in a molecular phylogenetic study by using candidate marker genes and FAFLP (fluorescent-amplified fragment length polymorphism techniques) fingerprinting assay, we showed that MIP belongs to a group of opportunistic mycobacteria and is a predecessor of M. avium complex (MAC) (19). A comprehensive analysis of cellular and biochemical features of MIP along with chemotaxonomic markers such as FAME (fatty acid methyl ester) analysis and comparison with other mycobacterial species established that MIP is endowed with specific attributes (4). 
It has a growth rate (time of colony appearance ∼6–8 days) that is faster than the typical slow growers such as M. tuberculosis (∼3 weeks) and slower in comparison with typical fast growers, such as M. smegmatis (∼3 days), and thus placing MIP somewhere in-between the slow and fast grower mycobacterial species (4). In Mycobacterium, fast growers usually represent non-pathogenic organisms whereas slow growers are usually specialized pathogens. MIP does not cause any infection in mice, guinea pigs and monkeys, the animal models in which it has been tested (6). Biochemical analysis also showed that MIP shares several features that are exclusive to either slow growers or fast growers (4). Even the FAME profiling of MIP, a key test for appropriate taxonomic placement of microbes, and its comparison with the fatty acid complement from other mycobacterial species corroborated the placement of this saprophyte in between fast and slow growers (4). Thus, MIP represents an organism placed at an evolutionarily transitory position with respect to a fast grower and a slow grower or a saprophyte and a seasoned pathogen. It is known that mycobacterial species represent one of the most dramatic examples of host tropism and habitat diversification. Mycobacterium has more than 125 notified species including saprophytes such as M. smegmatis, immunomodulators such as M. habana, M. vaccae and MIP, opportunist M. avium and strict intracellular pathogens like M. tuberculosis and M. leprae. This unmatchable competence of mycobacterial organisms and their diverse physiological characteristics can be attributed to the genome dynamics including genome organization, gene content, coordinated gene expression and ability to interact with the host machinery. An important unanswered question in this context remains as to how the soil-living saprophytic mycobacterial species turned into one of the most notorious intracellular pathogens. 
Thus, understanding of the genomic basis of habitat diversification could be crucial in evolving effective control measures against mycobacterial infections. Unfortunately, despite the publication of several mycobacterial genomes (20–23), the understanding and details of advent of parasitism within mycobacterial lineages remain obscure (especially within MAC) although the evolution of niche adapted parasitic forms by genomic downsizing is an accepted norm in M. tuberculosis complex (21,23). In fact, formal genetic studies on species differences and divergences in mycobacteria have been severely limited by the unavailability of a related organism that represents the border of optimization between saprophytic and pathogenic mycobacterial species. In prokaryotic evolution, a few species such as Shigella flexneri and Yersinia pestis have been identified, which represent an early stage of host restricted adaptation by means of genome shedding (24). MIP, because of its unique phylogenetic placement and associated biochemical features, seems to be the first case of a mycobacterium species caught in transition just before it resorted to the pathogenic adaptations. Thus, it provides a unique opportunity to understand evolutionary divergence and putative bottlenecks responsible for the advent of intracellular mode of survival and pathogenic attributes in mycobacteria. We have sequenced the complete MIP genome to gain an insight into its unique life style and molecular basis of immunomodulation. In addition, we have employed comparative genomics to understand the habitat diversification and bases and means of functional genetic correlates responsible for evolution of pathogenicity in ancestral mycobacterial lineages. 
MATERIALS AND METHODS Sequencing of MIP genome The genome sequence of MIP was determined by employing Sanger sequencing by using a hybrid strategy of sequencing shot gun libraries (2 and 5 kb) and partial sequencing of some clones of large insert sized (>125 kb) BAC (bacterial artificial chromosome) library. Briefly, genomic DNA was isolated and whole genome shotgun libraries with average insert size of 2–3 kb and 4–5 kb were prepared by hydroshearing. Fragments of required size were gel-eluted, blunt-ended and cloned in plasmid vector pUC19. Clones were randomly picked from libraries having more than 90% insert and sequenced by Sanger’s di-deoxy terminator chemistry on ABI 3700 machines. A high quality BAC library was also prepared (www.mwg-biotech.com (15 August 2012, date last accessed)) and end sequenced by employing Sanger’s method to create a physical map of MIP genome that assisted in gap filling and resolving the ambiguities in genome assembly. Gap closing and the re-sequencing of low-quality regions were performed by sequencing the PCR products and the appropriate plasmid clones. These data were assembled by using the PHRED-PHRAP-CONSED package of software on four processor SunFire V400 series of server. Identification of open reading frames (ORFs) was carried out with the help of GLIMMER gene prediction software (25). Protein localization analysis was carried out with the help of PSORTB (26). Comparative proteome analysis of MIP with other species Functional annotation was carried out on the basis of sequence alignment with the known mycobacterial proteins as well as the COG (clusters of orthologous groups of proteins) (27) database with the help of BLAST (28) package. Several perl scripts were developed in-house for data analysis. 
To understand the effect of gene variations on the habitat diversification in mycobacterial species with respect to MIP, we performed BLAST analysis of MIP proteome against the proteomes of 18 other mycobacterial species used in this study. They were assigned to specific lineages of pathogenic and saprophytic mycobacteria based on their characteristic features, habitat and available literature. The members of M. tuberculosis complex including M. marinum and M. ulcerans and those belonging to M. avium complex were categorized as the pathogenic group whereas the rest were grouped as environmental mycobacteria. The positive hits against MIP proteins were filtered out and remaining genes (unique with respect to species under investigation) were analysed for their function based on COG classification and were quantified. This dataset was obtained for each species of both groups and was viewed as variation in unique gene content in each function category with respect to MIP. Analysis of rate of natural selection (Ka/Ks analysis) To understand the role of selection on speciation in MAC, the orthologous groups of genes between MIP and M. avium subsp. hominissuis (MAH, human strain) and between MIP and M. avium paratuberculosis (MAP, animal strain) were identified by using InParanoid program (29). This method bypasses multiple alignments and phylogenetic tree-based conventional approaches to detect orthology and thus minimizes any bias arising due to alignment or phylogeny method in the identification of orthologs. First, all possible pairwise similarity scores that scored higher than a cutoff value (bit score ≥ 50, overlap ≥ 70%, e ≤ 10^−10) were detected from all-against-all BLAST comparisons and then the reciprocal genome-specific best hits were marked as orthologs. The orthologs were subsequently classified based on functional categories as per the similarity searches against COG database. 
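The reciprocal best-hit step described above can be sketched in a few lines. This is only an illustration of the cutoff filter and the reciprocal check as stated in the text, not the InParanoid program itself (which additionally clusters in-paralogs). The `Hit` record, the function names and the gene identifiers in the usage note are assumptions introduced here; the overlap is assumed to be expressed as a fraction of the query length.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout, not a BLAST output format)."""
    query: str
    subject: str
    bit_score: float
    overlap: float  # aligned fraction of the query, 0..1 (assumed definition)
    evalue: float


def passes_cutoff(hit: Hit, min_bits: float = 50.0,
                  min_overlap: float = 0.70, max_evalue: float = 1e-10) -> bool:
    """Apply the cutoffs quoted in the text: bit score >= 50, overlap >= 70%, e <= 1e-10."""
    return (hit.bit_score >= min_bits
            and hit.overlap >= min_overlap
            and hit.evalue <= max_evalue)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for every query, the subject with the highest bit score among passing hits."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not passes_cutoff(hit):
            continue
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

Called with two hit lists (for example MIP against MAH and MAH against MIP), `reciprocal_best_hits` returns only the pairs that are each other's top-scoring hit after filtering; a gene whose best partner points elsewhere is left unpaired.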
The orthologs were aligned by using ClustalW (30) and each alignment was manually inspected for its correctness. Pairwise estimates of the non-synonymous (Ka) and synonymous (Ks) substitution rates were obtained by KaKs_Calculator program by using a maximum likelihood method based on the HKY85 model (31). Analysis of lateral gene acquisitions in MIP A combination of parametric methods, comparative genomics and phylogenetic approaches was employed to predict laterally acquired genes in MIP. First of all, we employed the three most popular parametric approaches, namely Alien Hunter (32), genomic signature analysis (33) and analysis of the atypical GC content of each ORF. Alien Hunter implements an interpolated variable order motifs theory to predict compositionally deviating regions with the highest recall value. The genome sequence of MIP was scanned and the fine tuning of the co-ordinates of alien regions was carried out by using the advanced optimization algorithm available in Alien Hunter. Besides, each MIP ORF was analysed for its length and nucleotide composition with respect to total and positional G + C contents (G + C [T], G + C [1], G + C [2] and G + C [3]). The genes were considered as extraneous on the basis of G + C content if their total G + C [T] content deviated by >1.5 σ from the mean value of their genome or if deviations of G + C [1] and G + C [3] were of the same sign and at least one of them was >1.5 σ (34). The genes shorter than 300 bp and the ribosomal genes were excluded from this analysis to avoid any extraneous results. We further augmented our analysis of MIP genome by using genomic signature based method previously used for mycobacteria (33,35). The genes that were scored by more than one method in these analyses were considered as laterally acquired. 
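The G + C criterion above can be illustrated with a minimal sketch. It assumes that the mean and standard deviation are taken over the ORFs passed in (the text only says "the mean value of their genome"), that ORFs are given in frame, and that ribosomal genes have already been removed by the caller; the function names and the 300 bp / 1.5 σ defaults as parameters are choices made here for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def gc_profile(orf: str) -> Dict[str, float]:
    """Total G+C ('T') and G+C at codon positions 1 and 3 of an in-frame ORF."""
    return {"T": gc_fraction(orf),
            "1": gc_fraction(orf[0::3]),
            "3": gc_fraction(orf[2::3])}


def atypical_gc_orfs(orfs: Dict[str, str], min_len: int = 300,
                     k: float = 1.5) -> List[str]:
    """Flag ORFs whose G+C composition deviates from the ORF-set mean.

    Rule 1: total G+C deviates by more than k standard deviations.
    Rule 2: G+C[1] and G+C[3] deviate in the same direction and at least
            one of them by more than k standard deviations.
    ORFs shorter than min_len are skipped.
    """
    profiles = {name: gc_profile(seq)
                for name, seq in orfs.items() if len(seq) >= min_len}
    stats = {}
    for key in ("T", "1", "3"):
        values = [profile[key] for profile in profiles.values()]
        stats[key] = (mean(values), pstdev(values))

    flagged = []
    for name, profile in profiles.items():
        # z-score per measure; a zero spread yields a zero deviation
        z = {key: (profile[key] - mu) / sd if sd > 0 else 0.0
             for key, (mu, sd) in stats.items()}
        total_rule = abs(z["T"]) > k
        positional_rule = (z["1"] * z["3"] > 0
                           and max(abs(z["1"]), abs(z["3"])) > k)
        if total_rule or positional_rule:
            flagged.append(name)
    return sorted(flagged)
```

Because the thresholds are relative to the spread of the input set, the sketch is only meaningful when run over a whole genome's ORFs rather than a handful of sequences.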
The genes confirmed by both genomic signature and GC content based methods were referred to as recently acquired and these signatures were used to ascertain their likely source of acquisition. Further, we used the power of comparative genomics by analysing MIP genes for their presence/absence across available mycobacterial species (e ≤ 10^−10). MIP regions having a non-uniform gene distribution across various mycobacteria, which are not scored by Alien Hunter, were annotated as RRD (regions of restricted distribution of genes). RRD has been defined as the region in MIP genome, which harbors the genes that are absent in a minimum of 33% of species investigated in this study and is at least represented by three contiguous genes or a region of >3 kb. The genes, which are absent in more than 50% of the species investigated, were then referred to as laterally acquired in RRDs and elsewhere in MIP genome. All the genes identified as possible lateral acquisitions in MIP were probed against COG database to analyse the functional role of gene acquisitions. Besides, laterally acquired genes were analysed by BLASTP algorithm against ACLAME (36), a database dedicated to the classification of mobile genetic elements (MGEs). Other in silico analysis and stress experiments CRISPR analysis was performed by using CRISPRFinder (37). Annotation of transporter genes was carried out by TransAAP (38). Pathogenic islands were inferred from PAIDB (39), the pathogenic island database. MIP was analysed by using Virulence factor database (VFDB) to ascertain the status of genes associated with virulence (40). In silico prediction of antigenicity was carried out with VAXIJEN (41). PFAM (http://pfam.sanger.ac.uk; 15 August 2012, date last accessed) was used to analyse and draw protein domains in a scaled manner. Motif scan tool at MyHits web server (http://myhits.isb-sib.ch (15 August 2012, date last accessed)) was used for further analysis of proteins and motifs (42). 
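The phyletic-distribution rules above lend themselves to a short sketch. The code below is a simplified illustration under stated assumptions: each gene carries a list of booleans (present/absent in each comparison species), genes are supplied in genome order, and only the "three contiguous genes" condition of the RRD definition is implemented (the alternative ">3 kb" condition is omitted). The function names and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Gene = Tuple[str, Sequence[bool]]  # (gene id, presence flag per comparison species)


def absent_fraction(presence: Sequence[bool]) -> float:
    """Fraction of comparison species in which the gene has no hit."""
    return sum(1 for present in presence if not present) / len(presence)


def find_rrd_runs(genes: List[Gene], min_absent: float = 0.33,
                  min_genes: int = 3) -> List[List[str]]:
    """Runs of at least min_genes contiguous genes, each absent from at least
    min_absent of the species (size-based >3 kb alternative not handled)."""
    runs: List[List[str]] = []
    current: List[str] = []
    for gene_id, presence in genes:
        if absent_fraction(presence) >= min_absent:
            current.append(gene_id)
            continue
        if len(current) >= min_genes:
            runs.append(current)
        current = []
    if len(current) >= min_genes:  # close a run that reaches the last gene
        runs.append(current)
    return runs


def laterally_acquired(genes: List[Gene], threshold: float = 0.5) -> List[str]:
    """Genes absent from more than `threshold` of the species."""
    return [gene_id for gene_id, presence in genes
            if absent_fraction(presence) > threshold]
```

The two functions are deliberately separate: a region can qualify as an RRD at the 33% level while only some of its genes cross the stricter 50% threshold used to call an individual gene laterally acquired.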
Phylogenetic analysis was performed by using the maximum likelihood method available on the Phylogeny.fr server (43). Influence of nutritional stress on MIP was evaluated on the basis of viable cell count at different time points (44). Statistical analyses Variations in gene distribution across different lineages were analysed by two-way ANOVA followed by Bonferroni posttests. P < 0.05 was considered as statistically significant. For studying natural selection, Fisher’s exact test (built in KaKs_Calculator program) for the small sample was applied to justify the validity of Ka and Ks calculated in this study. Only the ortholog pairs with P < 0.05 were considered for further analysis to infer the rate of natural selection. Paired t test was performed to ascertain the significance in the rate of selection between different organisms (P < 0.05). Total number of mycobacterial species analysed in this study (n = 18) The genome sequence along with annotation for the following organisms were downloaded from NCBI genome databanks and used in this study: M. marinum, M. ulcerans, M. tuberculosis H37Rv, M. tuberculosis H37Ra, M. tuberculosis CDC1551, M. tuberculosis F11, M. bovis, M. leprae, M. bovis BCG, M. avium subsp. paratuberculosis, M. avium 104, M. smegmatis, M. gilvum, M. abscessus, M. vanbaalenii, M. sps. JLS, M. sps. KMS and M. sps. MCS. RESULTS AND DISCUSSION Genome sequencing and general features of MIP genome Sequencing of MIP (DSM 45 239T) genome was carried out by whole genome shotgun (WGS) approach. A total of 109 792 paired end reads, comprising more than 10× coverage of MIP genome, were generated from randomly picked shotgun clones from both ∼2 and ∼5 kb shotgun libraries followed by gap filling and sequence improvement. Sequence assembly with PHRAP resulted in the assembly of 93 592 shotgun sequences leading to a single circular MIP chromosome of 5 589 007 bp (Figure 1). This was subsequently validated by a BAC end sequence based physical map of MIP genome. 
Mycobacterial genomes range from 3.5 to 7 Mb and MIP with a size of ∼5.6 Mb represents a moderate genome size, which is larger than all known organisms of MAC. The genome contains 5270 predicted ORFs (at a density of ∼1 gene/kb), a single rRNA operon and 45 tRNA genes; these ORFs account for ∼91% of the genome (Table 2). The mean G + C content of MIP genome is 68%. However, the cumulative nucleotide skew analysis revealed several regions with a G + C content clearly divergent from this mean value, which cover considerable area in MIP genome and constitute potential sites to investigate for laterally acquired genes (Figure 1). The putative ‘ori’ in MIP genome was identified by a relatively AT rich region with characteristic DnaA boxes and a typical gene order of ‘rpnP-dnaA-dnaN’. The ‘ATG’ was found to be the most frequent start codon (56.5%) followed by ‘GTG’ (37.5%) and ‘TTG’ (5.9%). Like M. tuberculosis, MIP has an even distribution of ORFs on both strands with respect to the direction of replication (2656 on the leading strand and 2614 ORFs on lagging strand) (4). PSORTB analysis indicated that 55.5% of MIP proteins are cytoplasmic in nature, 13.5% are localized in the cytoplasmic membrane and only 3.5% are extra-cellular in nature (26). However, the precise localization of 27.5% of the proteins could not be ascertained. Figure 1. Circular representation of MIP genome. Whole genome sequencing of MIP revealed that it harbors a single circular chromosome of 5 589 007 bp. The accuracy of genome data assembly is ensured by a BAC end sequence based physical map of MIP genome. The size of MIP genome is much larger than the genome of any member of M. avium complex and thus is in agreement with the progenitor status of MIP (19). The red and blue tracks represent ORFs predicted in the sense and anti-sense orientation in relation to the ori (origin of replication). 
The inner most track represents the GC skew wherein sharp peaks of violet and yellow represent regions of AT and GC richness, respectively, and constitute potential targets for lateral gene analysis . Table 2. General genomic features of MIP BLAST-based comparative analysis of MIP ORFs (at a cut off value of ≥70% amino acid identity) revealed their maximum similarity with MAC organisms, which are evolutionarily close to MIP (Supplementary Figure S1). This is followed by M. marinum with which MIP shares over 51% of its coding sequences (CDS) (Supplementary Table S1). This observation is consistent with the status of MIP as the progenitor of MAC and supports the idea of a shared aquatic past between saprophytic and pathogenic mycobacteria (19,45). With M. tuberculosis, MIP shares only ∼40% of its proteins. However, the number of MIP ORFs (∼68%) shared by closely related MAC species strikingly differs in comparison with other related mycobacteria, which usually share over 90% of coding sequences even at identity >95% (22). This divergence could be a critical component for the elicitation of a robust yet unique immune response upon vaccination with MIP. Functional classification of MIP proteins To facilitate functional studies, MIP proteins were subjected to BLAST analysis against the COG database, which serve as a platform for functional annotation of newly sequenced genomes and for studies on genome evolution (27). On the basis of similarity with COG proteins, it was possible to assign functions to ∼80% of MIP proteins but ∼20% of the proteins still remain un-annotated. More significantly, ∼7.5% of proteins are unique to MIP and show no significant homology with other proteins present in mycobacterial proteomes. Several of these candidate orthologs are present in gene clusters, which are absent from most of the other mycobacteria, and thus indicating the modular nature of gene acquisitions or deletions in mycobacteria. 
Our analysis shows that 41.5% of MIP proteins belong to ‘Metabolism’ category, 11.5% to ‘ISP’ (information storage and processing), and 9.5% to ‘CPS’ (cellular processes and signaling) whereas 16.7% are ‘poorly’ categorized proteins (Figure 2). Within ‘Metabolism’ category, the genes pertaining to lipid transport and metabolism (I) were over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, majority of the proteins were related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%) and translational, ribosomal structure and biogenesis (J) (24.5%). In case of ‘CPS’, major representation comes from cell wall/membrane/envelope biogenesis (M) (27.6%) followed by posttranslational modifications (O) and signal transduction mechanisms (T) at 23 and 21.4%, respectively (Figure 2). Figure 2. Functional classification of MIP proteins. (A) Representation of MIP proteome based on the similarity of its proteins with COG database (27). (B) represents distribution in cell processing and signaling category (CPS), (C) denotes distribution of poorly characterized proteins in MIP while (D) and (E) stands for information storage and processing (ISP) and ‘metabolism’ related genes, respectively. It is evident that ∼42% of total MIP genes are involved in basic metabolic functions and ∼21% do not have any homology in COG database. Within ‘metabolism’ category, the genes involved in lipid transport and metabolism (I) are over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, majority of the proteins are related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%). 
Comparative proteome analysis of MIP with other species reveals the role of genomic fluidity in habitat diversification in Mycobacterium COG-based comparative analysis of gene distribution across mycobacterial proteomes highlights the presence of distinct genome fluidity. ‘ISP’ and ‘Metabolism’ proteins vary considerably with the maximum flexibility being observed in replication, recombination and repair (L), lipid transport and metabolism (I) and secondary metabolites biosynthesis and transport (Q), respectively (Figure 3). The minimum variations are observed in ‘CPS’ with nearly all sub-categories exhibiting a consistent representation. In ‘ISP’, the distribution of genes across all mycobacterial proteomes is almost consistent for translation, ribosomal structure and biogenesis (J) and chromatin structure and RNA processing (B), while a clear genomic fluidity is exhibited by the genes belonging to replication, recombination and repair (L). This category is least represented in MIP (3%) and maximally in M. ulcerans (10%). Similarly, the genes belonging to category K (transcription) are least represented in CDC1551 (5%) and maximally in M. smegmatis (9.4%), which is consistent with its saprophytic habitat. Figure 3. Comparative analysis of distribution of different mycobacterial proteomes under various COG functional categories. Different mycobacterial proteomes were downloaded from NCBI and subjected to COG-based BLAST analysis. The contribution of each functional category was calculated to observe the pattern of relative gene distribution across different mycobacterial species and plotted on this graph. (A) Distribution across ‘Metabolism’ category and various sub categories, (B) cell processing and signaling (CPS) and (C) information storage and processing (ISP). ‘X’ and ‘Y’ axis represent mycobacterial species and the number of mycobacterial proteins (in percentage), respectively. 
Our comparative analysis clearly highlights the presence of distinct genome fluidity in mycobacterial species across different functional groupings of genes. This genomic fluidity within different functional groups of proteins may contribute to the habitat diversification observed in mycobacterial species. In ‘Metabolism’, while the genes related to nucleotide transport and co-enzyme transport show a consistent distribution, the genes belonging to secondary metabolite biosynthesis and transport (Q), amino acid transport (E) and lipid transport and metabolism (I) show major quantitative variations. ‘I’ has the maximum representation in MAC like MAH (∼11%), MAP (10%) followed by MIP (9.5%) while ‘E’ and ‘Q’ are best represented in M. smegmatis (9.5%) and MAC organisms (9–10%), respectively (Figure 3). In case of carbohydrate transport and metabolism (G), all mycobacterial species have almost an equal representation except M. smegmatis, which harbors almost twice (7%) the percentage of genes dedicated for this function in other mycobacterial species. In most COG categories, M. leprae seems to have a distinctly biased distribution of proteins probably indicative of the extensive gene-loss that the organism has undergone during evolution (21). Of all the mycobacteria, MIP has the least representation in L (3%) and E (amino acid transport) (4.3%) categories of genes. Although the distribution of genes is a species-specific attribute, variations in gene distribution across different lineages could provide an idea about the role of genomic fluidity in shaping the behavior of mycobacteria as saprophytes or host-adapted pathogens. Hence, to get a comprehensive picture of habitat transformation, mycobacterial species were classified in two groups according to their known attributes: pathogenic (PGN) comprising M. tuberculosis complex (including M. marinum and M. ulcerans) and M. avium complex and saprophytic or environmental (ENV) mycobacteria comprising M. 
smegmatis, M. vanbaalenii, M. gilvum and others. MIP was placed in between saprophytic and pathogenic mycobacterial species because of its unique intermediate position and these two groups were investigated for effect of gene variations in different COG classes with MIP as a common background (4). A two-way ANOVA was performed to ascertain the statistical significance of the analysis. While the transition from ENV-MIP was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function) [P < 0.001, 0.01, 0.001, 0.01, 0.001 and 0.001, respectively], ENV–PGN transitions involved extensive gene variations (Figure 4). In addition to the gene reduction observed in the earlier mentioned classes, reduction was also noticed in genes related to energy metabolism (C, G, I and Q [P < 0.05, 0.05, 0.001 and 0.05, respectively]) and a significant increase in L (replication, recombination and repair, P < 0.01) and N (cell motility and secretion, P < 0.01) related genes in ENV-PGN transition. Noticeably, the habitat change from MIP to PGN lineages was primarily due to the loss of genes involved in I (lipid transport and metabolism, P < 0.001) and Q (secondary metabolite biosynthesis and transport, P < 0.001) and gain of genes in L, E and S [P < 0.001, 0.001 and 0.05, respectively] (Figure 4). This observation is consistent with a reduced habitat diversity of pathogenic mycobacteria and points toward the role of genomic fluidity within selected gene functions in habitat specification. An increase in the representation of ‘L’ with the advent of pathogenicity offers an interesting paradigm, which warrants further studies in the model organisms. Figure 4. Quantitative analysis of gene variations involved in habitat transformation in mycobacteria. 
This cartoon depicts variations across major functional gene groupings as mycobacterial species adapted to a pathogenic lifestyle from free-living environmental mycobacteria. Red lines denote loss of genes while the green ones denote the gene gain with a change of habitat. Although the transition from ENV-MIP was associated with a significant reduction restricted to a few COG classes i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function), ENV–PGN transitions involved extensive gene variations and are consistent with the intermediate evolutionary position of MIP. Significantly, only two major gene categories reported gain of genes associated with the advent of pathogenicity: L (DNA replication, recombination and repair) and E (amino acid transport and metabolism). But transition from purely saprophytic lineage to pathogenic habitat is associated with genes categorized into ‘L’ only, which also contains transposon elements. Indeed, we observe that a saprophytic mycobacterium like MIP has only 38 transposon-like elements compared with 302 found in the similar-sized pathogenic mycobacterial species M. ulcerans. A two-way ANOVA was used to ascertain statistical significance. Role of natural selection in speciation in Mycobacterium Measurement of the rate of non-synonymous (leading to change in amino acid) and synonymous (silent) nucleotide substitutions in protein-coding DNA sequences is the most referred criterion for detecting natural selection in molecular evolutionary analysis (46). Significantly higher non-synonymous nucleotide substitutions (Ka) over the synonymous (Ks) ones are interpreted as an evidence of positive natural selection. Hence, to understand the contribution of selection in speciation, we have used closely related and phylogenetically independent species of M. avium complex of which MIP is a predecessor (19). 
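How a Ka/Ks ratio is read can be made concrete with a very small sketch. It does not estimate Ka or Ks (that is the job of a tool such as KaKs_Calculator); it only interprets two already-estimated rates. The `tolerance` band around 1 used to call a ratio "neutral" is an assumption introduced here for illustration, as is the function name.

```python
from typing import Tuple


def selection_regime(ka: float, ks: float,
                     tolerance: float = 0.05) -> Tuple[float, str]:
    """Return the Ka/Ks ratio and a coarse label for the mode of selection.

    ratio > 1 : excess of non-synonymous changes (positive selection)
    ratio < 1 : deficit of non-synonymous changes (purifying selection)
    ratio ~ 1 : no detectable constraint (neutral), within +/- tolerance
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form a Ka/Ks ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return ratio, "positive"
    if ratio < 1 - tolerance:
        return ratio, "purifying"
    return ratio, "neutral"
```

In practice the label should only be trusted for ortholog pairs whose Ka and Ks estimates pass a significance test, since ratios computed from very few substitutions are unstable.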
Orthologs were identified using Inparanoid tool (29) and dataset of ∼2600 gene pairs representing >80% of the orthologs shared among different species of MAC was obtained to perform comparative analysis of the rate of selection for human-adapted (MIP–MAH) and animal-adapted (MIP–MAP) niches from a saprophytic MIP. The evaluation of rate of selection (Ka/Ks) revealed strong purifying selection (∼0.06) acting on both human and animal adapted lineages. However, further resolution of analysis based on protein function revealed a significant difference in the rate of selection only for the genes involved in energy production and conversion (C) (P < 0.03, unpaired t test) (Figure 5). Also, very few genes, mostly distributed in metabolic pathways, were found to have undergone strong positive selection (Supplementary Table S2) suggesting their relevance in undergoing niche-specific adaptations in Mycobacterium. A very strong positive selection (>50 times of average rate) was observed in ComEC (MIP2580) (47), the competence protein required for exogenous DNA uptake during natural transformation, which can critically influence the ability to acquire foreign DNA in microbial species. Incidentally, we found this gene to be pseudogenized in MAP, which usually results from an excessive positive selection. In a recent study (48) based on SNP analysis, it was argued that recombination may influence the rate of selection in extremely closely related species of M. tuberculosis complex (average nucleotide identity >98% across different species). Even though MIP is likely to have minimal homologous recombination events because of sequence heterogeneity with MAP and MAH, the likelihood of recombination and lateral gene transfer influencing the rate of selection cannot be completely discounted. Figure 5. Role of natural selection in speciation in MAC. 
Analysis of average rate of natural selection (Ka/Ks) among MIP–MAP and MIP–MAH lineages revealed the presence of a similar purifying selection (46). This implies that both mycobacterial lineages have undergone an independent evolution into their respective host adapted forms from MIP. However, a significant skew in selection rate (Ka/Ks) is observed in genes categorized into energy production and conversion (C), and thus establish the role of metabolism-related genes in the evolution of host tropism. Also, a strong positive selection (>50 times of average selection) was observed on ComEC gene that encodes a competence protein required for DNA uptake and natural transformation (47). Such a strong selection on this gene indicates that ComEC has played an important role in modulating the efficiency of DNA uptake during mycobacterial evolution. In fact, in the case of MAP, this gene is found to be pseudogenized, which usually results from an excessive positive selection. Identification of laterally transferred genes reveals massive gene acquisitions and mosaic architecture of MIP genome Identification of laterally acquired genes is an important paradigm, which is cardinal to gain a deeper insight into microbial evolution (49). Hence, after analysing the role of natural selection in speciation, we were keen to analyse the contribution of lateral gene acquisitions in MIP. The precise and accurate prediction of lateral gene transfer (LGT) events in an organism is challenging. First, detection of LGT may be influenced not only by source, size and quantity of lateral transfer but also by the genetic features associated with the recipient or host genome (50). Besides, LGT takes place by a variety of means and different tools may be required for better detection of LGT based on specific mechanisms of gene transfer (51). It is also known that different surrogate methods detect lateral acquisitions of different antiquities (52). 
Hence, not all LGT events are amenable to detection by a single parametric method, and the application of a combination of different methods is recommended to improve sensitivity of detection in different possible situations (50). However, while the simple addition of predictions from individual methods may increase false-positive rates, the consideration of strictly overlapping predictions as the inclusion criterion for LGT predictions is counterproductive because of the limited overlap of genes observed between different approaches (52,53). Nonetheless, it has been argued that even if the errors inherent to these individual methods are added, the overall benefit is worthwhile (50). Hence, to predict laterally acquired genes in MIP, we used three different parametric methods based on anomalous GC content of each ORF, genomic signature analysis and Alien Hunter predictions to score for likely LGT candidates. The genes were scored as laterally acquired only if they were predicted by more than one method. This would not only provide sensitivity of detection but also reduce the number of false-positive predictions associated with individual methods. We further augmented our analysis by using information on phylogenetic approaches and phyletic distribution of MIP genes in other mycobacterial species as an additional stand-alone criterion to score LGT genes (54). Analysis of atypical GC content of ORFs (34) identified ∼28.5% (1503/5270) of MIP genes as putative candidates for LGT. A similar analysis with M. tuberculosis (MTB), M. avium subsp. paratuberculosis (MAP) and M. avium subsp. hominissuis (MAH) could only identify 4.3, 6.5 and 11.3% genes, respectively (55). The genomic signature approach could identify ∼33% of MIP genes as candidate LGTs, as compared to MTB, MAP and MAH, wherein this approach yielded only 6, 6.3 and 9.3% genes, respectively. Alien Hunter (32) predicted 85 probable laterally acquired regions (AL) comprising 1298 (24.63%) ORFs (Supplementary Table S3). 
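The ‘majority’ inclusion rule described above (a gene counts as laterally acquired only when more than one of the three parametric methods predicts it) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name, the method labels and the gene identifiers are hypothetical.

```python
from collections import Counter


def majority_lgt_calls(predictions: dict[str, set[str]], min_votes: int = 2) -> set[str]:
    """Keep genes predicted by at least `min_votes` of the supplied methods.

    predictions maps a method name to the set of gene IDs it flagged.
    """
    votes = Counter(gene for genes in predictions.values() for gene in genes)
    return {gene for gene, n in votes.items() if n >= min_votes}


# Hypothetical gene IDs and method outputs, for illustration only.
calls = majority_lgt_calls({
    "gc_content": {"geneA", "geneB"},
    "genomic_signature": {"geneB", "geneC"},
    "alien_hunter": {"geneC", "geneD"},
})
```

Here `geneB` and `geneC` are retained because two methods flag each of them, whereas `geneA` and `geneD` are dropped. Setting `min_votes=1` would correspond to the simple union of predictions, and `min_votes=3` to the strict overlap that the text argues is too conservative.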
The regions around ‘ori’ (15 kb on both sides, upstream as well as downstream) and one harboring ribosomal genes were excluded from the evaluation to remove any possible bias. By using similar criteria with Alien Hunter, however, we could identify putative laterally acquired genes in MTB (21.5%), MAP (15.35%) and MAH (24.2%). More than 42% of MIP genes predicted by Alien Hunter are also shared by genomic signature analysis. A similar analysis with MTB, MAP and MAH showed an overlap of 17% (148/865), 31% (215/678) and 33.7% (421/1247), respectively, between Alien Hunter–predicted genes and genomic signature–based predictions. After applying our ‘majority’ based inclusion criteria, ∼6.2% of the genes emerged as laterally acquired in MTB, while MAP and MAH have 8.3 and 10.2% genes as LGT, respectively. By using this approach, ∼34% of MIP genes emerged as laterally acquired, which is significantly higher than in other mycobacterial species analysed in this study. A comparative analysis of MIP ORFs based on their restricted distribution within the other mycobacterial genomes identified additional 939 ORFs as plausible lateral acquisitions. This included 362 ORFs harbored by 93 defined RRDs (regions of restricted distribution of genes) in MIP (Supplementary Table S4) and 261 ORFs present in alien regions. The incongruence observed in the phylogenetic analysis of some of these genes substantiated their laterally acquired nature. Overall, 50.5% (2664/5270) of MIP ORFs appear to be laterally acquired highlighting thereby the scale of evolutionary novelties undergone by this microbe (Figure 6). This study represents the first report of such massive gene acquisitions in mycobacteria and suggests mosaic architecture of MIP genome. Figure 6. Depiction of lateral gene acquisitions in MIP. Each column depicts one MIP gene and each row depicts one mycobacterial genome (total 18 genomes comprising of M. tuberculosis complex, M. 
avium complex and saprophytic mycobacterial species–see Materials and Methods and Supplementary Table S1 for species list); green and red denote presence and absence, respectively, of the MIP gene in other genomes. Pink denotes the regions predicted by Alien Hunter (32). Dark yellow represents RRDs, while black columns denote recently acquired genes identified by using atypical gene content and gene signatures (34, 35). Orange arrow denotes position of tRNA molecules, blue denotes genes with homologs in genomes other than mycobacteria, while brown denotes absence in the COG database. A very good overlap is observed between genes identified by using different methods. Most of the alien regions and RRD’s overlap with red, substantiating the effectiveness and accuracy of our approach. It is noteworthy that over 50% of MIP genome has emerged as laterally acquired, the highest reported so far for any of the mycobacterial species. The figure is scaled to approximation with each figure row denoting 1 Mb of genome and every tick mark denoting 100 kb along the lane. Analysis of laterally acquired genes by using COG functional classification revealed the maximum gain in lipid transport and metabolism category (I) followed by transcription (K)-related genes, which are usually under-represented among laterally acquired genes in prokaryotes (56). This was followed by the genes affiliated to secondary metabolites biosynthesis, transport and catabolism (Q) and energy production and conversion (C); these four categories together constitute ∼35% of the total lateral acquisitions. In addition, the LGT predictions based on atypical GC content of ORFs (34) and further validated by genomic signatures 1478 (28.05%) (Figure 7A) appear to retain their native genomic imprints, which are yet to be masked by natural selection. This points toward their relatively recent acquisition and hence, their likely source could be ascertained (33). 
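The Figure 7 legend describes the genomic signatures as distributions of tetranucleotide frequencies computed for individual genes and for the whole genome. A minimal sketch of that idea is given below. It is an assumption-laden illustration rather than the method of refs (33,35): the function names are hypothetical, the L1 distance is chosen only for simplicity, and reverse-complement strands and sequence-length effects are ignored.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq: str) -> list[float]:
    """Relative frequency of each tetranucleotide in a sliding window of 4."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)  # ignores windows with N etc.
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def signature_distance(gene_seq: str, genome_seq: str) -> float:
    """L1 distance between two tetranucleotide profiles (0 = identical)."""
    return sum(abs(g - w) for g, w in zip(tetra_freqs(gene_seq), tetra_freqs(genome_seq)))
```

Under this sketch, a gene whose profile lies unusually far from the whole-genome profile would be a candidate for recent acquisition, and comparing its profile with those of other genomes would suggest a likely donor.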
Analysis based on genomic signatures revealed that majority of these recently acquired genes (∼85%) are most likely derived from actinobacterial species (Figure 7B) like Streptomyces (∼25%), Amycolatopsis (∼15%), Rhodococcus (7.5%) and Frankia (6.5%). These gene acquisitions might have been mediated by physical proximity and close interactions among different actinobacteria. Figure 7. Identification of recent lateral gene acquisitions in MIP and their analysis. (A) Individual gene signatures of recently acquired MIP genes along with the whole genome signature of MIP establish the alien nature of respective genes (33). These gene signatures are based on the frequency of distribution of tetranucleotide pattern across the whole genome and individual genes of MIP, which are color coded to generate a visual impression. (B) Distribution of recently acquired genes with respect to their most likely source of acquisition. Based on the genomic signatures, our analysis revealed that majority of these recently acquired genes (∼85%) are possibly derived from actinobacterial species. Mobile elements based gene acquisition in MIP are dominated by plasmid-mediated lateral gene transfers LGT events are usually mediated by mobile genetic elements like phages, transposons and plasmids. BLAST analysis of laterally acquired genes against ACLAME (36), a database of mobile elements comprising all known phage genomes, plasmids and transposons, indicated mobile elements as likely source to 27.4% of these putative laterally acquired ORFs (<e−20). Majority of these genes exhibit similarity with plasmids and extremely small fraction with phages (2%) and IS elements (∼1.2%). The relative paucity of phage and IS elements mediated gene acquisitions and abundance of plasmid-acquired genes in case of MIP is surprising. In comparison with other mycobacterial species of similar sized genomes such as M. 
ulcerans (chromosome size ∼5.6 Mb), which has 302 IS elements/transposons (23), MIP has merely 38 genes harboring sequences consistent with IS signatures. This is consistent with our earlier analysis based on genomic fluidity across different mycobacterial lineages, where variation in the number of transposable elements, which are classified in category L, was found to be associated with habitat diversification. It is tempting to envisage that MIP may harbor specific genomic determinants that either provide immunity from phages and transposons, or else predispose MIP toward plasmid-based gene acquisitions. MIP has a relatively higher number of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat) elements compared with other mycobacterial genomes (7 as compared with 1–2 in other mycobacterial genomes) (37). These CRISPR elements not only provide immunity against invasion by phages and viruses (57) but also limit the mobility of IS elements in the genome and help in their excision from the genome (58). MIP also lacks the RD1 region, the loss of which facilitates efficient conjugation with plasmids and other chromosomes to promote rapid acquisition of genes (59). In addition, we found that MIP is particularly enriched in transporters of the septal DNA translocator family (6 as against 1–2 present in other mycobacteria) (Table 3), which are known to bring about rapid acquisition of genes by mediating cell-to-cell DNA transfer during plasmid conjugation (60). The abundance of these genomic determinants coupled with the absence of the RD1 locus may contribute to the propensity of MIP towards plasmid-mediated gene acquisitions.

Table 3. Comparative transporter analysis of MIP with other mycobacterial species

MSMEG, M. smegmatis; MAP, M. avium subsp. paratuberculosis; MTB, M. tuberculosis. 
Effect of lateral gene acquisitions on different gene families in MIP Two distinct observations have emerged from this analysis: (i) a large number of lateral gene acquisitions in MIP have been mediated through mobile elements with only a small contribution through phages and (ii) gene distribution among laterally acquired regions in MIP follows a skewed pattern with respect to function as indicated by over-representation of genes belonging to certain categories. To know the influence of LGT events on distribution of genes across MIP gene families, we performed a comprehensive analysis that showed that CYP450 is the largest gene family in MIP with 66 members and a gene density of ∼12/Mb. This is remarkably high in comparison with other mycobacterial species such as M. tuberculosis (4.5/Mb), M. smegmatis (5.6/Mb) and M. marinum (7.1/Mb). The analysis of Cytochrome P450 database (http://drnelson.uthsc.edu/CytochromeP450.html (15 August 2012, date last accessed)) revealed that MIP harbors the highest number of genes from CYP450 family among prokaryotes sequenced so far and ∼ 46% (30/66) of these genes are laterally acquired. Approximately 27% of these genes are recent acquisitions, suggesting a recent expansion of CYP450 family. Approximately 11% of CYP450 genes were identified as unique by International CYP450 nomenclature commission and have been classified into three new families and two new sub-families of CYP450 (Table 4). The context-based analysis based on gene neighborhood suggested the role of these genes in the utilization of unusual carbon sources, a key to adaptability and survival of MIP at its most likely habitat at soil–water interface (19). Table 4. List of CYP450 ORFs unique to MIP aNomenclature as per International Committee of CYP450 nomenclature. ‘A1’ refers to the first member of a new CYP450 family, whereas ‘B1’ refers to the first member of a new CYP450 sub-family. 
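The gene density quoted for the CYP450 family can be checked with simple arithmetic. The sketch below assumes the ∼5.6 Mb genome size stated in the summary of this article; the function name is hypothetical.

```python
def gene_density_per_mb(gene_count: int, genome_size_mb: float) -> float:
    """Number of genes of a family per megabase of genome."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return gene_count / genome_size_mb


# 66 CYP450 genes in a genome of about 5.6 Mb (size taken from the article's summary).
cyp450_density = gene_density_per_mb(66, 5.6)
```

The result, about 11.8 genes per Mb, is consistent with the reported value of ∼12/Mb.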
MIP has 66 genes of PE–PPE family (16 PE and 50 PPE), which encompass complete repertoire of PPE genes present in various MAC species. These PPE genes appear to be selectively inherited by various species of MAC as evident from their distribution in MAP (36 genes), MAH (38 genes), M. avium subsp. avium (36 genes) and M. intracellulare (38 genes), respectively (61). This observation substantiates that evolutionarily MIP is a predecessor of MAC and endorses our earlier findings based on the rate of natural selection that speciation and habitat diversification has taken place independently from MIP. Comparative analysis of PE–PPE genes in MIP (with Nr database at NCBI) highlighted that five genes of PPE–SVP sub-family are unique to MIP. In addition, several of its PE genes are evolutionarily closer to those belonging to M. tuberculosis complex, which is in agreement with the shared evolutionary history of MIP and predecessors of M. tuberculosis complex (19). The presence of PPE gene family is unique to mycobacteria among prokaryotes; however, its origin remains unknown. A large number of these genes are laterally acquired in MIP prompting us to speculate that PE–PPE genes might have been introduced into mycobacteria through mobile elements. Majority of PE–PPE gene clusters in MIP harbor genes related to mobile function activity such as phages, tRNA or 13e12 repeats in their vicinity. Besides, several of these PE–PPE genes exhibited the presence of Ig-like motifs often present in the proteins of tailed double stranded DNA bacteriophage particles. However, the most clinching evidence about the origin of these PE–PPE genes in mycobacteria emerged from the presence of a PPE protein containing intact prophage of ∼40 kb that we observed in MAH genome during ACLAME analysis (36), a direct evidence for a phage-mediated acquisition of PE–PPE family members. 
Hemerythrins: a versatile gene family laterally acquired in MIP

The most surprising finding of MIP gene analysis, however, is the unusual presence of ORFs belonging to the hemerythrin (Hr) protein family, the oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids. MIP has 10 ORFs having significant similarity to hemerythrin (Hr) genes (Figure 8) in comparison with one or two copies in most prokaryotes (62). The prevalence of ‘Hr’ proteins strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at the oxic–anoxic interface (OAI) and in the anoxic regions of the marine habitat or both, as reported in the case of magnetotactic bacteria, which are also endowed with an abundance of hemerythrins (62).

Figure 8. Distribution of hemerythrins in MIP. Domain mapping and BLAST searches indicated the presence of 10 ORFs belonging to hemerythrin genes in MIP. Hemerythrin proteins (Hr) are oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids, and they usually have only 1 or 2 copies in most prokaryotes (62). The abundance of these genes strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at the oxic–anoxic interface and in the anoxic regions of the habitat or both, as reported in the case of magnetotactic bacteria, which are also endowed with an abundance of hemerythrins (62,63). The ability of hemerythrins to reversibly bind oxygen at higher oxygen concentrations and release it in anoxic conditions could also provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, has been shown to grow at 0% oxygen, to reach a plateau in 3 days and die thereafter (64). MIP2918 and MIP2747 did not harbor a definite ‘Hr’ domain and are identified by BLAST searches against the NCBI ‘Nr’ database.

Domain mapping and comparative genomics with available mycobacterial genomes indicated the presence of two Hr domains in several MIP ORFs such as MIP2750, MIP2759, MIP5034 and MIP6380, a trait usually restricted only to proteobacteria (62). We could also identify putative homologs of ‘hr’ genes in mycobacteria such as M. marinum and M. ulcerans (1 each), M. tuberculosis (3), M. gilvum and M. smegmatis (4 each), M. avium complex (5), other environmental mycobacteria like Mycobacterium sp. KMS, Mycobacterium sp. JLS and M. vanbaalenii (3 each) and none in the case of M. leprae. Considering the variations in the number of ‘hr’ genes in various species of mycobacteria and their significant sequence heterogeneity, it appears that acquisition of hemerythrins could have been a selective and independent event facilitating mycobacterial evolution. Incidentally, 90% of these ORFs in MIP are laterally acquired, with over one-half of them being recent acquisitions. It should be noted that the efficiency of hemerythrins as oxygen storage proteins is directly dependent on the oxygen concentration in the surrounding environment (63). The selective enrichment of MIP with Hr proteins and the ability of hemerythrins to reversibly bind oxygen at higher oxygen concentrations and release it in anoxic conditions could provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, can manage to grow at 0% oxygen, reach a plateau in 3 days and die thereafter (64).

Membrane transporters in MIP

Transport systems play a critical role in life-sustaining processes such as metabolism, metal homeostasis and secondary metabolite production, thereby affecting the physiology and lifestyle of the organism. In MIP, a total of 222 genes were annotated as membrane transporters, comprising ∼4.2% of the total gene content and a transporter density of 39.73/Mb (Table 3). 
This is an apparent reflection on the unique evolutionary position of MIP as its transporter density is significantly lower than that of saprophytic M. smegmatis (60.43/Mb) and higher than M. tuberculosis (33.64/Mb), MAP (35.2/Mb) and M. leprae (17.2/Mb). It is likely that M. smegmatis and MIP, being saprophytic in nature need extensive transport machinery to support their life style, whereas intracellular organisms owing to their relatively stable environment have a reduced transporter requirement. Comparative analysis revealed a selective abundance of transporters belonging to septal DNA Translocator family (S-DNA-T) with distinct homology with FtsK/SpoIIIE proteins, which could primarily be responsible for high propensity of MIP toward plasmid conjugation and gene acquisitions (60). Another unique seven gene cluster of CPA3 (Proton Antiporter-3) family present in RRD79 (Supplementary Table S4), the largest region with restricted distribution, is responsible for species defining ability of MIP to grow on 5% NaCl (4). The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila (Figure 9) that can grow in high salt concentration (65). Along with Mn2+ transporters, MIP also possesses an abundance of catalases (4) and superoxide dismutases (5), some of them being laterally acquired, which mitigate oxidative stress and reflect not only upon the primitive origin of MIP but also equip it for intracellular adaptations. MIP can withstand the carbon starvation as evident from our experiments on nutritional stress. After 5 days of growth in PBS without any media or nutritional supplement, MIP exhibited no significant reduction in log CFU, reflecting upon its potential to undergo longer period of starvation. Thus, MIP appears to have fine-tuned its specific transport abilities by lateral gene acquisitions to gain physiological attributes required for its unique habitat. Figure 9. Phylogenetic analysis of CPA3 family cluster. 
This cluster is unique to MIP among mycobacterial species and each ORF of this complex encodes different subunits of a unique Na+/H+ antiporter. All genes have been laterally acquired as a unit, hence, a representative single gene is used to perform phylogenetic analysis by using maximum likelihood method available in Phylogeny Fr. Server (43). The numbers along the branches denote bootstrap values. The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila that can grow under salt conditions (3% NaCl w/v) (65). Genome-enabled ecophysiological and metabolic attributes of MIP and influence of LGT events The presence of a very high number of alternate sigma factors (24 in comparison with 13 in M. tuberculosis and MAP) endows MIP with a complex transcriptional flexibility necessary for it to respond to its unique life style (66). The interface life style (soil/water) of MIP, as substantiated by its genetic features, prompted us to look for the bio-degradative capabilities of MIP. In addition to the abundance of CYP450 genes, MIP has also laterally acquired homologs of 3-octaprenyl-4-hydroxybenzoate carboxy-lyase that are involved in anaerobic metabolism of phenol during degradation of plant substrates (67). Besides, as enlisted in Supplementary Table S5, it also possesses complete cyanide and thiocyanate biodegradation machinery including the complete enzyme complex (MIP3820-22) of thiocyanate hydrolase (alpha, beta and gamma subunits). This complex degrades thiocyanate and produces CO2 and NH3 that can be used by MIP as a nitrogen source (68). Notably, thiocyanate gene cluster is absent from all the pathogenic mycobacteria analyzed in this study (including M. abscessus and the opportunists of MAC). 
Although the ability of MIP to degrade different compounds and utilize diverse sources of carbon is evident, the presence of an intact hydrogenase enzyme complex (Table 5) provides evidence for its chemolithotrophic nature, even though further research is required to establish the functionality of this complex. The loss of hydrogenases concurs well with the advent of pathogenicity in mycobacteria across different lineages, an observation corroborated by previous studies on mycobacterial hydrogenases (69). Loss of the accessory protein coding genes, which are required for the maturation and assembly of the hydrogenase complex as well as integration of different metal ions, renders this complex non-functional in the immediate descendants of MIP (i.e. MAC species).

Table 5. List of MIP ORFs encoding hydrogenase gene cluster

In addition to these unique metabolic characteristics of MIP, fundamental differences were observed in the organization of lipid metabolic machinery, which is cardinal to the physiology and behavior of mycobacterial species (70). Although the genetic machinery required for synthesis and modification of mycolic acids is present in MIP (70), a major reshuffle is observed in the methoxy mycolic acid synthase gene operon. Sequence analysis further suggested the absence of papA5, a gene encoding polyketide-associated protein (Pap) required for the synthesis of virulence-associated phthiocerol dimycocerosate (PDIM) (71). These traits are in agreement with the observation that members of MAC do not synthesize PDIMs. Further analysis demonstrated the presence of a glycopeptidolipid (GPL) biosynthesis locus, which is a hallmark of antigenic diversity in MAC and appears to be laterally acquired in MIP. This gene cluster in MIP harbors an ORF (MIP4595) sharing significant similarity with the ‘gsc’ gene of MAP. This gene constitutes a pathogenic island in pathogenic mycobacteria including M. tuberculosis (72). 
However, a comparative analysis of this locus with other MAC sequevars revealed the interruption of this locus by a six-gene cluster exclusive to MIP with four of these six genes being transposable elements. Thus, this GPL locus acts as a hotspot for transposon integration and is likely to play an important role in MIP’s unique biological attributes by influencing GPL biosynthesis. Non-pathogenic attributes of MIP and immunome analysis MIP as discussed before is non-infectious in mouse, guinea pig and monkey models (2,6,17,73). However, investigation of MIP against PAIDB (74), the pathogenic islands database identified the presence of three regions in MIP with genomic attributes similar to PAGI islands of Pseudomonas aeruginosa. These included a gene cluster (MIP227–MIP247) similar to PAGI 1 pathogenic island of P. aeruginosa isolated from a patient with a urinary tract infection (39); a PAG3 like genomic island (MIP272–MIP283) and another region homologous to PAI (MIP333–MIP343), a region similar to tcd (toxin complex D) island of Photorhabdus luminescens (75). A comparative analysis of MIP proteins with virulence factors database (VFDB), a comprehensive compilation of all known virulence factors, also revealed the presence of most of the genes in MIP that are reportedly associated with virulence in other mycobacterial species (40). Pathogenesis is a multi-factorial phenomenon that requires pathogen to attach, infect, sustain, proliferate and eventually disseminate itself inside the host. Hence, the loss of a component responsible for any of these functions is likely to result in the attenuation of virulence or pathogenicity. Thus, despite having PE-PPE genes and mce1 operon, which enable mycobacteria to invade the host cell, MIP lacks both mce2 and mce3 operons, which are essential for causing macrophage infections by M. tuberculosis and M. avium (76–78). The mce3 as well as mce2 mutants of M. 
tuberculosis are attenuated in mice although the latter shows no growth defect in macrophages. The mce2 mutant of M. tuberculosis elicits an altered immune response and exhibits no lung pathology along with enhanced survival in mice (76–78). Likewise, MIP lacks phospholipase (plc) ABCD genes, which are responsible for acquiring host fatty acids for their use as a potential carbon source during persistent infections both in tuberculous and non-tuberculous mycobacterial infections (79). Another factor crucial for mycobacterial pathogenicity is associated with the presence of latency-related genes that confer on mycobacteria the ability to survive and grow in microaerophilic environment for prolonged period of time. The devS/devR two-component system, essential for maintenance of dormant state in low oxygen conditions, is conspicuous by its absence in MIP (79). In addition to RD1 locus and toxin–antitoxin system, in silico studies further identified MIP as a natural mutant of anthranilate phosphoribosyltransferase gene trpD, which is involved in tryptophan biosynthesis (81,82). The absence of these critical determinants may severely compromise MIP’s ability to survive inside the host as the infection with MIP has been found to be self-limiting and clears off within 6–7 weeks (17). The limited survival of MIP in low oxygen inside macrophages despite the absence of devS/devR two-component system can be attributed to the prevalence of ‘Hr’ proteins. In silico analysis revealed a much higher fraction of putative antigenic proteins in MIP in comparison with BCG (Figure 10), and a majority among them being contributed by lateral acquisitions emphasizing the importance of LGT events in augmenting its immune potential. Besides, the significant sequence heterogeneity observed between MIP and M. tuberculosis proteins (as mentioned earlier) would render MIP proteins acquiescent to generate novel T-cell epitopes resulting in an enhanced immune response. 
Our analysis revealed that of the 36 proteins shared by MIP and M. leprae, which were absent in M. bovis BCG, 29 were highly immunogenic in nature (Table 6). The most prominent putative antigenic proteins were MIP0340 and MIP5962, both of which belong to the Hsp20 family and share a close similarity with the 18 kDa small heat shock protein of M. leprae (83). This protein bears several T-cell epitopes and generates a CD4+ T-cell mediated immune response, a hallmark of protection against tuberculosis. Similarly, MIP7697 is a homolog of M. leprae MLep2649, which encodes a protein with excellent T-cell stimulating properties to which more than 60% of tuberculosis patients respond (84). The presence of such immunodominant and productive antigens in MIP may potentiate the expression of an antigenic profile better than BCG against M. tuberculosis infection.

Figure 10. Comparative analysis of immunomes of MIP and BCG and contribution of LGT events. In silico immunome analysis of MIP and its comparison with BCG revealed the presence of a greater number of antigenic proteins in MIP (41). This may contribute to the unique potential of MIP for immunomodulation against various types of infections. Notably, a significant proportion of these immunogenic proteins appear to be laterally acquired in MIP.

Table 6. List of MIP ORFs shared between MIP and M. leprae and absent from BCG

a As predicted by in silico analysis of MIP proteins by VAXIJEN software at default parameters (41).

In summary, different analyses performed in this study establish that MIP represents an organism at a unique phylogenetic point as the immediate predecessor of opportunistic mycobacterial species of MAC. It is also evident that natural selection in MAC has acted in a preferential manner on specific categories of genes, leading to reduced habitat diversity of pathogenic bacteria and thus facilitating host tropism. 
The genome of MIP is ∼5.6 Mb in size and is shaped by a large number of lateral gene acquisitions thus revealing, for the first time, mosaic architecture of a mycobacterial genome. Thus, this study offers a paradigm shift in our understanding of evolutionary divergence, habitat diversification and advent of pathogenic attributes in mycobacteria. A scenario for mycobacterial evolution is envisaged wherein the earliest evolving soil derived mycobacterial species like MIP underwent massive gene acquisitions to attain a unique soil–water interface habitat before adapting to an aquatic and parasitic lifestyle. These lateral acquisition events were selective and possibly facilitated by the presence of specific genetic factors (i.e. ComEC) that induce competence to acquire large chunks of DNA to confer immediate survival advantage to the recipient organism. The genes, such as members of ‘Hr’ family, acquired to assist mycobacteria survive in fluctuating oxygen levels, would have been instrumental in the initial advent of pathogenicity in the aquatic opportunistic mycobacterial species. Subsequently, mycobacterial species tuned their genetic repertoires to respective host adapted forms with a high degree of genomic fluidity aided by selective lateral gene acquisitions and gene loss by deletion or pesudogenization (19). Importantly, a significant increase in transposon elements in the pathogenic mycobacteria as compared with MIP, for the first time, suggests their possible role toward mycobacterial virulence and would be interesting to explore. In addition, comparative genomic analysis revealed a higher antigenic potential of MIP subscribing to its unique ability for immunomodulation against various types of infections and presents a template to develop reverse genetics based approaches to design better strategies against mycobacterial infections. ACCESSION NUMBERS MIP genome has been submitted to the genome depository at NCBI (accession no. CP002275). 
SUPPLEMENTARY DATA Supplementary Data are available at NAR Online: Supplementary Tables 1–5 and Supplementary Figure 1. FUNDING The MIP genome sequencing program was funded by the Department of Biotechnology, Government of India. V.S. acknowledges the Council of Scientific and Industrial Research (CSIR), New Delhi, for the award of a research fellowship. Akhilesh K. Tyagi, Anil Kumar Tyagi and S.E. Hasnain are thankful to the Department of Science and Technology, Government of India, for J.C. Bose National Fellowships. S.E.H. is a visiting professor at King Saud University, Riyadh, Kingdom of Saudi Arabia, and J.P.K. is a Tata Innovations Fellow. Funding for open access charge: University of Delhi, India. Conflict of interest statement. None declared.
Section	INTRODUCTION Mycobacterium indicus pranii (MIP) is a saprophytic mycobacterial species that is known for its immunomodulatory properties (1–11). In late 70s, this bacterium, initially coded as Mycobacterium ‘w’, was selected from a panel of atypical mycobacteria for its ability to evoke cell mediated immune responses against M. leprae in leprosy patients (2,9). MIP, which shares antigens with both M. leprae and M. tuberculosis, provides protection against M. tuberculosis infection in mice (3,10,12,13) and accelerates sputum conversion in both type I and type II category of tuberculosis (TB) patients when used as an adjunct to chemotherapy (14,15). In HIV/TB co-infections, a single dose of MIP converted tuberculin −ve patients into tuberculin +ve in >95% of the cases (16). This attribute is unique to MIP because similar application of other saprophytic mycobacteria such as M. vaccae does not provide commensurate protection (17). Based on its demonstrated immunomodulatory action in various human diseases, MIP is the focus of several clinical trials (Table 1) and successful completion of one such trial has led to its use as an immunotherapeutic vaccine ‘Immuvac’ against leprosy (18). However, very little information is available about MIP’s molecular, biochemical, genetic and phylogenomic features. Table 1. Ongoing clinical trials of MIP in a diverse set of diseases Recently, in a molecular phylogenetic study by using candidate marker genes and FAFLP (fluorescent-amplified fragment length polymorphism techniques) fingerprinting assay, we showed that MIP belongs to a group of opportunistic mycobacteria and is a predecessor of M. avium complex (MAC) (19). A comprehensive analysis of cellular and biochemical features of MIP along with chemotaxonomic markers such as FAME (fatty acid methyl ester) analysis and comparison with other mycobacterial species established that MIP is endowed with specific attributes (4). 
It has a growth rate (time of colony appearance ∼6–8 days) that is faster than the typical slow growers such as M. tuberculosis (∼3 weeks) and slower in comparison with typical fast growers, such as M. smegmatis (∼3 days), and thus placing MIP somewhere in-between the slow and fast grower mycobacterial species (4). In Mycobacterium, fast growers usually represent non-pathogenic organisms whereas slow growers are usually specialized pathogens. MIP does not cause any infection in mice, guinea pigs and monkeys, the animal models in which it has been tested (6). Biochemical analysis also showed that MIP shares several features that are exclusive to either slow growers or fast growers (4). Even the FAME profiling of MIP, a key test for appropriate taxonomic placement of microbes, and its comparison with the fatty acid complement from other mycobacterial species corroborated the placement of this saprophyte in between fast and slow growers (4). Thus, MIP represents an organism placed at an evolutionarily transitory position with respect to a fast grower and a slow grower or a saprophyte and a seasoned pathogen. It is known that mycobacterial species represent one of the most dramatic examples of host tropism and habitat diversification. Mycobacterium has more than 125 notified species including saprophytes such as M. smegmatis, immunomodulators such as M. habana, M. vaccae and MIP, opportunist M. avium and strict intracellular pathogens like M. tuberculosis and M. leprae. This unmatchable competence of mycobacterial organisms and their diverse physiological characteristics can be attributed to the genome dynamics including genome organization, gene content, coordinated gene expression and ability to interact with the host machinery. An important unanswered question in this context remains as to how the soil-living saprophytic mycobacterial species turned into one of the most notorious intracellular pathogens. 
Thus, understanding of the genomic basis of habitat diversification could be crucial in evolving effective control measures against mycobacterial infections. Unfortunately, despite the publication of several mycobacterial genomes (20–23), the understanding and details of advent of parasitism within mycobacterial lineages remain obscure (especially with in MAC) although the evolution of niche adapted parasitic forms by genomic downsizing is an accepted norm in M. tuberculosis complex (21,23). In fact, formal genetic studies on species differences and divergences in mycobacteria have been severely limited by the unavailability of a related organism that represents the border of optimization between saprophytic and pathogenic mycobacterial species. In prokaryotic evolution, a few species such as Shigella flexneri and Yersinia pestis have been identified, which represent an early stage of host restricted adaptation by means of genome shedding (24). MIP because of its unique phylogenetic placement and associated biochemical features seems to be the first case of a mycobacterium species caught in transition just before it resorted to the pathogenic adaptations. Thus, it provides a unique opportunity to understand evolutionary divergence and putative bottlenecks responsible for the advent of intracellular mode of survival and pathogenic attributes in mycobacteria. We have sequenced complete MIP genome to gain an insight into its unique life style and molecular basis of immunomodulation. In addition, we have employed comparative genomics to understand the habitat diversification and bases and means of functional genetic correlates responsible for evolution of pathogenicity in ancestral mycobacterial lineages.
Title	INTRODUCTION
Table caption	Table 1. Ongoing clinical trials of MIP in a diverse set of diseases
Section	MATERIALS AND METHODS Sequencing of MIP genome The genome sequence of MIP was determined by employing Sanger sequencing by using a hybrid strategy of sequencing shot gun libraries (2 and 5 kb) and partial sequencing of some clones of large insert sized (>125 kb) BAC (bacterial artificial chromosome) library. Briefly, genomic DNA was isolated and whole genome shotgun libraries with average insert size of 2–3 kb and 4–5 kb were prepared by hydroshearing. Fragments of required size were gel-eluted, blunt-ended and cloned in plasmid vector pUC19. Clones were randomly picked from libraries having more than 90% insert and sequenced by Sanger’s di-deoxy terminator chemistry on ABI 3700 machines. A high quality BAC library was also prepared (www.mwg-biotech.com (15 August 2012, date last accessed)) and end sequenced by employing Sanger’s method to create a physical map of MIP genome that assisted in gap filling and resolving the ambiguities in genome assembly. Gap closing and the re-sequencing of low-quality regions were performed by sequencing the PCR products and the appropriate plasmid clones. These data were assembled by using the PHRED-PHRAP-CONSED package of software on four processor SunFire V400 series of server. Identification of open reading frames (ORFs) was carried out with the help of GLIMMER gene prediction software (25). Protein localization analysis was carried out with the help of PSORTB (26). Comparative proteome analysis of MIP with other species Functional annotation was carried out on the basis of sequence alignment with the known mycobacterial proteins as well as the COG (clusters of orthologous groups of proteins) (27) database with the help of BLAST (28) package. Several perl scripts were developed in-house for data analysis. 
To understand the effect of gene variations on the habitat diversification in mycobacterial species with respect to MIP, we performed BLAST analysis of MIP proteome against the proteomes of 18 other mycobacterial species used in this study. They were assigned to specific lineages of pathogenic and saprophytic mycobacteria based on their characteristic features, habitat and available literature. The members of M. tuberculosis complex including M. marinum and M. ulcerans and those belonging to M. avium complex were categorized as pathogenic group whereas rest of them were grouped as environmental mycobacteria. The positive hits against MIP proteins were filtered out and remaining genes (unique with respect to species under investigation) were analysed for their function based on COG classification and were quantified. This dataset was obtained for each species of both groups and was viewed as variation in unique gene content in each function category with respect to MIP. Analysis of rate of natural selection (Ka/Ks analysis) To understand the role of selection on speciation in MAC, the orthologous group of genes between MIP and M. avium subsp. hominissuis (MAH- human strain) and MIP and M. avium paratuberculosis (MAP-animal strain), were identified by using InParanoid program (29). This method bypasses multiple alignments and phylogenetic tree-based conventional approaches to detect orthology and thus minimizes any bias arising due to alignment or phylogeny method in the identification of orthologs. First, all possible pair wise similarity scores that scored higher than a cutoff value (bit score ≥ 50, overlap ≥70%, e ≤10−10) were detected from all-against-all BLAST comparisons and then the reciprocal genome-specific best hits were marked as orthologs. The orthologs were subsequently classified based on functional categories as per the similarity searches against COG database. 
The orthologs were aligned by using ClustalW (30) and each alignment was manually inspected for its correctness. Pairwise estimates of the non-synonymous (Ka) and synonymous (Ks) substitution rates were obtained by KaKs_Calculator program by using a maximum likelihood method based on the HKY85 model (31). Analysis of lateral gene acquisitions in MIP A combination of parametric methods, comparative genomics and phylogenetic approaches was employed to predict laterally acquired genes in MIP. First of all, we employed the three most popular parametric approaches namely Alien Hunter (32), genomic signature analysis (33) and by analysing atypical GC content of each ORF. Alien Hunter implements an interpolated variable order motifs theory to predict compositionally deviating regions with the highest recall value. The genome sequence of MIP was scanned and the fine tuning of the co-ordinates of alien regions was carried out by using advance optimization algorithm available in Alien Hunter. Besides, each MIP ORF was analysed for its length and nucleotide composition with respect to total and positional G + C contents (G + C [T], G + C [1], G + C [2] and G + C [3]). The genes were considered as extraneous on the basis of G + C content, if their total G + C (T) content deviated by >1.5 ς from the mean value of their genome or if deviations of G + C [1] and G + C [3] were of the same sign and at least one of them was >1.5 ς (34). The genes shorter than 300 bp and the genes coding for ribosomal genes were excluded from this analysis to avoid any extraneous results. We further augmented our analysis of MIP genome by using genomic signature based method previously used for mycobacteria (33,35). The genes that were scored by more than one method in these analyses were considered as laterally acquired. 
The genes confirmed by both genomic signature and GC content based methods were referred as recently acquired and these signatures were used to ascertain their likely source of acquisition. Further, we used the power of comparative genomics by analysing MIP genes for their presence/absence across available mycobacterial species ≤e−10). MIP regions having a non-uniform gene distribution across various mycobacteria, which are not scored by Alien Hunter, were annotated as RRD (regions of restricted distribution of genes). RRD has been defined as the region in MIP genome, which harbors the genes that are absent in a minimum of 33% of species investigated in this study and is at least represented by three contiguous genes or a region of >3 kb. The genes, which are absent in more than 50% of the species investigated were then referred as laterally acquired in RRDs and elsewhere in MIP genome. All the genes identified as possible lateral acquisitions in MIP were probed against COG database to analyse the functional role of gene acquisitions. Besides, laterally acquired genes were analysed by BLASTP algorithm against ACLAME (36), a database dedicated for the classification of mobile genetic elements (MGEs). Other in silico analysis and stress experiments CRISPR analysis was performed by using CRISPRFinder (37). Annotation of transporter genes was carried out by TransAAP (38). Pathogenic islands were inferred from PAIDB (39), the pathogenic island database. MIP was analysed by using Virulence factor database (VFDB) to ascertain the status of genes associated with virulence (40). In silico prediction of antigenicity was carried out with VAXIJEN (41). PFAM (http://pfam.sanger.ac.uk (15 August 2012, date last accessed).) was used to analyse and draw protein domains in a scaled manner. Motif scan tool at MyHits web server (http://myhits.isb-sib.ch (15 August 2012, date last accessed)) was used for further analysis of proteins and motifs (42). 
Phylogenetic analysis was performed by using the maximum likelihood method available on the Phylogeny.fr server (43). The influence of nutritional stress on MIP was evaluated on the basis of viable cell count at different time points (44). Statistical analyses Variations in gene distribution across different lineages were analysed by two-way ANOVA followed by Bonferroni post-tests. P < 0.05 was considered statistically significant. For studying natural selection, Fisher’s exact test for small samples (built into the KaKs_Calculator program) was applied to assess the validity of the Ka and Ks values calculated in this study. Only the ortholog pairs with P < 0.05 were considered for further analysis to infer the rate of natural selection. A paired t-test was performed to ascertain the significance of differences in the rate of selection between different organisms (P < 0.05). Total number of mycobacterial species analysed in this study ( = 18) The genome sequences along with annotations for the following organisms were downloaded from NCBI genome databanks and used in this study: M. marinum, M. ulcerans, M. tuberculosis H37Rv, M. tuberculosis H37Ra, M. tuberculosis CDC1551, M. tuberculosis F11, M. bovis, M. leprae, M. bovis BCG, M. avium subsp. paratuberculosis, M. avium 104, M. smegmatis, M. gilvum, M. abscessus, M. vanbaalenii, M. sp. JLS, M. sp. KMS and M. sp. MCS.
Title	MATERIALS AND METHODS
Section	Sequencing of MIP genome The genome sequence of MIP was determined by employing Sanger sequencing by using a hybrid strategy of sequencing shot gun libraries (2 and 5 kb) and partial sequencing of some clones of large insert sized (>125 kb) BAC (bacterial artificial chromosome) library. Briefly, genomic DNA was isolated and whole genome shotgun libraries with average insert size of 2–3 kb and 4–5 kb were prepared by hydroshearing. Fragments of required size were gel-eluted, blunt-ended and cloned in plasmid vector pUC19. Clones were randomly picked from libraries having more than 90% insert and sequenced by Sanger’s di-deoxy terminator chemistry on ABI 3700 machines. A high quality BAC library was also prepared (www.mwg-biotech.com (15 August 2012, date last accessed)) and end sequenced by employing Sanger’s method to create a physical map of MIP genome that assisted in gap filling and resolving the ambiguities in genome assembly. Gap closing and the re-sequencing of low-quality regions were performed by sequencing the PCR products and the appropriate plasmid clones. These data were assembled by using the PHRED-PHRAP-CONSED package of software on four processor SunFire V400 series of server. Identification of open reading frames (ORFs) was carried out with the help of GLIMMER gene prediction software (25). Protein localization analysis was carried out with the help of PSORTB (26).
Title	Sequencing of MIP genome
Section	Comparative proteome analysis of MIP with other species Functional annotation was carried out on the basis of sequence alignment with the known mycobacterial proteins as well as the COG (clusters of orthologous groups of proteins) (27) database with the help of BLAST (28) package. Several perl scripts were developed in-house for data analysis. To understand the effect of gene variations on the habitat diversification in mycobacterial species with respect to MIP, we performed BLAST analysis of MIP proteome against the proteomes of 18 other mycobacterial species used in this study. They were assigned to specific lineages of pathogenic and saprophytic mycobacteria based on their characteristic features, habitat and available literature. The members of M. tuberculosis complex including M. marinum and M. ulcerans and those belonging to M. avium complex were categorized as pathogenic group whereas rest of them were grouped as environmental mycobacteria. The positive hits against MIP proteins were filtered out and remaining genes (unique with respect to species under investigation) were analysed for their function based on COG classification and were quantified. This dataset was obtained for each species of both groups and was viewed as variation in unique gene content in each function category with respect to MIP.
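The filtering step described above can be illustrated with a minimal Python sketch. It assumes that each protein of a comparison species already carries a COG category and that the set of proteins with a positive BLAST hit against MIP is known; the function name, the identifiers and the toy data are hypothetical and are not taken from the in-house scripts used in the study.

```python
from collections import Counter

def unique_genes_by_cog(species_cog, proteins_with_mip_hit):
    """Count proteins without a MIP hit, grouped by COG category.

    species_cog: dict mapping protein id -> COG category letter.
    proteins_with_mip_hit: set of protein ids that matched a MIP protein.
    """
    counts = Counter()
    for protein_id, cog_category in species_cog.items():
        if protein_id not in proteins_with_mip_hit:
            counts[cog_category] += 1
    return counts

# Hypothetical toy input: four proteins, two of which have a MIP hit.
species_cog = {"p1": "I", "p2": "Q", "p3": "I", "p4": "L"}
proteins_with_mip_hit = {"p1", "p4"}
unique_counts = dict(unique_genes_by_cog(species_cog, proteins_with_mip_hit))
```

Running the same count for every species in the pathogenic and environmental groups would give the per-category variation in unique gene content relative to MIP.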
Title	Comparative proteome analysis of MIP with other species
Section	Analysis of rate of natural selection (Ka/Ks analysis) To understand the role of selection on speciation in MAC, the orthologous groups of genes between MIP and M. avium subsp. hominissuis (MAH, human strain) and between MIP and M. avium subsp. paratuberculosis (MAP, animal strain) were identified by using the InParanoid program (29). This method bypasses the conventional multiple alignment and phylogenetic tree-based approaches to detect orthology and thus minimizes any bias arising from the alignment or phylogeny method in the identification of orthologs. First, all pairwise similarity scores that passed the cutoff values (bit score ≥ 50, overlap ≥ 70%, E-value ≤ 10^−10) were detected from all-against-all BLAST comparisons and then the reciprocal genome-specific best hits were marked as orthologs. The orthologs were subsequently classified into functional categories on the basis of similarity searches against the COG database. The orthologs were aligned by using ClustalW (30) and each alignment was manually inspected for its correctness. Pairwise estimates of the non-synonymous (Ka) and synonymous (Ks) substitution rates were obtained with the KaKs_Calculator program by using a maximum likelihood method based on the HKY85 model (31).
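The cutoff and reciprocal-best-hit logic can be sketched as follows. This is a simplified illustration only and not the InParanoid algorithm itself, which additionally clusters in-paralogs. The hit records, field names and protein identifiers are hypothetical, and the overlap is assumed to be given as a fraction of the sequence length.

```python
def passes_cutoff(hit, min_bits=50.0, min_overlap=0.70, max_evalue=1e-10):
    """Apply the stated thresholds: bit score, overlap and E-value."""
    return (hit["bits"] >= min_bits
            and hit["overlap"] >= min_overlap
            and hit["evalue"] <= max_evalue)

def best_hits(hits):
    """Map each query to its highest-scoring subject among passing hits."""
    best = {}
    for hit in hits:
        if not passes_cutoff(hit):
            continue
        current = best.get(hit["query"])
        if current is None or hit["bits"] > current["bits"]:
            best[hit["query"]] = hit
    return {query: hit["subject"] for query, hit in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical BLAST hits between two proteomes.
a_vs_b = [
    {"query": "mipA", "subject": "mahA", "bits": 200.0, "overlap": 0.90, "evalue": 1e-50},
    {"query": "mipA", "subject": "mahB", "bits": 80.0, "overlap": 0.80, "evalue": 1e-15},
    {"query": "mipB", "subject": "mahB", "bits": 45.0, "overlap": 0.90, "evalue": 1e-12},
]
b_vs_a = [
    {"query": "mahA", "subject": "mipA", "bits": 198.0, "overlap": 0.90, "evalue": 1e-49},
    {"query": "mahB", "subject": "mipB", "bits": 45.0, "overlap": 0.90, "evalue": 1e-12},
]
ortholog_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input the mipB/mahB pair is dropped because its bit score falls below 50, so only one pair remains as a putative ortholog.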
Title	Analysis of rate of natural selection (Ka/Ks analysis)
Section	Analysis of lateral gene acquisitions in MIP A combination of parametric methods, comparative genomics and phylogenetic approaches was employed to predict laterally acquired genes in MIP. First of all, we employed the three most popular parametric approaches namely Alien Hunter (32), genomic signature analysis (33) and by analysing atypical GC content of each ORF. Alien Hunter implements an interpolated variable order motifs theory to predict compositionally deviating regions with the highest recall value. The genome sequence of MIP was scanned and the fine tuning of the co-ordinates of alien regions was carried out by using advance optimization algorithm available in Alien Hunter. Besides, each MIP ORF was analysed for its length and nucleotide composition with respect to total and positional G + C contents (G + C [T], G + C [1], G + C [2] and G + C [3]). The genes were considered as extraneous on the basis of G + C content, if their total G + C (T) content deviated by >1.5 ς from the mean value of their genome or if deviations of G + C [1] and G + C [3] were of the same sign and at least one of them was >1.5 ς (34). The genes shorter than 300 bp and the genes coding for ribosomal genes were excluded from this analysis to avoid any extraneous results. We further augmented our analysis of MIP genome by using genomic signature based method previously used for mycobacteria (33,35). The genes that were scored by more than one method in these analyses were considered as laterally acquired. The genes confirmed by both genomic signature and GC content based methods were referred as recently acquired and these signatures were used to ascertain their likely source of acquisition. Further, we used the power of comparative genomics by analysing MIP genes for their presence/absence across available mycobacterial species ≤e−10). 
MIP regions having a non-uniform gene distribution across various mycobacteria, which are not scored by Alien Hunter, were annotated as RRD (regions of restricted distribution of genes). RRD has been defined as the region in MIP genome, which harbors the genes that are absent in a minimum of 33% of species investigated in this study and is at least represented by three contiguous genes or a region of >3 kb. The genes, which are absent in more than 50% of the species investigated were then referred as laterally acquired in RRDs and elsewhere in MIP genome. All the genes identified as possible lateral acquisitions in MIP were probed against COG database to analyse the functional role of gene acquisitions. Besides, laterally acquired genes were analysed by BLASTP algorithm against ACLAME (36), a database dedicated for the classification of mobile genetic elements (MGEs).
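The G + C criterion described above can be sketched in Python as shown below. The sketch assumes that the mean and standard deviation are taken over all retained ORFs of the genome and that ORFs are supplied in frame; it omits the exclusion of ribosomal genes, and all names and sequences are hypothetical toy values rather than MIP data.

```python
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_profile(orf):
    """Total G + C and G + C at codon positions 1 and 3 of an in-frame ORF."""
    return {"T": gc_fraction(orf),
            "1": gc_fraction(orf[0::3]),
            "3": gc_fraction(orf[2::3])}

def flag_atypical_orfs(orfs, threshold=1.5, min_len=300):
    """Return sorted names of ORFs that meet the G + C deviation rule."""
    # ORFs shorter than min_len are excluded, as in the described analysis.
    profiles = {name: gc_profile(seq) for name, seq in orfs.items()
                if len(seq) >= min_len}
    if not profiles:
        return []
    stats = {}
    for key in ("T", "1", "3"):
        values = [profile[key] for profile in profiles.values()]
        stats[key] = (mean(values), pstdev(values))

    def z_score(profile, key):
        mu, sd = stats[key]
        return (profile[key] - mu) / sd if sd > 0 else 0.0

    flagged = []
    for name, profile in profiles.items():
        z_total = z_score(profile, "T")
        z1, z3 = z_score(profile, "1"), z_score(profile, "3")
        same_sign = z1 * z3 > 0
        if abs(z_total) > threshold or (same_sign and max(abs(z1), abs(z3)) > threshold):
            flagged.append(name)
    return sorted(flagged)

# Hypothetical toy genome: nine GC-rich ORFs, one AT-rich ORF, one short ORF.
orfs = {f"orf{i}": "GCC" * 100 for i in range(1, 10)}
orfs["alien"] = "ATA" * 100
orfs["short"] = "AT" * 10
flagged = flag_atypical_orfs(orfs)
```

A gene flagged here would still need support from a second method (Alien Hunter or genomic signature analysis) before being considered laterally acquired under the scheme described in this section.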
Title	Analysis of lateral gene acquisitions in MIP
Section	Other in silico analysis and stress experiments CRISPR analysis was performed by using CRISPRFinder (37). Annotation of transporter genes was carried out with TransAAP (38). Pathogenic islands were inferred from PAIDB (39), the pathogenic island database. MIP was analysed by using the Virulence factor database (VFDB) to ascertain the status of genes associated with virulence (40). In silico prediction of antigenicity was carried out with VAXIJEN (41). PFAM (http://pfam.sanger.ac.uk (15 August 2012, date last accessed)) was used to analyse and draw protein domains in a scaled manner. The Motif scan tool at the MyHits web server (http://myhits.isb-sib.ch (15 August 2012, date last accessed)) was used for further analysis of proteins and motifs (42). Phylogenetic analysis was performed by using the maximum likelihood method available on the Phylogeny.fr server (43). The influence of nutritional stress on MIP was evaluated on the basis of viable cell count at different time points (44).
Title	Other in silico analysis and stress experiments
Section	Statistical analyses Variations in gene distribution across different lineages were analysed by two-way ANOVA followed by Bonferroni post-tests. P < 0.05 was considered statistically significant. For studying natural selection, Fisher’s exact test for small samples (built into the KaKs_Calculator program) was applied to assess the validity of the Ka and Ks values calculated in this study. Only the ortholog pairs with P < 0.05 were considered for further analysis to infer the rate of natural selection. A paired t-test was performed to ascertain the significance of differences in the rate of selection between different organisms (P < 0.05).
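Two of the calculations named above can be sketched with the Python standard library alone: the paired t statistic and a generic Bonferroni adjustment. This is an illustrative sketch, not the statistical software used in the study; the input values are hypothetical, and the conversion of the t statistic to a P value (which needs the t distribution) is left out.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    standard_error = sd / sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical paired values, e.g. a rate measured for the same gene in two comparisons.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0], [0.0, 1.0, 1.0])
adjusted = bonferroni([0.01, 0.04, 0.5])
```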
Title	Statistical analyses
Section	Total number of mycobacterial species analysed in this study ( = 18) The genome sequences along with annotations for the following organisms were downloaded from NCBI genome databanks and used in this study: M. marinum, M. ulcerans, M. tuberculosis H37Rv, M. tuberculosis H37Ra, M. tuberculosis CDC1551, M. tuberculosis F11, M. bovis, M. leprae, M. bovis BCG, M. avium subsp. paratuberculosis, M. avium 104, M. smegmatis, M. gilvum, M. abscessus, M. vanbaalenii, M. sp. JLS, M. sp. KMS and M. sp. MCS.
Title	Total number of mycobacterial species analysed in this study ( = 18)
Section	RESULTS AND DISCUSSION Genome sequencing and general features of MIP genome Sequencing of MIP (DSM 45 239T) genome was carried out by whole genome shotgun (WGS) approach. A total of 109 792 paired end reads, comprising of more than 10× coverage of MIP genome, were generated from randomly picked shotgun clones from both ∼2 and ∼5 kb shotgun libraries followed by gap filling and sequence improvement. Sequence assembly with PHRAP resulted in the assembly of 93 592 shotgun sequences leading to a single circular MIP chromosome of 5 589 007 bp (Figure 1). This was subsequently validated by a BAC end sequence based physical map of MIP genome. Mycobacterial genomes range from 3.5 to 7 Mb and MIP with a size of ∼5.6 Mb represents a moderate genome size, which is larger than all known organisms of MAC. The genome contains 5270 predicted ORFs (at a density of ∼1 gene/kb), a single rRNA operon and 45 tRNA genes; these ORFs account for ∼91% of the genome (Table 2). The mean G + C content of MIP genome is 68%. However, the cumulative nucleotide skew analysis revealed several regions with a G + C content clearly divergent from this mean value, which cover considerable area in MIP genome and constitute potential sites to investigate for laterally acquired genes (Figure 1). The putative ‘ori’ in MIP genome was identified by a relatively AT rich region with characteristic DnaA boxes and a typical gene order of ‘rpnP-dnaA-dnaN’. The ‘ATG’ was found to be the most frequent start codon (56.5%) followed by ‘GTG’ (37.5%) and ‘TTG’ (5.9%). Like M. tuberculosis, MIP has an even distribution of ORFs on both strands with respect to the direction of replication (2656 on the leading strand and 2614 ORFs on lagging strand) (4). PSORTB analysis indicated that 55.5% of MIP proteins are cytoplasmic in nature, 13.5% are localized in the cytoplasmic membrane and only 3.5% are extra-cellular in nature (26). However, the precise localization of 27.5% of the proteins could not be ascertained. 
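The summary figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (genome size, ORF count, coding fraction and strand counts); the variable names are illustrative, and the coding fraction is the rounded ∼91% value, so the derived mean ORF length is approximate.

```python
GENOME_BP = 5_589_007       # chromosome length in bp
N_ORFS = 5270               # predicted ORFs
CODING_FRACTION = 0.91      # rounded share of the genome covered by ORFs
LEADING, LAGGING = 2656, 2614  # ORFs on leading and lagging strands

# Gene density in genes per kilobase.
genes_per_kb = N_ORFS / (GENOME_BP / 1000)

# Approximate mean ORF length implied by the coding fraction.
mean_orf_bp = CODING_FRACTION * GENOME_BP / N_ORFS

# Share of ORFs on the leading strand.
leading_share = LEADING / (LEADING + LAGGING)

print(f"{genes_per_kb:.2f} genes/kb, mean ORF ~{mean_orf_bp:.0f} bp, "
      f"{leading_share:.1%} on leading strand")
```

The strand counts sum to the reported ORF total, the density is close to the stated ∼1 gene/kb, and the near-equal strand split matches the even distribution described in the text.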
Figure 1. Circular representation of MIP genome. Whole genome sequencing of MIP revealed that it harbors a single circular chromosome of 5 589 007 bp. The accuracy of genome data assembly is ensured by a BAC end sequence based physical map of MIP genome. The size of MIP genome is much larger than the genome of any member of M. avium complex and thus is in agreement with the progenitor status of MIP (19). The red and blue tracks represent ORFs predicted in the sense and anti-sense orientation in relation to the ori (origin of replication). The inner most track represents the GC skew wherein sharp peaks of violet and yellow represent regions of AT and GC richness, respectively, and constitute potential targets for lateral gene analysis . Table 2. General genomic features of MIP BLAST-based comparative analysis of MIP ORFs (at a cut off value of ≥70% amino acid identity) revealed their maximum similarity with MAC organisms, which are evolutionarily close to MIP (Supplementary Figure S1). This is followed by M. marinum with which MIP shares over 51% of its coding sequences (CDS) (Supplementary Table S1). This observation is consistent with the status of MIP as the progenitor of MAC and supports the idea of a shared aquatic past between saprophytic and pathogenic mycobacteria (19,45). With M. tuberculosis, MIP shares only ∼40% of its proteins. However, the number of MIP ORFs (∼68%) shared by closely related MAC species strikingly differs in comparison with other related mycobacteria, which usually share over 90% of coding sequences even at identity >95% (22). This divergence could be a critical component for the elicitation of a robust yet unique immune response upon vaccination with MIP. Functional classification of MIP proteins To facilitate functional studies, MIP proteins were subjected to BLAST analysis against the COG database, which serve as a platform for functional annotation of newly sequenced genomes and for studies on genome evolution (27). 
On the basis of similarity with COG proteins, it was possible to assign functions to ∼80% of MIP proteins but ∼20% of the proteins still remain un-annotated. More significantly, ∼7.5% of proteins are unique to MIP and show no significant homology with other proteins present in mycobacterial proteomes. Several of these candidate orthologs are present in gene clusters, which are absent from most of the other mycobacteria, and thus indicating the modular nature of gene acquisitions or deletions in mycobacteria. Our analysis shows that 41.5% of MIP proteins belong to ‘Metabolism’ category, 11.5% to ‘ISP’ (information storage and processing), and 9.5% to ‘CPS’ (cellular processes and signaling) whereas 16.7% are ‘poorly’ categorized proteins (Figure 2). Within ‘Metabolism’ category, the genes pertaining to lipid transport and metabolism (I) were over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, majority of the proteins were related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%) and translational, ribosomal structure and biogenesis (J) (24.5%). In case of ‘CPS’, major representation comes from cell wall/membrane/envelope biogenesis (M) (27.6%) followed by posttranslational modifications (O) and signal transduction mechanisms (T) at 23 and 21.4%, respectively (Figure 2). Figure 2. Functional classification of MIP proteins. (A) Representation of MIP proteome based on the similarity of its proteins with COG database (27). (B) represents distribution in cell processing and signaling category (CPS), (C) denotes distribution of poorly characterized proteins in MIP while (D) and (E) stands for information storage and processing (ISP) and ‘metabolism’ related genes, respectively. It is evident that ∼42% of total MIP genes are involved in basic metabolic functions and ∼21% do not have any homology in COG database. 
Within ‘metabolism’ category, the genes involved in lipid transport and metabolism (I) are over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, majority of the proteins are related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%). Comparative proteome analysis of MIP with other species reveals the role of genomic fluidity in habitat diversification in Mycobacterium COG-based comparative analysis of gene distribution across mycobacterial proteomes highlights the presence of distinct genome fluidity. ‘ISP’ and ‘Metabolism’ proteins vary considerably with the maximum flexibility being observed in replication, recombination and repair (L), lipid transport and metabolism (I) and secondary metabolites biosynthesis and transport (Q), respectively (Figure 3). The minimum variations are observed in ‘CPS’ with nearly all sub-categories exhibiting a consistent representation. In ‘ISP’, the distribution of genes across all mycobacterial proteomes is almost consistent for translation, ribosomal structure and biogenesis (J) and chromatin structure and RNA processing (B), while a clear genomic fluidity is exhibited by the genes belonging to replication, recombination and repair (L). This category is least represented in MIP (3%) and maximally in M. ulcerans (10%). Similarly, the genes belonging to category K (transcription) are least represented in CDC1551 (5%) and maximally in M. smegmatis (9.4%), which is consistent with its saprophytic habitat. Figure 3. Comparative analysis of distribution of different mycobacterial proteomes under various COG functional categories. Different mycobacterial proteomes were downloaded from NCBI and subjected to COG-based BLAST analysis. The contribution of each functional category was calculated to observe the pattern of relative gene distribution across different mycobacterial species and plotted on this graph. 
(A) Distribution across ‘Metabolism’ category and various sub categories, (B) cell processing and signaling (CPS) and (C) information storage and processing (ISP). ‘X’ and ‘Y’ axes represent mycobacterial species and the number of mycobacterial proteins (in percentage), respectively. Our comparative analysis clearly highlights the presence of distinct genome fluidity in mycobacterial species across different functional groupings of genes. This genomic fluidity within different functional groups of proteins may contribute to the habitat diversification observed in mycobacterial species.

In ‘Metabolism’, while the genes related to nucleotide transport and co-enzyme transport show a consistent distribution, the genes belonging to secondary metabolite biosynthesis and transport (Q), amino acid transport (E) and lipid transport and metabolism (I) show major quantitative variations. ‘I’ has the maximum representation in MAC like MAH (∼11%), MAP (10%) followed by MIP (9.5%) while ‘E’ and ‘Q’ are best represented in M. smegmatis (9.5%) and MAC organisms (9–10%), respectively (Figure 3). In the case of carbohydrate transport and metabolism (G), all mycobacterial species have almost an equal representation except M. smegmatis, which harbors almost twice (7%) the percentage of genes dedicated to this function in other mycobacterial species. In most COG categories, M. leprae seems to have a distinctly biased distribution of proteins, probably indicative of the extensive gene-loss that the organism has undergone during evolution (21). Of all the mycobacteria, MIP has the least representation in L (3%) and E (amino acid transport) (4.3%) categories of genes. Although the distribution of genes is a species-specific attribute, variations in gene distribution across different lineages could provide an idea about the role of genomic fluidity in shaping the behavior of mycobacteria as saprophytes or host-adapted pathogens.
Hence, to get a comprehensive picture of habitat transformation, mycobacterial species were classified in two groups according to their known attributes: pathogenic (PGN), comprising M. tuberculosis complex (including M. marinum and M. ulcerans) and M. avium complex, and saprophytic or environmental (ENV) mycobacteria, comprising M. smegmatis, M. vanbaalenii, M. gilvum and others. MIP was placed in between saprophytic and pathogenic mycobacterial species because of its unique intermediate position and these two groups were investigated for the effect of gene variations in different COG classes with MIP as a common background (4). A two-way ANOVA analysis was performed to ascertain the statistical significance of the analysis. While the ENV–MIP transition was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function) [P < 0.001, 0.01, 0.001, 0.01, 0.001 and 0.001, respectively], ENV–PGN transitions involved extensive gene variations (Figure 4). In addition to the gene reduction observed in the earlier mentioned classes, reduction was also noticed in genes related to energy metabolism (C, G, I and Q [P < 0.05, 0.05, 0.001 and 0.05, respectively]) and a significant increase in L (replication, recombination and repair, P < 0.01) and N (cell motility and secretion, P < 0.01) related genes in the ENV–PGN transition. Noticeably, the habitat change from MIP to PGN lineages was primarily due to the loss of genes involved in I (lipid transport and metabolism, P < 0.001) and Q (secondary metabolite biosynthesis and transport, P < 0.001) and gain of genes in L, E and S [P < 0.001, 0.001 and 0.05, respectively] (Figure 4).
This observation agrees well with a reduced habitat diversity of pathogenic mycobacteria and points toward the role of genomic fluidity within selected gene functions in habitat specification. An increase in the representation of ‘L’ with the advent of pathogenicity offers an interesting paradigm, which warrants further studies in the model organisms.

Figure 4. Quantitative analysis of gene variations involved in habitat transformation in mycobacteria. This cartoon depicts variations across major functional gene groupings as mycobacterial species adapted to a pathogenic lifestyle from free-living environmental mycobacteria. Red lines denote loss of genes while the green ones denote the gene gain with a change of habitat. Although the ENV–MIP transition was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function), ENV–PGN transitions involved extensive gene variations and are consistent with the intermediate evolutionary position of MIP. Significantly, only two major gene categories reported gain of genes associated with the advent of pathogenicity: L (DNA replication, recombination and repair) and E (amino acid transport and metabolism). But transition from a purely saprophytic lineage to a pathogenic habitat is associated with genes categorized into ‘L’ only, which also contains transposon elements. Indeed, we observe that a saprophytic mycobacterium like MIP has only 38 transposon-like elements compared with 302 found in the similar-sized pathogenic mycobacterial species M. ulcerans. A two-way ANOVA analysis was used to ascertain statistical significance.
Role of natural selection in speciation in Mycobacterium Measurement of the rate of non-synonymous (leading to change in amino acid) and synonymous (silent) nucleotide substitutions in protein-coding DNA sequences is the most referred criterion for detecting natural selection in molecular evolutionary analysis (46). Significantly higher non-synonymous nucleotide substitutions (Ka) over the synonymous (Ks) ones are interpreted as an evidence of positive natural selection. Hence, to understand the contribution of selection in speciation, we have used closely related and phylogenetically independent species of M. avium complex of which MIP is a predecessor (19). Orthologs were identified using Inparanoid tool (29) and dataset of ∼2600 gene pairs representing >80% of the orthologs shared among different species of MAC was obtained to perform comparative analysis of the rate of selection for human-adapted (MIP–MAH) and animal-adapted (MIP–MAP) niches from a saprophytic MIP. The evaluation of rate of selection (Ka/Ks) revealed strong purifying selection (∼0.06) acting on both human and animal adapted lineages. However, further resolution of analysis based on protein function revealed a significant difference in the rate of selection only for the genes involved in energy production and conversion (C) (P < 0.03, unpaired t test) (Figure 5). Also, very few genes, mostly distributed in metabolic pathways, were found to have undergone strong positive selection (Supplementary Table S2) suggesting their relevance in undergoing niche-specific adaptations in Mycobacterium. A very strong positive selection (>50 times of average rate) was observed in ComEC (MIP2580) (47), the competence protein required for exogenous DNA uptake during natural transformation, which can critically influence the ability to acquire foreign DNA in microbial species. Incidentally, we found this gene to be pseudogenized in MAP, which usually results from an excessive positive selection. 
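The interpretation of Ka/Ks values described above can be illustrated with a minimal sketch. It assumes that Ka and Ks have already been estimated for each ortholog pair by an external tool; the function names, the tolerance around 1 and the example gene labels are hypothetical and are not taken from this study. The baseline of 0.06 and the 50-fold cut-off simply mirror the values quoted in the text.

```python
from typing import Dict, List, Optional


def ka_ks(ka: float, ks: float) -> Optional[float]:
    """Return the Ka/Ks ratio, or None when Ks is zero (ratio undefined)."""
    if ks == 0:
        return None
    return ka / ks


def classify(ratio: Optional[float], tol: float = 0.05) -> str:
    """Label a ratio; the tolerance around 1 is an illustrative assumption."""
    if ratio is None:
        return "undefined"
    if ratio > 1 + tol:
        return "positive"
    if ratio < 1 - tol:
        return "purifying"
    return "neutral"


def strong_outliers(ratios: Dict[str, Optional[float]],
                    baseline: float,
                    fold: float = 50.0) -> List[str]:
    """Return genes whose ratio exceeds `fold` times the baseline rate."""
    return sorted(
        gene for gene, ratio in ratios.items()
        if ratio is not None and ratio > fold * baseline
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {"geneA": 0.05, "geneB": 3.5, "geneC": None}
    print(classify(ka_ks(0.003, 0.05)))            # a low ratio
    print(strong_outliers(example, baseline=0.06))  # genes far above baseline
```

The baseline is passed in explicitly rather than computed from the same gene set, so that a single extreme gene cannot inflate the average against which it is compared.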
In a recent study (48) based on SNP analysis, it was argued that recombination may influence the rate of selection in extremely closely related species of M. tuberculosis complex (average nucleotide identity >98% across different species). Even though MIP is likely to have minimal homologous recombination events because of sequence heterogeneity with MAP and MAH, the likelihood of recombination and lateral gene transfer influencing the rate of selection cannot be completely discounted.

Figure 5. Role of natural selection in speciation in MAC. Analysis of average rate of natural selection (Ka/Ks) among MIP–MAP and MIP–MAH lineages revealed the presence of a similar purifying selection (46). This implies that both mycobacterial lineages have undergone an independent evolution into their respective host adapted forms from MIP. However, a significant skew in selection rate (Ka/Ks) is observed in genes categorized into energy production and conversion (C), thus establishing the role of metabolism-related genes in the evolution of host tropism. Also, a strong positive selection (>50 times of average selection) was observed on the ComEC gene that encodes a competence protein required for DNA uptake and natural transformation (47). Such a strong selection on this gene indicates that ComEC has played an important role in modulating the efficiency of DNA uptake during mycobacterial evolution. In fact, in the case of MAP, this gene is found to be pseudogenized, which usually results from an excessive positive selection.

Identification of laterally transferred genes reveals massive gene acquisitions and mosaic architecture of MIP genome

Identification of laterally acquired genes is an important paradigm, which is cardinal to gaining a deeper insight into microbial evolution (49). Hence, after analysing the role of natural selection in speciation, we were keen to analyse the contribution of lateral gene acquisitions in MIP.
The precise and accurate prediction of lateral gene transfer (LGT) events in an organism is challenging. First, detection of LGT may be influenced not only by source, size and quantity of lateral transfer but also by the genetic features associated with the recipient or host genome (50). Besides, LGT takes place by a variety of means and different tools may be required for better detection of LGT based on specific mechanisms of gene transfer (51). It is also known that different surrogate methods detect lateral acquisitions of different antiquities (52). Hence, all LGT are not amenable to detection by a single parametric method and the application of a combination of different methods is recommended to improve sensitivity of detection in different possible situations (50). However, while the simple addition of predictions from individual methods may increase false-positive rates, the consideration of strictly overlapping predictions as the inclusion criteria for LGT predictions is counterproductive because of the limited overlap of genes observed between different approaches (52,53). Nonetheless, it has been argued that even if the errors inherent to these individual methods are added, the overall benefit is worthy (50). Hence, to predict laterally acquired genes in MIP, we used three different parametric methods based on anomalous GC content of each ORF, genomic signature analysis and Alien Hunter predictions to score for likely LGT candidates. The genes were scored as laterally acquired only if they were predicted more than once. This would not only provide sensitivity of detection but also reduce the number of false-positive predictions associated with individual methods. We further augmented our analysis by using information on phylogenetic approaches and phyletic distribution of MIP genes in other mycobacterial species as an additional stand alone criterion to score LGT genes (54). 
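The inclusion rule described above (a gene is scored as laterally acquired only if it is predicted more than once across the three parametric methods) can be sketched as a simple vote count. This is an illustrative sketch, not the pipeline used in this study; the function name and the ORF identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set


def majority_lgt(gc_hits: Iterable[str],
                 signature_hits: Iterable[str],
                 alien_hunter_hits: Iterable[str],
                 min_votes: int = 2) -> Set[str]:
    """Return ORFs predicted by at least `min_votes` of the three methods."""
    votes: Dict[str, int] = {}
    for hits in (gc_hits, signature_hits, alien_hunter_hits):
        for orf in set(hits):  # each method votes at most once per ORF
            votes[orf] = votes.get(orf, 0) + 1
    return {orf for orf, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    # Hypothetical ORF identifiers, for illustration only.
    gc = {"orf1", "orf2", "orf3"}
    sig = {"orf2", "orf3", "orf4"}
    ah = {"orf3", "orf5"}
    print(sorted(gc | sig | ah))               # simple addition of predictions
    print(sorted(gc & sig & ah))               # strictly overlapping predictions
    print(sorted(majority_lgt(gc, sig, ah)))   # predicted more than once
```

In this toy case the majority rule returns fewer genes than the union of all predictions and more than the strict three-way intersection, which is the trade-off between sensitivity and false positives discussed in the text.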
Analysis of atypical GC content of ORFs (34) identified ∼28.5% (1503/5270) of MIP genes as putative candidates for LGT. A similar analysis with M. tuberculosis (MTB), M. avium paratuberculosis (MAP) and M. avium subsp. hominissuis (MAH) could only identify 4.3, 6.5 and 11.3% genes, respectively (55). Genomic signature approach could identify ∼33% of MIP genes as candidate LGT’s as compared to MTB, MAP and MAH wherein this approach yielded only 6%, 6.3 and 9.3% genes, respectively. Alien Hunter (32) predicted 85 probable laterally acquired regions (AL) comprising of 1298 (24.63%) ORFs (Supplementary Table S3). The regions around ‘ori’ (15 kb on both sides, upstream as well as downstream) and one harboring ribosomal genes were excluded from the evaluation to remove any possible bias. By using similar criteria with Alien Hunter, however, we could identify putative laterally acquired genes in MTB (21.5%), MAP (15.35%) and MAH (24.2%). More than 42% of MIP genes predicted by Alien Hunter are also shared by genomic signature analysis. A similar analysis with MTB, MAP and MAH showed an overlap of 17% (148/865), 31% (215/678) and 33.7% (421/1247), respectively, between Alien Hunter–predicted genes and genomic signature–based predictions. After applying our ‘majority’ based inclusion criteria, ∼6.2% of the genes emerged as laterally acquired in MTB, while MAP and MAH have 8.3 and 10.2% genes as LGT, respectively. By using this approach, ∼34% of MIP genes emerged as laterally acquired, which is significantly higher than in other mycobacterial species analysed in this study. A comparative analysis of MIP ORFs based on their restricted distribution within the other mycobacterial genomes identified additional 939 ORFs as plausible lateral acquisitions. This included 362 ORFs harbored by 93 defined RRDs (regions of restricted distribution of genes) in MIP (Supplementary Table S4) and 261 ORFs present in alien regions. 
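As an illustration of what a screen for atypical GC content involves, the sketch below flags ORFs whose GC fraction deviates strongly from the mean of all ORFs. It is not the method of reference (34): the z-score criterion, the cut-off of 1.5 standard deviations, the function names and the toy sequences are all assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import Dict, List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_orfs(orfs: Dict[str, str], z_cutoff: float = 1.5) -> List[str]:
    """Return ORF names whose GC fraction lies more than `z_cutoff`
    standard deviations from the mean GC fraction of all ORFs."""
    gc = {name: gc_fraction(seq) for name, seq in orfs.items()}
    if not gc:
        return []
    mu = mean(gc.values())
    sigma = pstdev(gc.values())
    if sigma == 0:
        return []  # all ORFs have identical GC content
    return sorted(n for n, v in gc.items() if abs(v - mu) / sigma > z_cutoff)


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    toy = {
        "a": "GCGCGCGCAT",
        "b": "GCGCGCGCTA",
        "c": "GCGGCCGCAT",
        "d": "ATATATATGC",
    }
    print(atypical_gc_orfs(toy))
```

A screen of this kind detects only acquisitions whose composition still differs from that of the host genome, which is one reason the text combines it with other methods.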
The incongruence observed in the phylogenetic analysis of some of these genes substantiated their laterally acquired nature. Overall, 50.5% (2664/5270) of MIP ORFs appear to be laterally acquired highlighting thereby the scale of evolutionary novelties undergone by this microbe (Figure 6). This study represents the first report of such massive gene acquisitions in mycobacteria and suggests mosaic architecture of MIP genome. Figure 6. Depiction of lateral gene acquisitions in MIP. Each column depicts one MIP gene and each row depicts one mycobacterial genome (total 18 genomes comprising of M. tuberculosis complex, M. avium complex and saprophytic mycobacterial species–see Materials and Methods and Supplementary Table S1 for species list); green and red denote presence and absence, respectively, of the MIP gene in other genomes. Pink denotes the regions predicted by Alien Hunter (32). Dark yellow represents RRDs, while black columns denote recently acquired genes identified by using atypical gene content and gene signatures (34, 35). Orange arrow denotes position of tRNA molecules, blue denotes genes with homologs in genomes other than mycobacteria, while brown denotes absence in the COG database. A very good overlap is observed between genes identified by using different methods. Most of the alien regions and RRD’s overlap with red, substantiating the effectiveness and accuracy of our approach. It is noteworthy that over 50% of MIP genome has emerged as laterally acquired, the highest reported so far for any of the mycobacterial species. The figure is scaled to approximation with each figure row denoting 1 Mb of genome and every tick mark denoting 100 kb along the lane. Analysis of laterally acquired genes by using COG functional classification revealed the maximum gain in lipid transport and metabolism category (I) followed by transcription (K)-related genes, which are usually under-represented among laterally acquired genes in prokaryotes (56). 
This was followed by the genes affiliated to secondary metabolites biosynthesis, transport and catabolism (Q) and energy production and conversion (C); these four categories together constitute ∼35% of the total lateral acquisitions. In addition, the LGT predictions based on atypical GC content of ORFs (34) and further validated by genomic signatures 1478 (28.05%) (Figure 7A) appear to retain their native genomic imprints, which are yet to be masked by natural selection. This points toward their relatively recent acquisition and hence, their likely source could be ascertained (33). Analysis based on genomic signatures revealed that majority of these recently acquired genes (∼85%) are most likely derived from actinobacterial species (Figure 7B) like Streptomyces (∼25%), Amycolatopsis (∼15%), Rhodococcus (7.5%) and Frankia (6.5%). These gene acquisitions might have been mediated by physical proximity and close interactions among different actinobacteria. Figure 7. Identification of recent lateral gene acquisitions in MIP and their analysis. (A) Individual gene signatures of recently acquired MIP genes along with the whole genome signature of MIP establish the alien nature of respective genes (33). These gene signatures are based on the frequency of distribution of tetranucleotide pattern across the whole genome and individual genes of MIP, which are color coded to generate a visual impression. (B) Distribution of recently acquired genes with respect to their most likely source of acquisition. Based on the genomic signatures, our analysis revealed that majority of these recently acquired genes (∼85%) are possibly derived from actinobacterial species. Mobile elements based gene acquisition in MIP are dominated by plasmid-mediated lateral gene transfers LGT events are usually mediated by mobile genetic elements like phages, transposons and plasmids. 
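The genomic signatures referred to above are based on tetranucleotide frequencies. A minimal sketch of how such a signature could be computed and compared is given below. The distance measure (sum of absolute frequency differences) and the function names are assumptions chosen for illustration and are not necessarily those used in references (33) or (35).

```python
from collections import Counter
from itertools import product
from typing import List

# All 256 tetranucleotides in a fixed order, so signatures are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_signature(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide in `seq`
    (overlapping windows; windows with non-ACGT characters are ignored)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def signature_distance(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two signatures (0 = identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    gene = tetra_signature("ACGTACGT")
    genome = tetra_signature("AAAAAAAA")
    print(signature_distance(gene, genome))
```

A gene whose signature lies far from that of its host genome but close to that of another genome would, under this kind of scheme, be a candidate for recent acquisition from the latter.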
BLAST analysis of laterally acquired genes against ACLAME (36), a database of mobile elements comprising all known phage genomes, plasmids and transposons, indicated mobile elements as likely source to 27.4% of these putative laterally acquired ORFs (<e−20). Majority of these genes exhibit similarity with plasmids and extremely small fraction with phages (2%) and IS elements (∼1.2%). The relative paucity of phage and IS elements mediated gene acquisitions and abundance of plasmid-acquired genes in case of MIP is surprising. In comparison with other mycobacterial species of similar sized genomes such as M. ulcerans (chromosome size ∼5.6 Mb), which has 302 IS elements/transposons (23), MIP has merely 38 genes harboring sequences consistent with IS signatures. This is consistent with our earlier analysis based on genomic fluidity across different mycobacterial lineages where variation in the number of transposable elements, which are classified in category L, was found to be associated with habitat diversification. It is tempting to envisage that MIP may harbor specific genomic determinants that either provide immunity from phages and transposons, or else predispose MIP toward plasmid-based gene acquisitions. MIP has a relatively higher number of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat) elements compared with other mycobacterial genomes (7 as compared with 1–2 in other mycobacterial genomes) (37). These CRISPR molecules not only provide immunity against invasion by phages and viruses (57) but also limit the mobility of IS elements in genome and help in their excision from genome (58). MIP also lacks RD1 region, the loss of which facilitates efficient conjugation with plasmids and other chromosomes to promote rapid acquisitions of genes (59). 
In addition, we found that MIP is particularly enriched in transporters of septal DNA translocator family (6 as against 1–2 present in other mycobacteria) (Table 3) that are known to bring about rapid acquisition of genes by mediating cell to cell DNA transfer during plasmid conjugation (60). The abundance of these genomic determinants coupled with the absence of RD1 locus may contribute to the propensity of MIP towards plasmid-mediated gene acquisitions.

Table 3. Comparative transporter analysis of MIP with other mycobacterial species. MSMEG, M. smegmatis; MAP, M. avium subsp. paratuberculosis; MTB, M. tuberculosis.

Effect of lateral gene acquisitions on different gene families in MIP

Two distinct observations have emerged from this analysis: (i) a large number of lateral gene acquisitions in MIP have been mediated through mobile elements with only a small contribution through phages and (ii) gene distribution among laterally acquired regions in MIP follows a skewed pattern with respect to function as indicated by over-representation of genes belonging to certain categories. To know the influence of LGT events on distribution of genes across MIP gene families, we performed a comprehensive analysis that showed that CYP450 is the largest gene family in MIP with 66 members and a gene density of ∼12/Mb. This is remarkably high in comparison with other mycobacterial species such as M. tuberculosis (4.5/Mb), M. smegmatis (5.6/Mb) and M. marinum (7.1/Mb). The analysis of Cytochrome P450 database (http://drnelson.uthsc.edu/CytochromeP450.html (15 August 2012, date last accessed)) revealed that MIP harbors the highest number of genes from CYP450 family among prokaryotes sequenced so far and ∼46% (30/66) of these genes are laterally acquired. Approximately 27% of these genes are recent acquisitions, suggesting a recent expansion of CYP450 family.
Approximately 11% of CYP450 genes were identified as unique by International CYP450 nomenclature commission and have been classified into three new families and two new sub-families of CYP450 (Table 4). The context-based analysis based on gene neighborhood suggested the role of these genes in the utilization of unusual carbon sources, a key to adaptability and survival of MIP at its most likely habitat at soil–water interface (19).

Table 4. List of CYP450 ORFs unique to MIP. (a) Nomenclature as per International Committee of CYP450 nomenclature. ‘A1’ refers to the first member of a new CYP450 family, whereas ‘B1’ refers to the first member of a new CYP450 sub-family.

MIP has 66 genes of PE–PPE family (16 PE and 50 PPE), which encompass complete repertoire of PPE genes present in various MAC species. These PPE genes appear to be selectively inherited by various species of MAC as evident from their distribution in MAP (36 genes), MAH (38 genes), M. avium subsp. avium (36 genes) and M. intracellulare (38 genes), respectively (61). This observation substantiates that evolutionarily MIP is a predecessor of MAC and endorses our earlier findings based on the rate of natural selection that speciation and habitat diversification has taken place independently from MIP. Comparative analysis of PE–PPE genes in MIP (with Nr database at NCBI) highlighted that five genes of PPE–SVP sub-family are unique to MIP. In addition, several of its PE genes are evolutionarily closer to those belonging to M. tuberculosis complex, which is in agreement with the shared evolutionary history of MIP and predecessors of M. tuberculosis complex (19). The presence of PPE gene family is unique to mycobacteria among prokaryotes; however, its origin remains unknown. A large number of these genes are laterally acquired in MIP prompting us to speculate that PE–PPE genes might have been introduced into mycobacteria through mobile elements.
The majority of PE–PPE gene clusters in MIP harbor genes related to mobile function activity such as phages, tRNA or 13e12 repeats in their vicinity. Besides, several of these PE–PPE genes exhibited the presence of Ig-like motifs often present in the proteins of tailed double stranded DNA bacteriophage particles. However, the most compelling evidence about the origin of these PE–PPE genes in mycobacteria emerged from the presence of an intact, PPE protein-containing prophage of ∼40 kb that we observed in MAH genome during ACLAME analysis (36), a direct evidence for a phage-mediated acquisition of PE–PPE family members.

Hemerythrins: a versatile gene family laterally acquired in MIP

The most surprising finding of MIP gene analysis, however, is the unusual presence of ORFs belonging to hemerythrin proteins (Hr) family, the oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids. MIP has 10 ORFs having significant similarity to hemerythrin (Hr) genes (Figure 8) in comparison with one or two copies in most prokaryotes (62). The prevalence of ‘Hr’ proteins strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at oxic–anoxic interface (OAI) and in the anoxic regions of the marine habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with the abundance of hemerythrins (62).

Figure 8. Distribution of hemerythrins in MIP. Domain mapping and BLAST searches indicated the presence of 10 ORFs belonging to hemerythrin genes in MIP. Hemerythrin proteins (Hr) are oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids and usually have only 1 or 2 copies in most prokaryotes (62).
The abundance of these genes strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at oxic–anoxic interface and in the anoxic regions of the habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with the abundance of hemerythrins (62,63). The ability of hemerythrins to reversibly bind to oxygen at higher oxygen concentrations and release it in anoxic conditions could also provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, has been shown to grow at 0% oxygen, to reach a plateau in 3 days and die thereafter (64). MIP2918 and MIP2747 did not harbor a definite ‘Hr’ domain and are identified by BLAST searches against NCBI ‘Nr’ database.

Domain mapping and comparative genomics with available mycobacterial genomes indicated the presence of two Hr domains in several MIP ORFs such as MIP2750, MIP2759, MIP5034 and MIP6380, a trait usually restricted only to proteobacteria (62). We could also identify putative homologs of ‘hr’ genes in mycobacteria such as M. marinum and M. ulcerans (1 each), M. tuberculosis (3), M. gilvum and M. smegmatis (4 each), M. avium complex (5), other environmental mycobacteria like M. sps. KMS, M. sps. JLS and M. vanbaalenii (three each) and none in the case of M. leprae. Considering the variations in the number of ‘hr’ genes in various species of mycobacteria and their significant sequence heterogeneity, it appears that acquisition of hemerythrins could have been a selective and independent event facilitating mycobacterial evolution. Incidentally, 90% of these ORFs in MIP are laterally acquired with over one-half of them being recent acquisitions. It should be noted that the efficiency of hemerythrins as oxygen storage proteins is directly dependent on oxygen concentration in its surrounding environment (63).
The selective enrichment of MIP with Hr proteins and the ability of hemerythrins to reversibly bind to oxygen at higher oxygen concentrations and release it in anoxic conditions could provide an explanation for the intriguing behavior of MIP, which, notwithstanding its aerobic life-style, can manage to grow at 0% oxygen, reach a plateau in 3 days and die thereafter (64).

Membrane transporters in MIP

Transport systems play a critical role in the life-endowing processes such as metabolism, metal homeostasis and secondary metabolite production, affecting thereby the physiology and lifestyle of the organism. In MIP, a total of 222 genes were annotated as membrane transporters comprising ∼4.2% of the total gene content and a transporter density of 39.73/Mb (Table 3). This is an apparent reflection on the unique evolutionary position of MIP as its transporter density is significantly lower than that of saprophytic M. smegmatis (60.43/Mb) and higher than M. tuberculosis (33.64/Mb), MAP (35.2/Mb) and M. leprae (17.2/Mb). It is likely that M. smegmatis and MIP, being saprophytic in nature, need extensive transport machinery to support their life style, whereas intracellular organisms owing to their relatively stable environment have a reduced transporter requirement. Comparative analysis revealed a selective abundance of transporters belonging to septal DNA Translocator family (S-DNA-T) with distinct homology with FtsK/SpoIIIE proteins, which could primarily be responsible for high propensity of MIP toward plasmid conjugation and gene acquisitions (60). Another unique seven gene cluster of CPA3 (Proton Antiporter-3) family present in RRD79 (Supplementary Table S4), the largest region with restricted distribution, is responsible for the species-defining ability of MIP to grow on 5% NaCl (4). The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila (Figure 9) that can grow in high salt concentration (65).
Along with Mn2+ transporters, MIP also possesses an abundance of catalases (4) and superoxide dismutases (5), some of them being laterally acquired, which mitigate oxidative stress and reflect not only upon the primitive origin of MIP but also equip it for intracellular adaptations. MIP can withstand the carbon starvation as evident from our experiments on nutritional stress. After 5 days of growth in PBS without any media or nutritional supplement, MIP exhibited no significant reduction in log CFU, reflecting upon its potential to undergo longer period of starvation. Thus, MIP appears to have fine-tuned its specific transport abilities by lateral gene acquisitions to gain physiological attributes required for its unique habitat. Figure 9. Phylogenetic analysis of CPA3 family cluster. This cluster is unique to MIP among mycobacterial species and each ORF of this complex encodes different subunits of a unique Na+/H+ antiporter. All genes have been laterally acquired as a unit, hence, a representative single gene is used to perform phylogenetic analysis by using maximum likelihood method available in Phylogeny Fr. Server (43). The numbers along the branches denote bootstrap values. The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila that can grow under salt conditions (3% NaCl w/v) (65). Genome-enabled ecophysiological and metabolic attributes of MIP and influence of LGT events The presence of a very high number of alternate sigma factors (24 in comparison with 13 in M. tuberculosis and MAP) endows MIP with a complex transcriptional flexibility necessary for it to respond to its unique life style (66). The interface life style (soil/water) of MIP, as substantiated by its genetic features, prompted us to look for the bio-degradative capabilities of MIP. 
In addition to the abundance of CYP450 genes, MIP has also laterally acquired homologs of 3-octaprenyl-4-hydroxybenzoate carboxy-lyase that are involved in anaerobic metabolism of phenol during degradation of plant substrates (67). Besides, as listed in Supplementary Table S5, it also possesses complete cyanide and thiocyanate biodegradation machinery including the complete enzyme complex (MIP3820-22) of thiocyanate hydrolase (alpha, beta and gamma subunits). This complex degrades thiocyanate and produces CO2 and NH3 that can be used by MIP as a nitrogen source (68). Notably, thiocyanate gene cluster is absent from all the pathogenic mycobacteria analyzed in this study (including M. abscessus and the opportunists of MAC). Although the ability of MIP to degrade different compounds and utilize diverse sources of carbon is perspicuous, the presence of an intact hydrogenase enzyme complex (Table 5) provides evidence for its chemolithotrophic nature even though further research is required to establish the functionality of this complex. The loss of hydrogenases concurs well with the advent of pathogenicity in mycobacteria across different lineages, an observation corroborated by previous studies on mycobacterial hydrogenases (69). Loss of accessory protein coding genes, which are required for the maturation and assembly of the hydrogenase complex as well as integration of different metal ions, renders this complex non-functional in immediate descendants of MIP (i.e. MAC species).

Table 5. List of MIP ORFs encoding hydrogenase gene cluster

In addition to these unique metabolic characteristics of MIP, fundamental differences were observed in the organization of lipid metabolic machinery, which is cardinal to the physiology and behavior of mycobacterial species (70). Although the genetic machinery required for synthesis and modification of mycolic acids is present in MIP (70), a major reshuffle is observed in methoxy mycolic acid synthase gene operon.
Sequence analysis further suggested the absence of papA5, a gene encoding polyketide-associated protein (Pap) required for the synthesis of virulence associated phthiocerol dimycocerosate (PDIM) (71). These traits are in agreement with the observation that members of MAC do not synthesize PDIMs. Further analysis demonstrated the presence of a glycopeptidolipid (GPL) biosynthesis locus, which is a hallmark of antigenic diversity in MAC and appears to be laterally acquired in MIP. This gene cluster in MIP harbors an ORF (MIP4595) sharing significant similarity with the ‘gsc’ gene of MAP. This gene constitutes a pathogenicity island in pathogenic mycobacteria including M. tuberculosis (72). However, a comparative analysis of this locus with other MAC sequevars revealed the interruption of this locus by a six-gene cluster exclusive to MIP with four of these six genes being transposable elements. Thus, this GPL locus acts as a hotspot for transposon integration and is likely to play an important role in MIP’s unique biological attributes by influencing GPL biosynthesis. Non-pathogenic attributes of MIP and immunome analysis MIP, as discussed before, is non-infectious in mouse, guinea pig and monkey models (2,6,17,73). However, investigation of MIP against PAIDB (74), the pathogenicity island database, identified the presence of three regions in MIP with genomic attributes similar to PAGI islands of Pseudomonas aeruginosa. These included a gene cluster (MIP227–MIP247) similar to the PAGI 1 pathogenicity island of P. aeruginosa isolated from a patient with a urinary tract infection (39); a PAG3 like genomic island (MIP272–MIP283) and another region homologous to PAI (MIP333–MIP343), a region similar to the tcd (toxin complex D) island of Photorhabdus luminescens (75). 
A comparative analysis of MIP proteins with virulence factors database (VFDB), a comprehensive compilation of all known virulence factors, also revealed the presence of most of the genes in MIP that are reportedly associated with virulence in other mycobacterial species (40). Pathogenesis is a multi-factorial phenomenon that requires pathogen to attach, infect, sustain, proliferate and eventually disseminate itself inside the host. Hence, the loss of a component responsible for any of these functions is likely to result in the attenuation of virulence or pathogenicity. Thus, despite having PE-PPE genes and mce1 operon, which enable mycobacteria to invade the host cell, MIP lacks both mce2 and mce3 operons, which are essential for causing macrophage infections by M. tuberculosis and M. avium (76–78). The mce3 as well as mce2 mutants of M. tuberculosis are attenuated in mice although the latter shows no growth defect in macrophages. The mce2 mutant of M. tuberculosis elicits an altered immune response and exhibits no lung pathology along with enhanced survival in mice (76–78). Likewise, MIP lacks phospholipase (plc) ABCD genes, which are responsible for acquiring host fatty acids for their use as a potential carbon source during persistent infections both in tuberculous and non-tuberculous mycobacterial infections (79). Another factor crucial for mycobacterial pathogenicity is associated with the presence of latency-related genes that confer on mycobacteria the ability to survive and grow in microaerophilic environment for prolonged period of time. The devS/devR two-component system, essential for maintenance of dormant state in low oxygen conditions, is conspicuous by its absence in MIP (79). In addition to RD1 locus and toxin–antitoxin system, in silico studies further identified MIP as a natural mutant of anthranilate phosphoribosyltransferase gene trpD, which is involved in tryptophan biosynthesis (81,82). 
The absence of these critical determinants may severely compromise MIP’s ability to survive inside the host as the infection with MIP has been found to be self-limiting and clears off within 6–7 weeks (17). The limited survival of MIP in low oxygen inside macrophages despite the absence of devS/devR two-component system can be attributed to the prevalence of ‘Hr’ proteins. In silico analysis revealed a much higher fraction of putative antigenic proteins in MIP in comparison with BCG (Figure 10), and a majority among them being contributed by lateral acquisitions emphasizing the importance of LGT events in augmenting its immune potential. Besides, the significant sequence heterogeneity observed between MIP and M. tuberculosis proteins (as mentioned earlier) would render MIP proteins acquiescent to generate novel T-cell epitopes resulting in an enhanced immune response. Our analysis revealed that of the 36 proteins shared by MIP and M. leprae, which were absent in M. bovis BCG, 29 were highly immunogenic in nature (Table 6). The most prominent putative antigenic proteins were MIP0340 and MIP5962, both belonging to Hsp20 family and share a close similarity with the 18 kDa small heat shock protein of M. leprae (83). This protein bears several T-cell epitopes and generates CD4+ T-cell mediated immune response, a hallmark of protection against tuberculosis. Similarly, MIP7697 is a homolog of M. leprae protein MLep2649 that encodes a protein with excellent T-cell stimulating properties, which responds to more than 60% of tuberculosis patients (84). The presence of such immunodominant and productive antigens in MIP may potentiate the expression of an antigenic profile better than BCG against M. tuberculosis infection. Figure 10. Comparative analysis of immunomes of MIP and BCG and contribution of LGT events. In silico immunome analysis of MIP and its comparison with BCG revealed the presence of a greater number of antigenic proteins in MIP (41). 
This may contribute to the unique potential of MIP for immunomodulation against various types of infections. Notably, a significant proportion of these immunogenic proteins appear to be laterally acquired in MIP. Table 6. List of MIP ORFs shared between MIP and M. leprae and absent from BCG (a) As predicted by in silico analysis of MIP proteins by VAXIJEN software at default parameters (41). In summary, different analyses performed in this study establish that MIP represents an organism at a unique phylogenetic point as the immediate predecessor of opportunistic mycobacterial species of MAC. It is also evident that natural selection in MAC has acted in a preferential manner on specific categories of genes, leading to reduced habitat diversity of pathogenic bacteria and thus facilitating host tropism. The genome of MIP is ∼5.6 Mb in size and is shaped by a large number of lateral gene acquisitions, thus revealing, for the first time, the mosaic architecture of a mycobacterial genome. This study therefore offers a paradigm shift in our understanding of evolutionary divergence, habitat diversification and advent of pathogenic attributes in mycobacteria. A scenario for mycobacterial evolution is envisaged wherein the earliest evolving soil derived mycobacterial species like MIP underwent massive gene acquisitions to attain a unique soil–water interface habitat before adapting to an aquatic and parasitic lifestyle. These lateral acquisition events were selective and possibly facilitated by the presence of specific genetic factors (i.e. ComEC) that induce competence to acquire large chunks of DNA to confer immediate survival advantage to the recipient organism. The genes, such as members of the ‘Hr’ family, acquired to help mycobacteria survive in fluctuating oxygen levels, would have been instrumental in the initial advent of pathogenicity in the aquatic opportunistic mycobacterial species. 
Subsequently, mycobacterial species tuned their genetic repertoires to respective host adapted forms with a high degree of genomic fluidity aided by selective lateral gene acquisitions and gene loss by deletion or pseudogenization (19). Importantly, the significant increase in transposon elements in the pathogenic mycobacteria as compared with MIP suggests, for the first time, a possible role for these elements in mycobacterial virulence, which would be interesting to explore. In addition, comparative genomic analysis revealed a higher antigenic potential of MIP, which may underlie its unique ability for immunomodulation against various types of infections, and presents a template for developing reverse genetics-based approaches to design better strategies against mycobacterial infections.
Title	RESULTS AND DISCUSSION
Section	Genome sequencing and general features of MIP genome Sequencing of MIP (DSM 45239T) genome was carried out by whole genome shotgun (WGS) approach. A total of 109 792 paired end reads, representing more than 10× coverage of MIP genome, were generated from randomly picked shotgun clones from both ∼2 and ∼5 kb shotgun libraries followed by gap filling and sequence improvement. Sequence assembly with PHRAP resulted in the assembly of 93 592 shotgun sequences leading to a single circular MIP chromosome of 5 589 007 bp (Figure 1). This was subsequently validated by a BAC end sequence based physical map of MIP genome. Mycobacterial genomes range from 3.5 to 7 Mb and MIP, with a size of ∼5.6 Mb, represents a moderate genome size, which is larger than that of all known members of MAC. The genome contains 5270 predicted ORFs (at a density of ∼1 gene/kb), a single rRNA operon and 45 tRNA genes; these ORFs account for ∼91% of the genome (Table 2). The mean G + C content of MIP genome is 68%. However, the cumulative nucleotide skew analysis revealed several regions with a G + C content clearly divergent from this mean value, which cover a considerable portion of the MIP genome and constitute potential sites to investigate for laterally acquired genes (Figure 1). The putative ‘ori’ in MIP genome was identified by a relatively AT rich region with characteristic DnaA boxes and a typical gene order of ‘rpnP-dnaA-dnaN’. ‘ATG’ was found to be the most frequent start codon (56.5%) followed by ‘GTG’ (37.5%) and ‘TTG’ (5.9%). Like M. tuberculosis, MIP has an even distribution of ORFs on both strands with respect to the direction of replication (2656 ORFs on the leading strand and 2614 on the lagging strand) (4). PSORTB analysis indicated that 55.5% of MIP proteins are cytoplasmic in nature, 13.5% are localized in the cytoplasmic membrane and only 3.5% are extra-cellular in nature (26). However, the precise localization of 27.5% of the proteins could not be ascertained. Figure 1. 
Circular representation of MIP genome. Whole genome sequencing of MIP revealed that it harbors a single circular chromosome of 5 589 007 bp. The accuracy of genome data assembly is supported by a BAC end sequence based physical map of MIP genome. The size of MIP genome is much larger than the genome of any member of M. avium complex and thus is in agreement with the progenitor status of MIP (19). The red and blue tracks represent ORFs predicted in the sense and anti-sense orientation in relation to the ori (origin of replication). The innermost track represents the GC skew wherein sharp peaks of violet and yellow represent regions of AT and GC richness, respectively, and constitute potential targets for lateral gene analysis. Table 2. General genomic features of MIP BLAST-based comparative analysis of MIP ORFs (at a cutoff value of ≥70% amino acid identity) revealed their maximum similarity with MAC organisms, which are evolutionarily close to MIP (Supplementary Figure S1). This is followed by M. marinum with which MIP shares over 51% of its coding sequences (CDS) (Supplementary Table S1). This observation is consistent with the status of MIP as the progenitor of MAC and supports the idea of a shared aquatic past between saprophytic and pathogenic mycobacteria (19,45). With M. tuberculosis, MIP shares only ∼40% of its proteins. However, the proportion of MIP ORFs (∼68%) shared with closely related MAC species differs strikingly from that seen in other related mycobacteria, which usually share over 90% of coding sequences even at identity >95% (22). This divergence could be a critical component for the elicitation of a robust yet unique immune response upon vaccination with MIP.
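The scan for regions whose G + C content departs from the genome mean, described above for Figure 1, can be illustrated with a minimal sketch. It is not the procedure used in this study: the window size, step and deviation threshold are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """GC skew, (G - C) / (G + C); 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def divergent_windows(genome, window=1000, step=500, max_dev=0.05):
    """Return (start, gc) for each window whose GC fraction differs from
    the genome-wide mean by more than max_dev.

    window, step and max_dev are illustrative assumptions, not values
    reported in the study.
    """
    mean_gc = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - mean_gc) > max_dev:
            hits.append((start, gc))
    return hits
```

Windows returned by `divergent_windows` would only be candidates for closer inspection; an atypical composition alone does not establish lateral acquisition.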
Title	Genome sequencing and general features of MIP genome
Figure caption	Figure 1. Circular representation of MIP genome. Whole genome sequencing of MIP revealed that it harbors a single circular chromosome of 5 589 007 bp. The accuracy of genome data assembly is supported by a BAC end sequence based physical map of MIP genome. The size of MIP genome is much larger than the genome of any member of M. avium complex and thus is in agreement with the progenitor status of MIP (19). The red and blue tracks represent ORFs predicted in the sense and anti-sense orientation in relation to the ori (origin of replication). The innermost track represents the GC skew wherein sharp peaks of violet and yellow represent regions of AT and GC richness, respectively, and constitute potential targets for lateral gene analysis.
Table caption	Table 2. General genomic features of MIP
Section	Functional classification of MIP proteins To facilitate functional studies, MIP proteins were subjected to BLAST analysis against the COG database, which serves as a platform for functional annotation of newly sequenced genomes and for studies on genome evolution (27). On the basis of similarity with COG proteins, it was possible to assign functions to ∼80% of MIP proteins but ∼20% of the proteins still remain un-annotated. More significantly, ∼7.5% of proteins are unique to MIP and show no significant homology with other proteins present in mycobacterial proteomes. Several of these candidate orthologs are present in gene clusters, which are absent from most of the other mycobacteria, thus indicating the modular nature of gene acquisitions or deletions in mycobacteria. Our analysis shows that 41.5% of MIP proteins belong to the ‘Metabolism’ category, 11.5% to ‘ISP’ (information storage and processing), and 9.5% to ‘CPS’ (cellular processes and signaling) whereas 16.7% are ‘poorly characterized’ proteins (Figure 2). Within the ‘Metabolism’ category, the genes pertaining to lipid transport and metabolism (I) were over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, the majority of the proteins were related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%) and translation, ribosomal structure and biogenesis (J) (24.5%). In the case of ‘CPS’, major representation comes from cell wall/membrane/envelope biogenesis (M) (27.6%) followed by posttranslational modifications (O) and signal transduction mechanisms (T) at 23 and 21.4%, respectively (Figure 2). Figure 2. Functional classification of MIP proteins. (A) Representation of MIP proteome based on the similarity of its proteins with COG database (27). 
(B) represents the distribution in the cell processing and signaling (CPS) category, (C) denotes the distribution of poorly characterized proteins in MIP, while (D) and (E) stand for information storage and processing (ISP) and ‘metabolism’ related genes, respectively. It is evident that ∼42% of total MIP genes are involved in basic metabolic functions and ∼21% do not have any homology in the COG database. Within the ‘metabolism’ category, the genes involved in lipid transport and metabolism (I) are over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, the majority of the proteins are related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%).
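As an illustration of how class shares such as those summarized for Figure 2 can be tallied, the sketch below groups one-letter COG assignments into the four broad classes named in the text. The grouping of letters follows the standard COG scheme; the input format (a mapping from ORF identifier to a single COG letter, with None for ORFs lacking a COG hit) and the toy identifiers are assumptions made for this example.

```python
from collections import Counter

# Standard COG one-letter categories grouped into broad classes.
COG_CLASSES = {
    "ISP": set("JAKLB"),            # information storage and processing
    "CPS": set("DYVTMNZWUO"),       # cellular processes and signaling
    "Metabolism": set("CGEFHIPQ"),
    "Poorly characterized": set("RS"),
}


def class_shares(assignments):
    """Return the percentage of ORFs falling in each broad COG class.

    assignments: dict mapping an ORF identifier to one COG letter,
    or to None when no COG hit was found (assumed input format).
    """
    total = len(assignments)
    counts = Counter()
    for letter in assignments.values():
        label = "No COG hit"
        for name, letters in COG_CLASSES.items():
            if letter in letters:
                label = name
                break
        counts[label] += 1
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy input with hypothetical ORF names, not data from the study.
toy = {"orf1": "I", "orf2": "Q", "orf3": "K", "orf4": None}
shares = class_shares(toy)
```

As a consistency check on the reported figures, the four class shares given in the text (41.5, 11.5, 9.5 and 16.7%) add up to 79.2%, in line with the statement that functions could be assigned to ∼80% of MIP proteins.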
Title	Functional classification of MIP proteins
Figure caption	Figure 2. Functional classification of MIP proteins. (A) Representation of MIP proteome based on the similarity of its proteins with COG database (27). (B) represents the distribution in the cell processing and signaling (CPS) category, (C) denotes the distribution of poorly characterized proteins in MIP, while (D) and (E) stand for information storage and processing (ISP) and ‘metabolism’ related genes, respectively. It is evident that ∼42% of total MIP genes are involved in basic metabolic functions and ∼21% do not have any homology in the COG database. Within the ‘metabolism’ category, the genes involved in lipid transport and metabolism (I) are over-represented (22.5%) closely followed by secondary metabolites biosynthesis, transport and catabolism (Q) (21.4%). In the ‘ISP’ category, the majority of the proteins are related to transcription (K) (48.5%) followed by replication, recombination and repair (L) (26%).
Section	Comparative proteome analysis of MIP with other species reveals the role of genomic fluidity in habitat diversification in Mycobacterium COG-based comparative analysis of gene distribution across mycobacterial proteomes highlights the presence of distinct genome fluidity. ‘ISP’ and ‘Metabolism’ proteins vary considerably with the maximum flexibility being observed in replication, recombination and repair (L), lipid transport and metabolism (I) and secondary metabolites biosynthesis and transport (Q), respectively (Figure 3). The minimum variations are observed in ‘CPS’ with nearly all sub-categories exhibiting a consistent representation. In ‘ISP’, the distribution of genes across all mycobacterial proteomes is almost consistent for translation, ribosomal structure and biogenesis (J) and chromatin structure and RNA processing (B), while a clear genomic fluidity is exhibited by the genes belonging to replication, recombination and repair (L). This category is least represented in MIP (3%) and maximally in M. ulcerans (10%). Similarly, the genes belonging to category K (transcription) are least represented in CDC1551 (5%) and maximally in M. smegmatis (9.4%), which is consistent with its saprophytic habitat. Figure 3. Comparative analysis of distribution of different mycobacterial proteomes under various COG functional categories. Different mycobacterial proteomes were downloaded from NCBI and subjected to COG-based BLAST analysis. The contribution of each functional category was calculated to observe the pattern of relative gene distribution across different mycobacterial species and plotted on this graph. (A) Distribution across ‘Metabolism’ category and various sub categories, (B) cell processing and signaling (CPS) and (C) information storage and processing (ISP). ‘X’ and ‘Y’ axis represent mycobacterial species and the number of mycobacterial proteins (in percentage), respectively. 
Our comparative analysis clearly highlights the presence of distinct genome fluidity in mycobacterial species across different functional groupings of genes. This genomic fluidity within different functional groups of proteins may contribute to the habitat diversification observed in mycobacterial species. In ‘Metabolism’, while the genes related to nucleotide transport and co-enzyme transport show a consistent distribution, the genes belonging to secondary metabolite biosynthesis and transport (Q), amino acid transport (E) and lipid transport and metabolism (I) show major quantitative variations. ‘I’ has the maximum representation in MAC members such as MAH (∼11%) and MAP (10%), followed by MIP (9.5%), while ‘E’ and ‘Q’ are best represented in M. smegmatis (9.5%) and MAC organisms (9–10%), respectively (Figure 3). In the case of carbohydrate transport and metabolism (G), all mycobacterial species have almost an equal representation except M. smegmatis, which harbors almost twice (7%) the percentage of genes dedicated to this function in other mycobacterial species. In most COG categories, M. leprae seems to have a distinctly biased distribution of proteins, probably indicative of the extensive gene loss that the organism has undergone during evolution (21). Of all the mycobacteria, MIP has the least representation in L (3%) and E (amino acid transport) (4.3%) categories of genes. Although the distribution of genes is a species-specific attribute, variations in gene distribution across different lineages could provide an idea about the role of genomic fluidity in shaping the behavior of mycobacteria as saprophytes or host-adapted pathogens. Hence, to get a comprehensive picture of habitat transformation, mycobacterial species were classified in two groups according to their known attributes: pathogenic (PGN) comprising the M. tuberculosis complex (including M. marinum and M. ulcerans) and the M. avium complex, and saprophytic or environmental (ENV) mycobacteria comprising M. 
smegmatis, M. vanbaalenii, M. gilvum and others. MIP was placed in between saprophytic and pathogenic mycobacterial species because of its unique intermediate position and these two groups were investigated for the effect of gene variations in different COG classes with MIP as a common background (4). A two-way ANOVA analysis was performed to ascertain the statistical significance of the analysis. While the ENV–MIP transition was associated with a significant reduction restricted to a few COG classes, i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function) [P < 0.001, 0.01, 0.001, 0.01, 0.001 and 0.001, respectively], ENV–PGN transitions involved extensive gene variations (Figure 4). In addition to the gene reduction observed in the earlier mentioned classes, the ENV–PGN transition also showed a reduction in genes related to energy metabolism (C, G, I and Q [P < 0.05, 0.05, 0.001 and 0.05, respectively]) and a significant increase in L (replication, recombination and repair, P < 0.01) and N (cell motility and secretion, P < 0.01) related genes. Notably, the habitat change from MIP to PGN lineages was primarily due to the loss of genes involved in I (lipid transport and metabolism, P < 0.001) and Q (secondary metabolite biosynthesis and transport, P < 0.001) and gain of genes in L, E and S [P < 0.001, 0.001 and 0.05, respectively] (Figure 4). This observation is consistent with a reduced habitat diversity of pathogenic mycobacteria and points toward a role of genomic fluidity within selected gene functions in habitat specification. An increase in the representation of ‘L’ with the advent of pathogenicity offers an interesting paradigm, which warrants further studies in the model organisms. Figure 4. Quantitative analysis of gene variations involved in habitat transformation in mycobacteria. 
This cartoon depicts variations across major functional gene groupings as mycobacterial species adapted to a pathogenic lifestyle from free-living environmental mycobacteria. Red lines denote loss of genes while the green ones denote the gene gain with a change of habitat. Although the transition from ENV-MIP was associated with a significant reduction restricted to a few COG classes i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function), ENV–PGN transitions involved extensive gene variations and are consistent with the intermediate evolutionary position of MIP. Significantly, only two major gene categories reported gain of genes associated with the advent of pathogenicity: L (DNA replication, recombination and repair) and E (amino acid transport and metabolism). But transition from purely saprophytic lineage to pathogenic habitat is associated with genes categorized into ‘L’ only, which also contains transposon elements. Indeed, we observe that saprophytic mycobacterium like MIP has only 38 transposons-like elements compared with 302 found in similar-sized pathogenic mycobacterial species M. ulcerans. A two-way ANOVA analysis was used to ascertain statistical significance.
Title	Comparative proteome analysis of MIP with other species reveals the role of genomic fluidity in habitat diversification in Mycobacterium
Figure caption	Figure 3. Comparative analysis of distribution of different mycobacterial proteomes under various COG functional categories. Different mycobacterial proteomes were downloaded from NCBI and subjected to COG-based BLAST analysis. The contribution of each functional category was calculated to observe the pattern of relative gene distribution across different mycobacterial species and plotted on this graph. (A) Distribution across ‘Metabolism’ category and various sub categories, (B) cell processing and signaling (CPS) and (C) information storage and processing (ISP). ‘X’ and ‘Y’ axis represent mycobacterial species and the number of mycobacterial proteins (in percentage), respectively. Our comparative analysis clearly highlights the presence of distinct genome fluidity in mycobacterial species across different functional groupings of genes. This genomic fluidity within different functional groups of proteins may contribute to the habitat diversification observed in mycobacterial species.
Figure caption	Figure 4. Quantitative analysis of gene variations involved in habitat transformation in mycobacteria. This cartoon depicts variations across major functional gene groupings as mycobacterial species adapted to a pathogenic lifestyle from free-living environmental mycobacteria. Red lines denote loss of genes while the green ones denote the gene gain with a change of habitat. Although the transition from ENV-MIP was associated with a significant reduction restricted to a few COG classes i.e. K (transcription), T (signal transduction), E (amino acid transport), P (inorganic ion transport and metabolism), R (general function) and S (unknown function), ENV–PGN transitions involved extensive gene variations and are consistent with the intermediate evolutionary position of MIP. Significantly, only two major gene categories reported gain of genes associated with the advent of pathogenicity: L (DNA replication, recombination and repair) and E (amino acid transport and metabolism). But transition from purely saprophytic lineage to pathogenic habitat is associated with genes categorized into ‘L’ only, which also contains transposon elements. Indeed, we observe that saprophytic mycobacterium like MIP has only 38 transposons-like elements compared with 302 found in similar-sized pathogenic mycobacterial species M. ulcerans. A two-way ANOVA analysis was used to ascertain statistical significance.
Section	Role of natural selection in speciation in Mycobacterium Measurement of the rate of non-synonymous (leading to change in amino acid) and synonymous (silent) nucleotide substitutions in protein-coding DNA sequences is the most referred criterion for detecting natural selection in molecular evolutionary analysis (46). Significantly higher non-synonymous nucleotide substitutions (Ka) over the synonymous (Ks) ones are interpreted as an evidence of positive natural selection. Hence, to understand the contribution of selection in speciation, we have used closely related and phylogenetically independent species of M. avium complex of which MIP is a predecessor (19). Orthologs were identified using Inparanoid tool (29) and dataset of ∼2600 gene pairs representing >80% of the orthologs shared among different species of MAC was obtained to perform comparative analysis of the rate of selection for human-adapted (MIP–MAH) and animal-adapted (MIP–MAP) niches from a saprophytic MIP. The evaluation of rate of selection (Ka/Ks) revealed strong purifying selection (∼0.06) acting on both human and animal adapted lineages. However, further resolution of analysis based on protein function revealed a significant difference in the rate of selection only for the genes involved in energy production and conversion (C) (P < 0.03, unpaired t test) (Figure 5). Also, very few genes, mostly distributed in metabolic pathways, were found to have undergone strong positive selection (Supplementary Table S2) suggesting their relevance in undergoing niche-specific adaptations in Mycobacterium. A very strong positive selection (>50 times of average rate) was observed in ComEC (MIP2580) (47), the competence protein required for exogenous DNA uptake during natural transformation, which can critically influence the ability to acquire foreign DNA in microbial species. Incidentally, we found this gene to be pseudogenized in MAP, which usually results from an excessive positive selection. 
In a recent study (48) based on SNP analysis, it was argued that recombination may influence the rate of selection in extremely closely related species of M. tuberculosis complex (average nucleotide identity >98% across different species). Even though MIP is likely to have minimal homologous recombination events because of sequence heterogeneity with MAP and MAH, the likelihood of recombination and lateral gene transfer influencing the rate of selection cannot be completely discounted. Figure 5. Role of natural selection in speciation in MAC. Analysis of the average rate of natural selection (Ka/Ks) among MIP–MAP and MIP–MAH lineages revealed the presence of a similar purifying selection (46). This implies that both mycobacterial lineages have undergone an independent evolution into their respective host adapted forms from MIP. However, a significant skew in selection rate (Ka/Ks) is observed in genes categorized into energy production and conversion (C), and thus establishes the role of metabolism-related genes in the evolution of host tropism. Also, a strong positive selection (>50 times of average selection) was observed on the ComEC gene that encodes a competence protein required for DNA uptake and natural transformation (47). Such a strong selection on this gene indicates that ComEC has played an important role in modulating the efficiency of DNA uptake during mycobacterial evolution. In fact, in the case of MAP, this gene is found to be pseudogenized, which usually results from an excessive positive selection.
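The interpretation of Ka/Ks described in this section can be summarized in a minimal sketch. It assumes that Ka and Ks have already been estimated for each ortholog pair by a dedicated tool; the function names and the toy values are hypothetical, and the 50-fold default simply mirrors the ">50 times of average rate" criterion mentioned for ComEC.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    return ka / ks if ks > 0 else None


def selection_regime(ratio):
    """Conventional reading of a Ka/Ks ratio."""
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


def flag_outliers(pairs, fold=50.0):
    """Flag genes whose Ka/Ks exceeds `fold` times the mean ratio.

    pairs: dict mapping a gene name to a (Ka, Ks) tuple.
    Returns (mean_ratio, sorted list of flagged gene names), or
    (None, []) if no gene has a defined ratio.
    """
    ratios = {g: ka_ks_ratio(ka, ks) for g, (ka, ks) in pairs.items()}
    defined = {g: r for g, r in ratios.items() if r is not None}
    if not defined:
        return None, []
    mean = sum(defined.values()) / len(defined)
    outliers = sorted(g for g, r in defined.items() if r > fold * mean)
    return mean, outliers


# Toy data (invented): 99 genes under purifying selection and one outlier.
toy_pairs = {f"gene{i}": (0.006, 0.1) for i in range(99)}
toy_pairs["outlier"] = (0.9, 0.1)
mean_ratio, flagged = flag_outliers(toy_pairs)
```

Because the mean in this sketch includes the outlier itself, a 50-fold excess can only be reached when the gene set is large, which is the situation with the ∼2600 ortholog pairs analysed here; for small sets a median-based baseline would be a more robust choice.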
Title	Role of natural selection in speciation in Mycobacterium
Figure caption	Figure 5. Role of natural selection in speciation in MAC. Analysis of the average rate of natural selection (Ka/Ks) among MIP–MAP and MIP–MAH lineages revealed the presence of a similar purifying selection (46). This implies that both mycobacterial lineages have undergone an independent evolution into their respective host adapted forms from MIP. However, a significant skew in selection rate (Ka/Ks) is observed in genes categorized into energy production and conversion (C), and thus establishes the role of metabolism-related genes in the evolution of host tropism. Also, a strong positive selection (>50 times of average selection) was observed on the ComEC gene that encodes a competence protein required for DNA uptake and natural transformation (47). Such a strong selection on this gene indicates that ComEC has played an important role in modulating the efficiency of DNA uptake during mycobacterial evolution. In fact, in the case of MAP, this gene is found to be pseudogenized, which usually results from an excessive positive selection.
Section	Identification of laterally transferred genes reveals massive gene acquisitions and mosaic architecture of MIP genome Identification of laterally acquired genes is an important paradigm, which is cardinal to gain a deeper insight into microbial evolution (49). Hence, after analysing the role of natural selection in speciation, we were keen to analyse the contribution of lateral gene acquisitions in MIP. The precise and accurate prediction of lateral gene transfer (LGT) events in an organism is challenging. First, detection of LGT may be influenced not only by source, size and quantity of lateral transfer but also by the genetic features associated with the recipient or host genome (50). Besides, LGT takes place by a variety of means and different tools may be required for better detection of LGT based on specific mechanisms of gene transfer (51). It is also known that different surrogate methods detect lateral acquisitions of different antiquities (52). Hence, all LGT are not amenable to detection by a single parametric method and the application of a combination of different methods is recommended to improve sensitivity of detection in different possible situations (50). However, while the simple addition of predictions from individual methods may increase false-positive rates, the consideration of strictly overlapping predictions as the inclusion criteria for LGT predictions is counterproductive because of the limited overlap of genes observed between different approaches (52,53). Nonetheless, it has been argued that even if the errors inherent to these individual methods are added, the overall benefit is worthy (50). Hence, to predict laterally acquired genes in MIP, we used three different parametric methods based on anomalous GC content of each ORF, genomic signature analysis and Alien Hunter predictions to score for likely LGT candidates. The genes were scored as laterally acquired only if they were predicted more than once. 
This would not only provide sensitivity of detection but also reduce the number of false-positive predictions associated with individual methods. We further augmented our analysis by using information on phylogenetic approaches and phyletic distribution of MIP genes in other mycobacterial species as an additional stand alone criterion to score LGT genes (54). Analysis of atypical GC content of ORFs (34) identified ∼28.5% (1503/5270) of MIP genes as putative candidates for LGT. A similar analysis with M. tuberculosis (MTB), M. avium paratuberculosis (MAP) and M. avium subsp. hominissuis (MAH) could only identify 4.3, 6.5 and 11.3% genes, respectively (55). Genomic signature approach could identify ∼33% of MIP genes as candidate LGT’s as compared to MTB, MAP and MAH wherein this approach yielded only 6%, 6.3 and 9.3% genes, respectively. Alien Hunter (32) predicted 85 probable laterally acquired regions (AL) comprising of 1298 (24.63%) ORFs (Supplementary Table S3). The regions around ‘ori’ (15 kb on both sides, upstream as well as downstream) and one harboring ribosomal genes were excluded from the evaluation to remove any possible bias. By using similar criteria with Alien Hunter, however, we could identify putative laterally acquired genes in MTB (21.5%), MAP (15.35%) and MAH (24.2%). More than 42% of MIP genes predicted by Alien Hunter are also shared by genomic signature analysis. A similar analysis with MTB, MAP and MAH showed an overlap of 17% (148/865), 31% (215/678) and 33.7% (421/1247), respectively, between Alien Hunter–predicted genes and genomic signature–based predictions. After applying our ‘majority’ based inclusion criteria, ∼6.2% of the genes emerged as laterally acquired in MTB, while MAP and MAH have 8.3 and 10.2% genes as LGT, respectively. By using this approach, ∼34% of MIP genes emerged as laterally acquired, which is significantly higher than in other mycobacterial species analysed in this study. 
A comparative analysis of MIP ORFs based on their restricted distribution within the other mycobacterial genomes identified additional 939 ORFs as plausible lateral acquisitions. This included 362 ORFs harbored by 93 defined RRDs (regions of restricted distribution of genes) in MIP (Supplementary Table S4) and 261 ORFs present in alien regions. The incongruence observed in the phylogenetic analysis of some of these genes substantiated their laterally acquired nature. Overall, 50.5% (2664/5270) of MIP ORFs appear to be laterally acquired highlighting thereby the scale of evolutionary novelties undergone by this microbe (Figure 6). This study represents the first report of such massive gene acquisitions in mycobacteria and suggests mosaic architecture of MIP genome. Figure 6. Depiction of lateral gene acquisitions in MIP. Each column depicts one MIP gene and each row depicts one mycobacterial genome (total 18 genomes comprising of M. tuberculosis complex, M. avium complex and saprophytic mycobacterial species–see Materials and Methods and Supplementary Table S1 for species list); green and red denote presence and absence, respectively, of the MIP gene in other genomes. Pink denotes the regions predicted by Alien Hunter (32). Dark yellow represents RRDs, while black columns denote recently acquired genes identified by using atypical gene content and gene signatures (34, 35). Orange arrow denotes position of tRNA molecules, blue denotes genes with homologs in genomes other than mycobacteria, while brown denotes absence in the COG database. A very good overlap is observed between genes identified by using different methods. Most of the alien regions and RRD’s overlap with red, substantiating the effectiveness and accuracy of our approach. It is noteworthy that over 50% of MIP genome has emerged as laterally acquired, the highest reported so far for any of the mycobacterial species. 
The figure is scaled to approximation with each figure row denoting 1 Mb of genome and every tick mark denoting 100 kb along the lane. Analysis of laterally acquired genes by using COG functional classification revealed the maximum gain in lipid transport and metabolism category (I) followed by transcription (K)-related genes, which are usually under-represented among laterally acquired genes in prokaryotes (56). This was followed by the genes affiliated to secondary metabolites biosynthesis, transport and catabolism (Q) and energy production and conversion (C); these four categories together constitute ∼35% of the total lateral acquisitions. In addition, the LGT predictions based on atypical GC content of ORFs (34) and further validated by genomic signatures 1478 (28.05%) (Figure 7A) appear to retain their native genomic imprints, which are yet to be masked by natural selection. This points toward their relatively recent acquisition and hence, their likely source could be ascertained (33). Analysis based on genomic signatures revealed that majority of these recently acquired genes (∼85%) are most likely derived from actinobacterial species (Figure 7B) like Streptomyces (∼25%), Amycolatopsis (∼15%), Rhodococcus (7.5%) and Frankia (6.5%). These gene acquisitions might have been mediated by physical proximity and close interactions among different actinobacteria. Figure 7. Identification of recent lateral gene acquisitions in MIP and their analysis. (A) Individual gene signatures of recently acquired MIP genes along with the whole genome signature of MIP establish the alien nature of respective genes (33). These gene signatures are based on the frequency of distribution of tetranucleotide pattern across the whole genome and individual genes of MIP, which are color coded to generate a visual impression. (B) Distribution of recently acquired genes with respect to their most likely source of acquisition. 
Based on the genomic signatures, our analysis revealed that majority of these recently acquired genes (∼85%) are possibly derived from actinobacterial species.
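The 'majority' based inclusion criterion described in this section (a gene is scored as laterally acquired only if more than one of the three parametric methods predicts it) can be illustrated with a minimal sketch. The function name, the method labels and the toy gene identifiers below are hypothetical and serve only as an illustration; they do not reproduce the actual pipeline or data used in the study.

```python
from collections import Counter


def majority_lgt_calls(predictions, min_votes=2):
    """Return the genes flagged by at least `min_votes` prediction methods.

    `predictions` maps a method name to the set of gene identifiers that the
    method flagged as putatively laterally acquired.
    """
    votes = Counter()
    for genes in predictions.values():
        votes.update(set(genes))  # each method contributes one vote per gene
    return {gene for gene, n in votes.items() if n >= min_votes}


# Hypothetical toy calls from the three parametric methods.
predictions = {
    "gc_content": {"geneA", "geneB", "geneC"},
    "genomic_signature": {"geneB", "geneC", "geneD"},
    "alien_hunter": {"geneC", "geneE"},
}

lgt_genes = majority_lgt_calls(predictions)
print(sorted(lgt_genes))  # ['geneB', 'geneC']
```

Requiring two votes rather than taking the union or the strict intersection reflects the trade-off discussed above: the union inflates false positives, whereas the strict intersection discards most candidates because the methods overlap only partially.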
Title	Identification of laterally transferred genes reveals massive gene acquisitions and mosaic architecture of MIP genome
Figure caption	Figure 6. Depiction of lateral gene acquisitions in MIP. Each column depicts one MIP gene and each row depicts one mycobacterial genome (total 18 genomes comprising of M. tuberculosis complex, M. avium complex and saprophytic mycobacterial species–see Materials and Methods and Supplementary Table S1 for species list); green and red denote presence and absence, respectively, of the MIP gene in other genomes. Pink denotes the regions predicted by Alien Hunter (32). Dark yellow represents RRDs, while black columns denote recently acquired genes identified by using atypical gene content and gene signatures (34, 35). Orange arrow denotes position of tRNA molecules, blue denotes genes with homologs in genomes other than mycobacteria, while brown denotes absence in the COG database. A very good overlap is observed between genes identified by using different methods. Most of the alien regions and RRD’s overlap with red, substantiating the effectiveness and accuracy of our approach. It is noteworthy that over 50% of MIP genome has emerged as laterally acquired, the highest reported so far for any of the mycobacterial species. The figure is scaled to approximation with each figure row denoting 1 Mb of genome and every tick mark denoting 100 kb along the lane.
Figure caption	Figure 7. Identification of recent lateral gene acquisitions in MIP and their analysis. (A) Individual gene signatures of recently acquired MIP genes along with the whole genome signature of MIP establish the alien nature of respective genes (33). These gene signatures are based on the frequency of distribution of tetranucleotide pattern across the whole genome and individual genes of MIP, which are color coded to generate a visual impression. (B) Distribution of recently acquired genes with respect to their most likely source of acquisition. Based on the genomic signatures, our analysis revealed that majority of these recently acquired genes (∼85%) are possibly derived from actinobacterial species.
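As a rough illustration of how a tetranucleotide-based signature of the kind shown in Figure 7 can be computed, the sketch below counts overlapping 4-mers in a gene and in a reference sequence and compares the two frequency vectors. The Euclidean distance and all names used here are assumptions made for illustration; the published genomic signature method (33) may use a different representation and scoring scheme.

```python
import math
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequencies of overlapping tetranucleotides in `seq`.

    Windows containing characters other than A, C, G or T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def signature_distance(seq_a, seq_b):
    """Euclidean distance between two tetranucleotide frequency vectors."""
    fa, fb = tetra_freqs(seq_a), tetra_freqs(seq_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))
```

Under this sketch, a gene whose distance from the whole-genome signature is unusually large relative to other genes of the same genome would be a candidate for recent acquisition, and the donor genome with the smallest distance to that gene would be its most likely source.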
Section	Mobile elements based gene acquisition in MIP are dominated by plasmid-mediated lateral gene transfers LGT events are usually mediated by mobile genetic elements like phages, transposons and plasmids. BLAST analysis of laterally acquired genes against ACLAME (36), a database of mobile elements comprising all known phage genomes, plasmids and transposons, indicated mobile elements as likely source to 27.4% of these putative laterally acquired ORFs (<e−20). Majority of these genes exhibit similarity with plasmids and extremely small fraction with phages (2%) and IS elements (∼1.2%). The relative paucity of phage and IS elements mediated gene acquisitions and abundance of plasmid-acquired genes in case of MIP is surprising. In comparison with other mycobacterial species of similar sized genomes such as M. ulcerans (chromosome size ∼5.6 Mb), which has 302 IS elements/transposons (23), MIP has merely 38 genes harboring sequences consistent with IS signatures. This is consistent with our earlier analysis based on genomic fluidity across different mycobacterial lineages where variation in the number of transposable elements, which are classified in category L, was found to be associated with habitat diversification. It is tempting to envisage that MIP may harbor specific genomic determinants that either provide immunity from phages and transposons, or else predispose MIP toward plasmid-based gene acquisitions. MIP has a relatively higher number of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat) elements compared with other mycobacterial genomes (7 as compared with 1–2 in other mycobacterial genomes) (37). These CRISPR molecules not only provide immunity against invasion by phages and viruses (57) but also limit the mobility of IS elements in genome and help in their excision from genome (58). 
MIP also lacks the RD1 region, the loss of which facilitates efficient conjugation with plasmids and other chromosomes to promote rapid acquisition of genes (59). In addition, we found that MIP is particularly enriched in transporters of the septal DNA translocator family (6 as against 1–2 present in other mycobacteria) (Table 3) that are known to bring about rapid acquisition of genes by mediating cell-to-cell DNA transfer during plasmid conjugation (60). The abundance of these genomic determinants coupled with the absence of the RD1 locus may contribute to the propensity of MIP towards plasmid-mediated gene acquisitions. Table 3. Comparative transporter analysis of MIP with other mycobacterial species MSMEG, M. smegmatis; MAP, M. avium subsp. paratuberculosis; MTB, M. tuberculosis.
Title	Mobile elements based gene acquisition in MIP are dominated by plasmid-mediated lateral gene transfers
Table caption	Table 3. Comparative transporter analysis of MIP with other mycobacterial species MSMEG, M. smegmatis; MAP, M. avium subsp. paratuberculosis; MTB, M. tuberculosis.
Section	Effect of lateral gene acquisitions on different gene families in MIP Two distinct observations have emerged from this analysis: (i) a large number of lateral gene acquisitions in MIP have been mediated through mobile elements with only a small contribution through phages and (ii) gene distribution among laterally acquired regions in MIP follows a skewed pattern with respect to function as indicated by over-representation of genes belonging to certain categories. To know the influence of LGT events on distribution of genes across MIP gene families, we performed a comprehensive analysis that showed that CYP450 is the largest gene family in MIP with 66 members and a gene density of ∼12/Mb. This is remarkably high in comparison with other mycobacterial species such as M. tuberculosis (4.5/Mb), M. smegmatis (5.6/Mb) and M. marinum (7.1/Mb). The analysis of Cytochrome P450 database (http://drnelson.uthsc.edu/CytochromeP450.html (15 August 2012, date last accessed)) revealed that MIP harbors the highest number of genes from CYP450 family among prokaryotes sequenced so far and ∼ 46% (30/66) of these genes are laterally acquired. Approximately 27% of these genes are recent acquisitions, suggesting a recent expansion of CYP450 family. Approximately 11% of CYP450 genes were identified as unique by International CYP450 nomenclature commission and have been classified into three new families and two new sub-families of CYP450 (Table 4). The context-based analysis based on gene neighborhood suggested the role of these genes in the utilization of unusual carbon sources, a key to adaptability and survival of MIP at its most likely habitat at soil–water interface (19). Table 4. List of CYP450 ORFs unique to MIP aNomenclature as per International Committee of CYP450 nomenclature. ‘A1’ refers to the first member of a new CYP450 family, whereas ‘B1’ refers to the first member of a new CYP450 sub-family. 
MIP has 66 genes of PE–PPE family (16 PE and 50 PPE), which encompass complete repertoire of PPE genes present in various MAC species. These PPE genes appear to be selectively inherited by various species of MAC as evident from their distribution in MAP (36 genes), MAH (38 genes), M. avium subsp. avium (36 genes) and M. intracellulare (38 genes), respectively (61). This observation substantiates that evolutionarily MIP is a predecessor of MAC and endorses our earlier findings based on the rate of natural selection that speciation and habitat diversification has taken place independently from MIP. Comparative analysis of PE–PPE genes in MIP (with Nr database at NCBI) highlighted that five genes of PPE–SVP sub-family are unique to MIP. In addition, several of its PE genes are evolutionarily closer to those belonging to M. tuberculosis complex, which is in agreement with the shared evolutionary history of MIP and predecessors of M. tuberculosis complex (19). The presence of PPE gene family is unique to mycobacteria among prokaryotes; however, its origin remains unknown. A large number of these genes are laterally acquired in MIP prompting us to speculate that PE–PPE genes might have been introduced into mycobacteria through mobile elements. Majority of PE–PPE gene clusters in MIP harbor genes related to mobile function activity such as phages, tRNA or 13e12 repeats in their vicinity. Besides, several of these PE–PPE genes exhibited the presence of Ig-like motifs often present in the proteins of tailed double stranded DNA bacteriophage particles. However, the most clinching evidence about the origin of these PE–PPE genes in mycobacteria emerged from the presence of a PPE protein containing intact prophage of ∼40 kb that we observed in MAH genome during ACLAME analysis (36), a direct evidence for a phage-mediated acquisition of PE–PPE family members.
Title	Effect of lateral gene acquisitions on different gene families in MIP
Table caption	Table 4. List of CYP450 ORFs unique to MIP aNomenclature as per International Committee of CYP450 nomenclature. ‘A1’ refers to the first member of a new CYP450 family, whereas ‘B1’ refers to the first member of a new CYP450 sub-family.
Section	Hemerythrins: a versatile gene family laterally acquired in MIP The most surprising finding of MIP gene analysis, however, is the unusual presence of ORFs belonging to the hemerythrin (Hr) protein family, the oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids. MIP has 10 ORFs having significant similarity to hemerythrin (Hr) genes (Figure 8) in comparison with one or two copies in most prokaryotes (62). The prevalence of ‘Hr’ proteins strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at oxic–anoxic interface (OAI) and in the anoxic regions of the marine habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with the abundance of hemerythrins (62). Figure 8. Distribution of hemerythrins in MIP. Domain mapping and BLAST searches indicated the presence of 10 ORFs belonging to hemerythrin genes in MIP. Hemerythrin proteins (Hr) are oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids and are usually present in only 1 or 2 copies in most prokaryotes (62). The abundance of these genes strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at oxic–anoxic interface and in the anoxic regions of the habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with the abundance of hemerythrins (62,63). The ability of hemerythrins to reversibly bind to oxygen at higher oxygen concentrations and release it in anoxic conditions could also provide an explanation to intriguing behavior of MIP, which, notwithstanding its aerobic life-style, has been shown to grow at 0% oxygen, to reach a plateau in 3 days and die thereafter (64). MIP2918 and MIP2747 did not harbor a definite ‘Hr’ domain and are identified by BLAST searches against NCBI ‘Nr’ database. 
Domain mapping and comparative genomics with available mycobacterial genomes indicated the presence of two Hr domains in several MIP ORFs such as MIP2750, MIP2759, MIP5034 and MIP6380, a trait usually restricted only to proteobacteria (62). We could also identify putative homologs of ‘hr’ genes in mycobacteria such as M. marinum and M. ulcerans (1 each), M. tuberculosis (3), M. gilvum and M. smegmatis (4 each), M. avium complex (5), other environmental mycobacteria like Mycobacterium sp. KMS, Mycobacterium sp. JLS and M. vanbaalenii (3 each) and none in the case of M. leprae. Considering the variations in the number of ‘hr’ genes in various species of mycobacteria and their significant sequence heterogeneity, it appears that acquisition of hemerythrins could have been a selective and independent event facilitating mycobacterial evolution. Incidentally, 90% of these ORFs in MIP are laterally acquired with over one-half of them being recent acquisitions. It should be noted that the efficiency of hemerythrins as oxygen storage proteins is directly dependent on oxygen concentration in the surrounding environment (63). The selective enrichment of MIP with Hr proteins and the ability of hemerythrins to reversibly bind to oxygen at higher oxygen concentrations and release it in anoxic conditions could provide an explanation to intriguing behavior of MIP, which, notwithstanding its aerobic life-style, can manage to grow at 0% oxygen, reach a plateau in 3 days and die thereafter (64).
Title	Hemerythrins: a versatile gene family laterally acquired in MIP
Figure caption	Figure 8. Distribution of hemerythrins in MIP. Domain mapping and BLAST searches indicated the presence of 10 ORFs belonging to hemerythrin genes in MIP. Hemerythrin proteins (Hr) are oxygen-carrying non-heme diiron binding proteins, which are usually present in lower invertebrates and annelids and are usually present in only 1 or 2 copies in most prokaryotes (62). The abundance of these genes strongly suggests the preference of MIP to inhabit water-columns or sediments, where they reside predominantly at oxic–anoxic interface and in the anoxic regions of the habitat or both as reported in the case of magnetotactic bacteria, which are also endowed with the abundance of hemerythrins (62,63). The ability of hemerythrins to reversibly bind to oxygen at higher oxygen concentrations and release it in anoxic conditions could also provide an explanation to intriguing behavior of MIP, which, notwithstanding its aerobic life-style, has been shown to grow at 0% oxygen, to reach a plateau in 3 days and die thereafter (64). MIP2918 and MIP2747 did not harbor a definite ‘Hr’ domain and are identified by BLAST searches against NCBI ‘Nr’ database.
Section	Membrane transporters in MIP Transport systems play a critical role in the life-endowing processes such as metabolism, metal homeostasis and secondary metabolite production, affecting thereby the physiology and lifestyle of the organism. In MIP, a total of 222 genes were annotated as membrane transporters comprising ∼4.2% of the total gene content and a transporter density of 39.73/Mb (Table 3). This is an apparent reflection on the unique evolutionary position of MIP as its transporter density is significantly lower than that of saprophytic M. smegmatis (60.43/Mb) and higher than M. tuberculosis (33.64/Mb), MAP (35.2/Mb) and M. leprae (17.2/Mb). It is likely that M. smegmatis and MIP, being saprophytic in nature need extensive transport machinery to support their life style, whereas intracellular organisms owing to their relatively stable environment have a reduced transporter requirement. Comparative analysis revealed a selective abundance of transporters belonging to septal DNA Translocator family (S-DNA-T) with distinct homology with FtsK/SpoIIIE proteins, which could primarily be responsible for high propensity of MIP toward plasmid conjugation and gene acquisitions (60). Another unique seven gene cluster of CPA3 (Proton Antiporter-3) family present in RRD79 (Supplementary Table S4), the largest region with restricted distribution, is responsible for species defining ability of MIP to grow on 5% NaCl (4). The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila (Figure 9) that can grow in high salt concentration (65). Along with Mn2+ transporters, MIP also possesses an abundance of catalases (4) and superoxide dismutases (5), some of them being laterally acquired, which mitigate oxidative stress and reflect not only upon the primitive origin of MIP but also equip it for intracellular adaptations. MIP can withstand the carbon starvation as evident from our experiments on nutritional stress. 
After 5 days of growth in PBS without any media or nutritional supplement, MIP exhibited no significant reduction in log CFU, reflecting its potential to endure longer periods of starvation. Thus, MIP appears to have fine-tuned its specific transport abilities by lateral gene acquisitions to gain physiological attributes required for its unique habitat. Figure 9. Phylogenetic analysis of CPA3 family cluster. This cluster is unique to MIP among mycobacterial species and each ORF of this complex encodes a different subunit of a unique Na+/H+ antiporter. All genes have been laterally acquired as a unit; hence, a single representative gene is used to perform phylogenetic analysis by using the maximum likelihood method available on the Phylogeny.fr server (43). The numbers along the branches denote bootstrap values. The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila that can grow under salt conditions (3% NaCl w/v) (65).
Title	Membrane transporters in MIP
Figure caption	Figure 9. Phylogenetic analysis of CPA3 family cluster. This cluster is unique to MIP among mycobacterial species and each ORF of this complex encodes a different subunit of a unique Na+/H+ antiporter. All genes have been laterally acquired as a unit; hence, a single representative gene is used to perform phylogenetic analysis by using the maximum likelihood method available on the Phylogeny.fr server (43). The numbers along the branches denote bootstrap values. The phylogenetic analysis established the proximity of these genes to Catenulispora acidiphila that can grow under salt conditions (3% NaCl w/v) (65).
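The gene densities quoted for MIP can be cross-checked against one another with simple arithmetic. The sketch below uses only figures stated in the text (222 transporters at 39.73/Mb, 5270 ORFs, 66 CYP450 genes); the function names are hypothetical, and the genome size is back-calculated from the transporter density rather than taken from the genome record.

```python
def density_per_mb(n_genes, genome_mb):
    """Number of genes per megabase of genome."""
    return n_genes / genome_mb


def implied_genome_mb(n_genes, density):
    """Genome size (Mb) implied by a gene count and a per-Mb density."""
    return n_genes / density


# Genome size implied by 222 transporters at 39.73 transporters/Mb.
genome_mb = implied_genome_mb(222, 39.73)
print(round(genome_mb, 2))  # 5.59

# Transporters as a percentage of the 5270 annotated ORFs.
print(round(100 * 222 / 5270, 1))  # 4.2

# CYP450 density for 66 family members on the same genome.
print(round(density_per_mb(66, genome_mb), 1))  # 11.8
```

The implied genome size of about 5.59 Mb agrees with the ∼5.6 Mb reported for MIP, and the resulting CYP450 density of about 11.8/Mb matches the ∼12/Mb quoted for that family.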
Section	Genome-enabled ecophysiological and metabolic attributes of MIP and influence of LGT events The presence of a very high number of alternate sigma factors (24 in comparison with 13 in M. tuberculosis and MAP) endows MIP with a complex transcriptional flexibility necessary for it to respond to its unique life style (66). The interface life style (soil/water) of MIP, as substantiated by its genetic features, prompted us to look for the bio-degradative capabilities of MIP. In addition to the abundance of CYP450 genes, MIP has also laterally acquired homologs of 3-octaprenyl-4-hydroxybenzoate carboxy-lyase that are involved in anaerobic metabolism of phenol during degradation of plant substrates (67). Besides, as enlisted in Supplementary Table S5, it also possesses complete cyanide and thiocyanate biodegradation machinery including the complete enzyme complex (MIP3820-22) of thiocyanate hydrolase (alpha, beta and gamma subunits). This complex degrades thiocyanate and produces CO2 and NH3 that can be used by MIP as a nitrogen source (68). Notably, thiocyanate gene cluster is absent from all the pathogenic mycobacteria analyzed in this study (including M. abscessus and the opportunists of MAC). Although the ability of MIP to degrade different compounds and utilize diverse sources of carbon is perspicuous, the presence of an intact hydrogenases enzyme complex (Table 5) provides evidence for its chemolithotrophic nature even though further research is required to establish the functionality of this complex. The loss of hydrogenases concurs well with the advent of pathogenicity in mycobacteria across different lineages, an observation corroborated by previous studies on mycobacterial hydrogenases (69). Loss of accessory protein coding genes which are required for the maturation and assembly of the hydrogenase complex as well as integration of different metal ions renders this complex non-functional in immediate descendents of MIP (i.e. MAC species). Table 5. 
List of MIP ORFs encoding hydrogenase gene cluster In addition to these unique metabolic characteristics of MIP, fundamental differences were observed in the organization of lipid metabolic machinery, which is cardinal to the physiology and behavior of mycobacterial species (70). Although the genetic machinery required for synthesis and modification of mycolic acids is present in MIP (70), a major reshuffle is observed in the methoxy mycolic acid synthase gene operon. Sequence analysis further suggested the absence of papA5, a gene encoding polyketide-associated protein (Pap) required for the synthesis of virulence-associated phthiocerol dimycocerosate (PDIM) (71). These traits are in agreement with the observation that members of MAC do not synthesize PDIMs. Further analysis demonstrated the presence of a glycopeptidolipid (GPL) biosynthesis locus, which is a hallmark of antigenic diversity in MAC and appears to be laterally acquired in MIP. This gene cluster in MIP harbors an ORF (MIP4595) sharing significant similarity with the ‘gsc’ gene of MAP. This gene constitutes a pathogenic island in pathogenic mycobacteria including M. tuberculosis (72). However, a comparative analysis of this locus with other MAC sequevars revealed the interruption of this locus by a six-gene cluster exclusive to MIP, with four of these six genes being transposable elements. Thus, this GPL locus acts as a hotspot for transposon integration and is likely to play an important role in MIP’s unique biological attributes by influencing GPL biosynthesis.
Title	Genome-enabled ecophysiological and metabolic attributes of MIP and influence of LGT events
Table caption	Table 5. List of MIP ORFs encoding hydrogenase gene cluster
Section	Non-pathogenic attributes of MIP and immunome analysis MIP as discussed before is non-infectious in mouse, guinea pig and monkey models (2,6,17,73). However, investigation of MIP against PAIDB (74), the pathogenic islands database identified the presence of three regions in MIP with genomic attributes similar to PAGI islands of Pseudomonas aeruginosa. These included a gene cluster (MIP227–MIP247) similar to PAGI 1 pathogenic island of P. aeruginosa isolated from a patient with a urinary tract infection (39); a PAG3 like genomic island (MIP272–MIP283) and another region homologous to PAI (MIP333–MIP343), a region similar to tcd (toxin complex D) island of Photorhabdus luminescens (75). A comparative analysis of MIP proteins with virulence factors database (VFDB), a comprehensive compilation of all known virulence factors, also revealed the presence of most of the genes in MIP that are reportedly associated with virulence in other mycobacterial species (40). Pathogenesis is a multi-factorial phenomenon that requires pathogen to attach, infect, sustain, proliferate and eventually disseminate itself inside the host. Hence, the loss of a component responsible for any of these functions is likely to result in the attenuation of virulence or pathogenicity. Thus, despite having PE-PPE genes and mce1 operon, which enable mycobacteria to invade the host cell, MIP lacks both mce2 and mce3 operons, which are essential for causing macrophage infections by M. tuberculosis and M. avium (76–78). The mce3 as well as mce2 mutants of M. tuberculosis are attenuated in mice although the latter shows no growth defect in macrophages. The mce2 mutant of M. tuberculosis elicits an altered immune response and exhibits no lung pathology along with enhanced survival in mice (76–78). 
Likewise, MIP lacks phospholipase (plc) ABCD genes, which are responsible for acquiring host fatty acids for their use as a potential carbon source during persistent infections both in tuberculous and non-tuberculous mycobacterial infections (79). Another factor crucial for mycobacterial pathogenicity is associated with the presence of latency-related genes that confer on mycobacteria the ability to survive and grow in microaerophilic environment for prolonged period of time. The devS/devR two-component system, essential for maintenance of dormant state in low oxygen conditions, is conspicuous by its absence in MIP (79). In addition to RD1 locus and toxin–antitoxin system, in silico studies further identified MIP as a natural mutant of anthranilate phosphoribosyltransferase gene trpD, which is involved in tryptophan biosynthesis (81,82). The absence of these critical determinants may severely compromise MIP’s ability to survive inside the host as the infection with MIP has been found to be self-limiting and clears off within 6–7 weeks (17). The limited survival of MIP in low oxygen inside macrophages despite the absence of devS/devR two-component system can be attributed to the prevalence of ‘Hr’ proteins. In silico analysis revealed a much higher fraction of putative antigenic proteins in MIP in comparison with BCG (Figure 10), and a majority among them being contributed by lateral acquisitions emphasizing the importance of LGT events in augmenting its immune potential. Besides, the significant sequence heterogeneity observed between MIP and M. tuberculosis proteins (as mentioned earlier) would render MIP proteins acquiescent to generate novel T-cell epitopes resulting in an enhanced immune response. Our analysis revealed that of the 36 proteins shared by MIP and M. leprae, which were absent in M. bovis BCG, 29 were highly immunogenic in nature (Table 6). 
The most prominent putative antigenic proteins were MIP0340 and MIP5962, both belonging to Hsp20 family and share a close similarity with the 18 kDa small heat shock protein of M. leprae (83). This protein bears several T-cell epitopes and generates CD4+ T-cell mediated immune response, a hallmark of protection against tuberculosis. Similarly, MIP7697 is a homolog of M. leprae protein MLep2649 that encodes a protein with excellent T-cell stimulating properties, which responds to more than 60% of tuberculosis patients (84). The presence of such immunodominant and productive antigens in MIP may potentiate the expression of an antigenic profile better than BCG against M. tuberculosis infection. Figure 10. Comparative analysis of immunomes of MIP and BCG and contribution of LGT events. In silico immunome analysis of MIP and its comparison with BCG revealed the presence of a greater number of antigenic proteins in MIP (41). This may subscribe to the unique potential of MIP for immunomodulation against various types of infections. Noteworthily, a significant proportion of these immunogenic proteins appear to be laterally acquired in MIP. Table 6. List of MIP ORFs shared between MIP and M. leprae and absent from BCG aAs predicted by in silico analysis of MIP proteins by VAXIJEN software at default parameters (41). In summary, different analyses performed in this study establish that MIP represents an organism at a unique phylogenetic point as the immediate predecessor of opportunistic mycobacterial species of MAC. It is also evident that natural selection in MAC has acted in a preferential manner on specific categories of genes leading to reduced habitat diversity of pathogenic bacteria, and thus facilitating host tropism. The genome of MIP is ∼5.6 Mb in size and is shaped by a large number of lateral gene acquisitions thus revealing, for the first time, mosaic architecture of a mycobacterial genome. 
Thus, this study offers a paradigm shift in our understanding of evolutionary divergence, habitat diversification and the advent of pathogenic attributes in mycobacteria. A scenario for mycobacterial evolution is envisaged wherein the earliest evolving soil-derived mycobacterial species like MIP underwent massive gene acquisitions to attain a unique soil–water interface habitat before adapting to an aquatic and parasitic lifestyle. These lateral acquisition events were selective and possibly facilitated by the presence of specific genetic factors (i.e. ComEC) that induce competence to acquire large chunks of DNA, conferring an immediate survival advantage on the recipient organism. Genes such as members of the ‘Hr’ family, acquired to help mycobacteria survive in fluctuating oxygen levels, would have been instrumental in the initial advent of pathogenicity in the aquatic opportunistic mycobacterial species. Subsequently, mycobacterial species tuned their genetic repertoires to their respective host-adapted forms, with a high degree of genomic fluidity aided by selective lateral gene acquisitions and gene loss by deletion or pseudogenization (19). Importantly, a significant increase in transposon elements in the pathogenic mycobacteria compared with MIP suggests, for the first time, a possible role for these elements in mycobacterial virulence that would be interesting to explore. In addition, comparative genomic analysis revealed a higher antigenic potential of MIP, contributing to its unique ability for immunomodulation against various types of infections, and presents a template for developing reverse genetics-based approaches to design better strategies against mycobacterial infections.
Title	Non-pathogenic attributes of MIP and immunome analysis
Section	ACCESSION NUMBERS MIP genome has been submitted to the genome depository at NCBI (accession no. CP002275).
Title	ACCESSION NUMBERS
Section	SUPPLEMENTARY DATA Supplementary Data are available at NAR Online: Supplementary Tables 1–5 and Supplementary Figure 1.
Title	SUPPLEMENTARY DATA
Section	FUNDING The MIP genome sequencing program was funded by the Department of Biotechnology, Government of India. V.S. acknowledges the Council of Scientific and Industrial Research (CSIR), New Delhi, for the award of a research fellowship. Akhilesh K. Tyagi, Anil Kumar Tyagi and S.E. Hasnain are thankful to the Department of Science and Technology, Government of India, for J.C. Bose National Fellowships. S.E.H. is a visiting professor at King Saud University, Riyadh, Kingdom of Saudi Arabia, and J.P.K. is a Tata Innovations Fellow. Funding for open access charge: University of Delhi, India. Conflict of interest statement. None declared.
Title	FUNDING
Section	Supplementary Material Supplementary Data
Title	Supplementary Material
Title	Supplementary Data
