PMC:5056897

Sequence Analysis of Hypothetical Proteins from Helicobacter pylori 26695 to Identify Potential Virulence Factors

Abstract

Helicobacter pylori is a Gram-negative bacterium that is responsible for gastritis in humans. Its spiral, flagellated body helps in locomotion and colonization in the host environment. It is capable of living in the highly acidic environment of the stomach with the help of acid-adaptive genes. The genome of the H. pylori 26695 strain contains 1,555 coding genes that encode 1,445 proteins. Of these, 340 proteins are characterized as hypothetical proteins (HPs). This study involves extensive analysis of the HPs using an established pipeline, comprising various bioinformatics tools and databases, to find probable functions of the HPs and to identify virulence factors. After analysis of all 340 HPs, we found that 104 HPs show characteristic similarities to proteins with known functions. On the basis of these similarities, we assigned probable functions to 104 HPs with high confidence and precision. The annotated HPs include representative members of diverse functional classes of proteins such as enzymes, transporters, binding proteins, regulatory proteins, proteins involved in cellular processes, and other proteins with miscellaneous functions. We therefore classified the 104 HPs into these functional groups. In the virulence factor analysis, we found that 11 HPs show significant virulence. The identification of virulence proteins with the help of their predicted functions may pave the way for drug target estimation and the development of effective drugs to counter the activity of those proteins.

Introduction

Helicobacter pylori is a Gram-negative bacterium that is associated with several gastric problems in humans. It is a slow-growing, microaerophilic bacterium [1]. Its spiral, flagellated body helps in locomotion and invasion of the host cells.
It belongs to the class of bacteria responsible for the most common bacterial infections in humans [2]. It is adapted to survive in the acidic gastric environment and is indigenous to the human population worldwide. It was first isolated by Marshall and Warren in 1984 [3, 4, 5]. Prolonged infection can develop into a chronic infection that causes severe gastric diseases such as duodenal ulcer, gastric ulcer, gastric lymphoma, and cancer [6, 7]. Nonchronic infection is usually asymptomatic, and no clinical disease is usually observed in the infected person. The prevalence of infection also varies with geographical conditions, age, race, and socioeconomic status [8, 9, 10]. A person infected at an early age is more prone to develop a chronic infection [11, 12, 13]. The prevalence of H. pylori infection is higher in developing countries than in developed countries, possibly because of poor hygiene practices [14]. The H. pylori genome was first sequenced in 1997 [5]. The genome of the H. pylori 26695 strain (NC_000915.1) contains 1,555 coding genes and 65 pseudogenes. The GC content of the genome is 38.9%. The coding genes encode 1,445 proteins, seven rRNAs, and 36 tRNAs. The genome contains 340 predicted gene products characterized as hypothetical proteins (HPs). In this study, we analyzed the sequences of all the HPs from H. pylori to assign probable functions. The objective is to identify putative virulence proteins in the proteome that contribute to pathogenesis. We used an established protocol [15, 16] for function prediction of the HPs that comprises leading bioinformatics tools and databases [17, 18, 19]. The analysis proceeds systematically, beginning with prediction of the physicochemical properties of the proteins using ProtParam.
Then, subcellular localization is predicted using different programs to assist the function prediction. Transmembrane helices (TMHs) in the HPs are identified using TMHMM and HMMTOP to find membrane proteins. We searched for sequences similar to the HPs using the Basic Local Alignment Search Tool (BLAST). Protein-protein interactions are helpful in assessing the function of novel proteins, so we used the Search Tool for the Retrieval of Interacting Genes (STRING) database to predict protein-protein interaction networks for the HPs. The HPs are classified using CATH, Structural Classification of Proteins (SCOP), Pfam, SVMProt, and the Protein Analysis through Evolutionary Relationships (PANTHER) database. Conserved domain discovery and motif search in the HPs are carried out using the Conserved Domain Architecture Retrieval Tool (CDART), Simple Modular Architecture Research Tool (SMART), InterProScan, and Motif. We made final predictions on the basis of a consensus approach [20, 21, 22]: the putative function predicted by four or more programs for an HP is considered the probable function of that HP with high precision and high confidence [17, 23]. In this way, we assigned putative functions to 104 of the 340 HPs with high precision. Furthermore, on the basis of their involvement in various biological processes and their predicted molecular functions, we classified the proteins into diverse functional groups such as enzymes, binding proteins, transporters, proteins involved in cellular processes, and proteins exhibiting miscellaneous functions.

Methods

Data abstraction

In this study, the primary source of genome data is the National Center for Biotechnology Information (NCBI) genome database. We extracted preliminary information using the search string "Helicobacter pylori", which redirects to the genome-wide project report of the H. pylori genome. We selected the H. pylori 26695 strain from the database with the RefSeq accession code NC_000915.1. The genome contains 1,555 genes coding for 1,445 proteins. We then extracted the hypothetical proteins from the pool of 1,445 proteins. We used UniProt to retrieve the UniProt IDs and FASTA sequences of the HPs from their Protein Product IDs (e.g., NP_206816.1). The FASTA sequences retrieved from UniProt were used for all further analysis.

Physicochemical parameterization

Physicochemical properties help in deducing the biochemical characteristics of proteins and in their functional characterization. We used ExPASy's ProtParam [24] server to estimate the physicochemical parameters of the HPs. The ProtParam server predicts an array of physicochemical properties using predefined formulas and experimental inferences. We predicted the relative molecular weight, theoretical pI, extinction coefficient, instability index, aliphatic index, and grand average of hydropathicity of the HPs using ProtParam. All these properties help in identifying the probable function of the proteins. Data for physicochemical parameterization are listed in Supplementary Table 1.

Sub-cellular localization

The function of a protein is strongly influenced by its location in the cellular space. For instance, proteins of the exoproteomic pool and secretory proteins often play an essential role in virulence-related activities such as adherence to host cells. We used PSORTb [25], PSLpred [26], and CELLO [27] to predict the location of the HPs in the cell. These predictors use experimental data from known proteins to make predictions for query proteins from their FASTA sequences. They predict the likely occurrence of a protein in diverse cellular or extracellular localities such as the cytoplasm, periplasm, inner membrane, outer membrane, or extracellular space.
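As an illustration of how the outputs of the three localization predictors might be reconciled into a single call, the following minimal sketch applies a simple majority vote. The function name, the label strings, and the rule for handling disagreement are assumptions made for illustration; they are not taken from PSORTb, PSLpred, or CELLO, and the per-tool predictions are assumed to have been collected separately from each server.

```python
from collections import Counter

def consensus_location(predictions, min_votes=2):
    """Return the location named by at least `min_votes` predictors, else "unknown".

    `predictions` maps a predictor name to its predicted location label.
    Labels are compared case-insensitively after trimming whitespace.
    """
    votes = Counter(label.strip().lower() for label in predictions.values())
    if not votes:
        return "unknown"
    location, count = votes.most_common(1)[0]
    return location if count >= min_votes else "unknown"

# Hypothetical predictions for a single protein (illustrative values only).
example = {"PSORTb": "Outer membrane", "PSLpred": "outer membrane", "CELLO": "Periplasm"}
print(consensus_location(example))  # outer membrane
```

With three predictors and a two-vote minimum, at most one location can win; if more predictors were added, a tie-breaking rule would have to be chosen explicitly.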
To predict signal peptides in the HPs, which are a characteristic feature of membrane-bound proteins, we used the SignalP [28] prediction platform. The SecretomeP [29] server was used to find nonclassical secretory proteins among the HPs. Prediction of TMHs helps in the identification of membrane proteins. We used HMMTOP [30] and TMHMM [31] for this purpose. Both programs use Hidden Markov Model (HMM) profiles of a training data set to predict TMHs in query sequences. The supplementary data are given in Supplementary Table 2.

Identification of virulence proteins

The present work emphasizes the identification of potential virulence proteins in the pool of HPs. Pathogenic bacteria contain a range of virulence proteins in their pathogenesis machinery, including adhesins, exotoxins, endotoxins, and secretion systems. We used VirulentPred [32] and VICMpred [33] to identify virulence factors among the HPs. Both tools are based on Support Vector Machines (SVMs) and use 5-fold cross-validation to validate their results. VirulentPred makes a two-way prediction, i.e., non-virulent or virulent, whereas VICMpred categorizes proteins into four classes: proteins involved in cellular processes, metabolism proteins, information molecules, and virulence factors. VICMpred has a training set of 670 proteins from Gram-negative bacteria, including 70 known virulence factors. Information for the virulence factor analysis is provided in Supplementary Table 3.

Homology and function prediction

Homology between proteins, inferred from sequence similarity, provides insight into the functional properties of an unknown protein that is similar to a protein of known function. BLAST [34, 35, 36, 37] is a commonly used and reliable tool for this purpose.
Structure and function prediction help to identify novel drug targets that can be further utilized for therapeutic intervention [38, 39, 40, 41, 42, 43, 44, 45, 46]. We used the blastp module to search for proteins homologous to the HPs against a database of nonredundant protein sequences. To decrease redundancy in the results, we kept only hits with an e-value below 0.0005 and sequence identity above 30%. SMART [47] was used for function prediction. It uses information about the domain architecture of known proteins to provide functional annotation of query sequences. Function prediction based on motif discovery was performed using InterProScan [48] and Motif. InterProScan searches the query sequence against the InterPro consortium databases to infer the function of the protein from motif information. Motif operates as an interface between the user and the motif libraries of known databases. It searches the query sequence against Pfam, TIGRFAM, COG, SMART, PROSITE patterns, and PROSITE profiles, and the user can choose any of these databases. We also used STRING [49] to predict protein-protein interaction networks for the HPs, which gives functional insights based on protein-protein interaction. Information for homology and function prediction is listed in Supplementary Table 4.

Classification and domain assignment

Protein classification and domain assignment using sequence similarity search may give ample evidence for function prediction of the HPs. We used an array of databases and retrieval tools, namely CATH [50], SUPERFAMILY [51], PANTHER [52], Pfam [53], CDART [54], SVMProt [55], and ProtoNet [56], for the classification of the HPs. CATH provides a classification of the structures in the Protein Data Bank (PDB) repository. The CATH v4.0 release contains 235,858 domains, 2,738 superfamilies, and 69,058 annotated PDB entries. The SUPERFAMILY database provides structural and functional annotation of proteins based on HMMs using the SCOP classification system.
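The blastp thresholds stated in the homology section above (e-value below 0.0005, identity above 30%) can be sketched as a small filter. This is a minimal illustration that assumes the hits are available as BLAST+ tabular output with the default twelve columns; the source does not say in which form the results were handled, and the function name and sample rows below are hypothetical.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6) with default fields.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, max_evalue=0.0005, min_identity=30.0):
    """Yield hits with e-value < max_evalue and percent identity > min_identity."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) < max_evalue and float(hit["pident"]) > min_identity:
            yield hit

# Two made-up hits for one query; only the first passes both thresholds.
sample = ("O25317\tsubjA\t42.5\t160\t80\t3\t1\t158\t5\t164\t2e-30\t118\n"
          "O25317\tsubjB\t27.0\t90\t60\t2\t10\t99\t12\t101\t0.002\t35.1\n")
kept = list(filter_hits(io.StringIO(sample)))
print([h["sseqid"] for h in kept])  # ['subjA']
```

Both comparisons are strict, matching the wording "less than" and "more than" in the text, so a hit with exactly 30% identity is discarded.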
PANTHER is another efficient protein classification database based on HMM profiles. PANTHER provides a multi-way classification of proteins on the basis of family and subfamily, molecular function, involvement in a biological process, and association with a pathway in any cellular process. It reduces the risk of redundancy by applying a strict HMM scoring strategy. We also used Pfam for the classification of the HPs. Pfam is a database of protein families with representative multiple sequence alignments and HMMs for each family. SVMProt was also used for functional classification of the HPs. It is an SVM-based classification program trained on a dataset of about 54 functional families of proteins. We performed cluster-based classification of the HPs using ProtoNet, which gives a hierarchical classification of proteins using clusters of proteins that show functional similarity. The information about the classification of the HPs is given in Supplementary Table 5.

Results

The sequences of 340 HPs from the H. pylori 26695 strain were analyzed with a pipeline developed by our group [23, 57]. We used several tools for the sequence analysis, such as BLAST, CATH, SCOP, CDART, InterProScan, Motif, protein family databases, conserved domain databases, a protein cluster database, a protein-protein interaction database, and other analysis tools such as virulence predictors and subcellular localization prediction programs. The data produced by all these methods and prediction programs helped us deduce the results. We assigned probable functions to 104 HPs with high confidence (Table 1). As mentioned earlier, the confidence level was consensus based, i.e., a function predicted for an HP by four or more programs was considered the function of that HP with high confidence and precision. To reduce redundancy and to maintain the reliability of the results, we deliberately omitted the HPs for which functions were predicted with low confidence and less precision.
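The consensus rule described above (a function is accepted when four or more programs agree) can be sketched as follows. This is a minimal illustration, not the pipeline itself: it assumes that the free-text annotations returned by each tool have already been normalized by hand to a shared label, and the function name and example predictions are hypothetical.

```python
from collections import Counter

def consensus_function(tool_predictions, min_tools=4):
    """Return the function label supported by at least `min_tools` tools, else None.

    `tool_predictions` maps a tool name to a normalized function label,
    or to None when the tool returned no prediction.
    """
    votes = Counter(label for label in tool_predictions.values() if label)
    if not votes:
        return None
    # most_common(1) returns the best-supported label; if two labels were
    # tied, the one encountered first would be returned.
    label, count = votes.most_common(1)[0]
    return label if count >= min_tools else None

# Hypothetical, hand-normalized predictions for one HP (illustrative only).
predictions = {
    "BLAST": "hydrolase",
    "Pfam": "hydrolase",
    "SMART": "hydrolase",
    "InterProScan": "hydrolase",
    "CATH": None,
    "PANTHER": "transferase",
}
print(consensus_function(predictions))  # hydrolase
```

An HP for which no label reaches the threshold returns `None`, which corresponds to the HPs that were omitted from the final list because their functions were predicted with low confidence.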
Discussion

Classification of the HPs

To make it easier to understand the probable involvement of these HPs in pathogenesis, we categorized all 104 HPs into functional groups on the basis of their individual molecular functions and their involvement in various biological processes (Fig. 1). We found 27 HPs showing similarities to various enzyme classes such as oxidoreductases, hydrolases, and transferases. Ten HPs are categorized as transporters, 26 show features of binding proteins, 23 are predicted to be involved in various cellular and regulatory processes, and 18 are listed in the category of proteins showing miscellaneous functions. These HPs were further studied and extensively analyzed using previously available literature and experimental studies. Enzymes, with their catalytic properties, play a substantial role in living organisms by providing the biochemical machinery for various cellular and regulatory processes. We found 27 HPs showing similarities to experimentally characterized enzymes that are representative of various enzyme classes. HP O25317 showed similarity to the disulfide bond formation protein DsbB. Disulfide bonds provide stability to proteins and support their maturation; thus, DsbB has a critical role in the development of protein machinery that may be involved in the metabolic or regulatory pathways [58] of the pathogen, thereby helping in pathogenesis. Of the 27 enzymes, five HPs are categorized as transferases. HPs O25589 and O25870 show similarity to an acetyltransferase family protein and a glycosyltransferase family 9 protein (heptosyltransferase), respectively. Both these HPs are predicted to be virulent in the virulence factor analysis. Glycosyltransferases facilitate the "biosynthesis of disaccharides, oligosaccharides, and polysaccharides" by catalyzing the transfer of sugar moieties [59]. HP O25870, a predicted heptosyltransferase, may be a potential drug target.
Heptosyltransferases help in the formation of the core region of lipopolysaccharides, which constitute the major component of the outer membrane in Gram-negative bacteria [60]. About 60% of all predicted enzymes belong to the hydrolase enzyme class, and most of them are involved in metabolic pathways. The predicted hydrolases include ATPases, restriction endonucleases, and phosphoesterases, among others, which facilitate transcription, translation, functional group localization, and other essential activities that help in the development and propagation of the pathogen inside the host. Four HPs show similarities to member proteins of the lyase enzyme class. HP O25309 shows similarity to aminodeoxychorismate lyase and is predicted to be a virulence factor. Aminodeoxychorismate lyase is a member of the pyridoxal-phosphate-binding protein class IV, which helps in the biosynthesis of tetrahydrofolate by converting aminodeoxychorismate to para-aminobenzoate. Tetrahydrofolate is an essential precursor in purine biosynthesis [61]. Transporters have long been a subject of interest in the discovery of novel drugs against pathogenic diseases. Because they have evolved to transport essential molecules, transporters are involved in a wide range of metabolic pathways and other important cellular processes. The H. pylori genome contains many genes that encode transporter proteins, mainly ATP-binding cassette (ABC) transporters. Among the predicted HPs, we found 10 HPs showing characteristic similarity to transporters. HPs O26020 and O26021 show similarity to ABC-2 family transporter proteins. ABC transporters, specific to prokaryotes, are the leading molecules that fulfill the energy requirement of the organism for diverse biological processes [62].
The energy that they provide comes from the hydrolysis of ATP molecules, which ABC transporters [63] perform using domains specifically evolved for ATP hydrolysis. We found that HP O26042 shows similarity to the ferrichrome iron receptor (fhuA). Iron uptake is believed to be a preferential activity in H. pylori for survival in the host system [64]. FhuA is an outer membrane transport protein that catalyzes the transport of ferrichrome and also acts as a receptor for T5 phages and other toxic substances in Escherichia coli [65]. HP O26042 is also predicted to be virulent in the virulence factor analysis. Thus, it can be considered a potential drug target. Twenty-six HPs are characterized as binding proteins. These proteins are further specified according to their functions as adhesin, DNA-, RNA-, protein-, nucleotide-, metal-, and lipid-binding proteins. Some representative members of this group may be involved in key cellular activities, transcription, translation, and other regulatory processes. In this group, we identified four HPs showing characteristics of restriction-modification proteins, three of which belong to type I and one to type II. All these proteins may have an essential role in DNA modification. HP O25934 shows similarity to the type-1 restriction enzyme EcoKI specificity protein (hsdS) and is predicted to be virulent by both VICMpred and VirulentPred. The type-1 restriction enzyme EcoKI specificity protein belongs to the class of S-adenosyl-L-methionine-dependent endonucleases that are constituents of bacterial DNA restriction-modification mechanisms, which guard the organism against invasion by foreign DNA [66]. We identified HP O25749 as showing positive virulence and similarity to a tetratricopeptide repeat (TPR) protein. TPR is a signature motif of proteins that regulate protein-protein interaction and the formation of multiprotein complexes [67].
Proteins with TPR motifs are involved in important biological processes such as the cell cycle, protein folding, and transcriptional regulation [68]. Their involvement in these key processes makes them candidates for potential drug targets. We found that two HPs, O25618 and O25619, show significant similarity to dynamin-like GTPases. The function of dynamin GTPases is well studied in eukaryotes, where they are involved in membrane fusion and fission mediated by the hydrolysis of GTP. The exact function of their prokaryotic counterparts, despite the existence of structural data, is not well understood and needs further study [69]. We identified 23 HPs that may be involved in diverse cellular processes and regulatory mechanisms. Proteins mediating the formation of the cell envelope, such as flagellar biosynthesis proteins and flagellar motility proteins, are signature members of this group. Flagella are responsible for bacterial motility in the host environment, which helps in colonization by the pathogen [70]. H. pylori is equipped with "five to seven unipolar" flagella that are protected against gastric acidity by a covering sheath formed of phospholipids [71]. The relatively high number of genes in H. pylori that encode flagellar proteins supports the view that motility facilitates colonization of the host; thus, their association with bacterial virulence also merits consideration in drug discovery. HP O26095 shows similarity to the flagellar biosynthetic protein FlhB, which mediates the formation of flagella. It may be a potential drug target. We found HP O25564 to be similar to the flagellar hook-length control protein FliK, which controls the length of the flagellar hook during flagellar biosynthesis [72]. In the H. pylori genome, there are seven known genes encoding molecular chaperones.
We identified HP O25894 as showing homology with a molecular chaperone. DnaJ is a signature member of the family of molecular chaperones, which exhibit diverse molecular functions such as ATP binding, metal ion binding, and unfolded protein binding, and are involved in a number of key biological processes such as protein folding, protein unfolding, DNA replication, and the response to heat shock [73]. The involvement of chaperones in essential cellular processes required for the survival and propagation of the pathogen makes them potential targets for the development of effective drugs against pathogenicity. Although we categorized the HPs into definite functional classes on the basis of their molecular functions and their involvement in diverse biological processes, there are HPs that exhibit unique functions or whose functions are not clearly classified in the available literature. We put those HPs in the group of proteins exhibiting miscellaneous functions. HP O25579 is identified as a toxin-like outer membrane protein and shows significant virulence in the virulence factor analysis. We found HP O25993 to be similar to a lipoprotein, with positive virulence. Although H. pylori infects the host in the free environment, evidence of adherence to the epithelial cells of the host's gastric tissues has also been found [64]. Outer membrane proteins and lipoproteins have an effective role in cell adhesion in H. pylori [5]; thus, they may be taken as strong candidates for drug targets. We identified HP O25713 as similar to neuraminyllactose-binding hemagglutinin (NLBH), with substantial virulence. In H. pylori, NLBH, which is also a lipoprotein, has an effective role in adhesion to the gastric epithelium of the host [74]. We identified three characterized genes at distant locations in the H. pylori genome that encode NLBH proteins. The specific function of NLBH indicates its role in virulence, making it a potential therapeutic target.
Virulence factors

As discussed in the last section, we performed virulence factor analysis for all 340 HPs to identify virulent proteins that play an effective role in the propagation of disease. We preferred a consensus-based approach, counting an HP as virulent only when both predictors, VICMpred and VirulentPred, gave a positive result. In this way, we found 22 HPs predicted to be virulent by both programs (Table 2). Among the 104 HPs with predicted functions, we found 11 HPs showing positive virulence; these are marked in Table 1. Pinpointing virulence proteins among the functionally annotated candidates paves the way for more focused studies on drug discovery and development. The results of the virulence factor analysis are therefore significant for further study and experimental characterization of the predicted HPs.

Subcellular localization

Identifying the subcellular location of a protein is significant in a computer-based functional analysis because there is a strong relation between the function of a protein and its location in the cellular space [75, 76]. It also gives insight into probable drug or vaccine targets among the identified virulent proteins. For the 104 newly annotated HPs, we deduced their relative subcellular locations from the results of the subcellular localization predictions discussed earlier, using a consensus-based approach. The relative subcellular locations of the predicted HPs are given in Table 1. We also classified the predicted HPs based on their subcellular locations (Fig. 2). Associating the results of subcellular localization with those of the virulence factor analysis may help in the identification of probable drug or vaccine targets.
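The two-step selection described in the virulence factor analysis (virulent according to both predictors, then restricted to HPs with an assigned function) amounts to a set intersection. The sketch below is a minimal illustration under stated assumptions: the label strings, the function name, and the placeholder protein IDs are all hypothetical, and the per-tool results are assumed to have been tabulated by hand.

```python
def virulent_by_both(virulentpred, vicmpred):
    """Return the IDs called virulent by both predictors.

    Both arguments map a protein ID to that predictor's class label.
    The label strings used here are assumed, not the tools' exact output.
    """
    return {
        pid for pid, label in virulentpred.items()
        if label == "virulent" and vicmpred.get(pid) == "virulence factor"
    }

# Placeholder IDs and labels (illustrative only, not results from the study).
virulentpred = {"HP_A": "virulent", "HP_B": "virulent", "HP_C": "non-virulent"}
vicmpred = {"HP_A": "virulence factor", "HP_B": "metabolism", "HP_C": "virulence factor"}
annotated = {"HP_A", "HP_C"}  # HPs that received a consensus function

both = virulent_by_both(virulentpred, vicmpred)
candidates = sorted(both & annotated)
print(candidates)  # ['HP_A']
```

Applied to the full data set, the first step would correspond to the 22 HPs of Table 2 and the second to the 11 virulent HPs marked in Table 1. The consensus subcellular location of each candidate could then be attached to help prioritize drug or vaccine targets.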
In conclusion, computational sequence analysis of HPs to find possible functional clues is extensive work that requires patience, because each gene is analyzed individually with an array of tools and databases. The inferences are drawn carefully to reduce the possibility of false positives. Because similar-looking patterns occur, one prediction tool may predict a different function for the same HP than another tool does. We therefore selected a consensus-based approach, cross-checking the results of all the programs used and then drawing inferences on the basis of a majority rule. The majority rule is the criterion that takes the function predicted by four or more tools as the probable function of the HP. In this way, we predicted probable functions for 104 HPs with a high level of confidence. A wide range of HPs show functional similarities to proteins that play an essential role in bacterial pathogenesis. The study may pave the way for experimentalists to pursue in vitro functional characterization of the virulent proteins that may be considered potential therapeutic targets in the process of drug discovery.

Fig. 1 Classification of hypothetical proteins into enzymes (n=27), transporters (n=10), binding proteins (n=26), cellular processes/regulatory proteins (n=23), and miscellaneous functions (n=18).

Fig. 2 Classification of HPs on the basis of subcellular localization.

Table 1 List of 104 HPs with predicted functions from Helicobacter pylori. (a) Hypothetical proteins (HPs) predicted virulent in virulence factors analysis.

Table 2 List of predicted virulent proteins from Helicobacter pylori
