Methods

Data abstraction

The primary source of genome data in this study is the National Center for Biotechnology Information (NCBI) genome database. We retrieved preliminary information by searching with the string "Helicobacter pylori", which leads to the genome-wide project report for the H. pylori genome. From the database we selected H. pylori strain 26695 (RefSeq accession NC_000915.1). The genome contains 1,555 genes coding for 1,445 proteins. We then extracted the hypothetical proteins (HPs) from this pool of 1,445 proteins. Using their protein product IDs (e.g., NP_206816.1), we retrieved the UniProt IDs and FASTA sequences of the HPs from UniProt, and these sequences were used for all further analysis.

Physicochemical parameterization

The physicochemical properties of a protein help in deducing its biochemical characteristics and in characterizing its function. We used the ProtParam server [24] on ExPASy to estimate the physicochemical parameters of the HPs. ProtParam computes an array of physicochemical properties from the sequence using predefined formulas and experimental inferences. We predicted the relative molecular weight, theoretical pI, extinction coefficient, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) of the HPs. All of these properties help in identifying the probable function of the proteins. The data for physicochemical parameterization are listed in Supplementary Table 1.

Sub-cellular localization

The function of a protein is strongly influenced by its location in the cell. For instance, proteins of the exoproteome and secretory proteins often play an essential role in virulence-related activities such as adherence to host cells. We used an array of tools for the subcellular localization of the HPs: PSORTb [25], PSLpred [26], and CELLO [27] were used to predict the location of the HPs in the cell.
These predictors use experimental data from known proteins to make predictions for query proteins from their FASTA sequences. They predict the likely location of a protein among cellular and extracellular compartments such as the cytoplasm, periplasm, inner membrane, outer membrane, or extracellular space. We used the SignalP [28] prediction platform to detect signal peptides in the HPs, which are a characteristic feature of membrane-bound proteins. The SecretomeP [29] server was used to find nonclassical secretory proteins among the HPs. Predicting transmembrane helices (TMHs) helps in identifying membrane proteins. We used HMMTOP [30] and TMHMM [31] for this purpose. Both programs use hidden Markov model (HMM) profiles built from a training data set to predict TMHs in query sequences. The localization data are given in Supplementary Table 2.

Identification of virulence proteins

The present work emphasizes the identification of potential virulence proteins in the pool of HPs. Pathogenic bacteria carry a range of virulence proteins in their pathogenesis machinery, including adhesins, exotoxins, endotoxins, and secretion systems. We used VirulentPred [32] and VICMpred [33] to identify virulence factors among the HPs. Both tools are based on support vector machines (SVMs) and use 5-fold cross-validation to validate their results. VirulentPred makes a two-way prediction (virulent or non-virulent), whereas VICMpred assigns proteins to four classes: proteins involved in cellular processes, metabolism proteins, information molecules, and virulence factors. VICMpred has a training set of 670 proteins from Gram-negative bacteria, including 70 known virulence factors. Information for the virulence factor analysis is provided in Supplementary Table 3.
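The 5-fold cross-validation scheme mentioned for the virulence predictors can be illustrated with a minimal sketch. It only shows how a data set of 670 proteins (the size of the VICMpred training set) would be partitioned into five folds, each held out once for testing. It is not the code used by VirulentPred or VICMpred; the function name k_fold_splits is hypothetical, and the shuffling or stratification that a real pipeline would normally apply before splitting is omitted here.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Hypothetical helper for illustration only: every sample is used
    for testing exactly once and for training in the other k - 1 folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute any remainder over the first `extra` folds.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


if __name__ == "__main__":
    # 670 proteins split into 5 folds, as an example.
    for i, (train, test) in enumerate(k_fold_splits(670, k=5), start=1):
        print(f"fold {i}: train={len(train)} test={len(test)}")
```

With 670 samples and five folds, each round trains on 536 proteins and tests on the remaining 134.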
Homology and function prediction

Homology inferred from sequence similarity provides insight into the functional properties of an unknown protein that is similar to a protein of known function. BLAST [34, 35, 36, 37] is a commonly used and reliable tool for this purpose. Structure and function prediction helps to identify novel drug targets, which can be further exploited for therapeutic intervention [38, 39, 40, 41, 42, 43, 44, 45, 46]. We used the blastp module to search for proteins homologous to the HPs against a database of non-redundant protein sequences. To decrease redundancy in the results, we set thresholds of an e-value below 0.0005 and a sequence identity above 30%. SMART [47] was used for function prediction. It uses information about the domain architecture of known proteins to provide functional annotation of query sequences. Function prediction based on motif discovery was performed using InterProScan [48] and Motif. InterProScan searches the query sequence against the databases of the InterPro consortium and uses motif information to infer the function of the proteins. Motif acts as an interface between the user and the motif libraries of known databases: it searches the query sequence against Pfam, TIGRFAM, COG, SMART, PROSITE patterns, and PROSITE profiles, and the user can choose any of these databases. We also used STRING [49] to predict protein-protein interaction networks for the HPs, which gives functional insights based on those interactions. Information for homology and function prediction is listed in Supplementary Table 4.

Classification and domain assignment

Protein classification and domain assignment using sequence similarity search can provide ample evidence for function prediction of the HPs.
We used an array of databases and retrieval tools, namely CATH [50], SUPERFAMILY [51], PANTHER [52], Pfam [53], CDART [54], SVMProt [55], and ProtoNet [56], to classify the HPs. CATH provides a classification of the protein structures in the Protein Data Bank (PDB). The CATH v4.0 release contains 235,858 domains, 2,738 superfamilies, and 69,058 annotated PDB entries. The SUPERFAMILY database provides structural and functional annotation of proteins based on HMMs, using the SCOP classification system. PANTHER is another protein classification database based on HMM profiles. It provides a multi-way classification of proteins by family and subfamily, molecular function, involvement in a biological process, and association with a pathway. It reduces the risk of redundancy by applying a strict HMM scoring strategy. We also used Pfam, a database of protein families with representative multiple sequence alignments and HMMs for each family. SVMProt was used for functional classification of the HPs; it is SVM-based classification software trained on a data set of about 54 functional families of proteins. We performed cluster-based classification of the HPs using ProtoNet, which gives a hierarchical classification of proteins using clusters of proteins showing functional similarity. Information about the classification of the HPs is given in Supplementary Table 5.
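The homology thresholds stated under Homology and function prediction (e-value below 0.0005 and sequence identity above 30%) can be expressed as a simple filter over tabular blastp results. The sketch below assumes the hits are available in the standard 12-column tabular layout produced by BLAST+ with -outfmt 6 (query ID, subject ID, percent identity, ..., e-value, bit score); the text does not state which output format was used, so this is an assumption. The function name filter_hits and the example rows are hypothetical and serve only as an illustration.

```python
import csv
import io

MAX_EVALUE = 0.0005    # e-value threshold stated in the text
MIN_IDENTITY = 30.0    # percent identity threshold stated in the text


def filter_hits(handle, max_evalue=MAX_EVALUE, min_identity=MIN_IDENTITY):
    """Return (query, subject, identity, evalue) for hits passing both cut-offs.

    Assumes tab-separated BLAST tabular output (-outfmt 6), where
    column 3 is percent identity and column 11 is the e-value.
    """
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity = float(row[2])
        evalue = float(row[10])
        if evalue < max_evalue and identity > min_identity:
            hits.append((query, subject, identity, evalue))
    return hits


# Made-up rows for illustration; the IDs and values are not real results.
EXAMPLE = (
    "query1\tsubjA\t45.2\t200\t108\t2\t1\t200\t5\t204\t1e-50\t180\n"
    "query1\tsubjB\t28.0\t150\t105\t3\t10\t159\t20\t169\t2e-10\t60\n"
    "query2\tsubjC\t35.0\t80\t50\t2\t1\t80\t3\t82\t0.01\t30\n"
)

if __name__ == "__main__":
    for hit in filter_hits(io.StringIO(EXAMPLE)):
        print(hit)
```

In this example only the first row is kept: the second row fails the identity cut-off and the third fails the e-value cut-off.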