PMC:540041

GenDiS: Genomic Distribution of protein structural domain Superfamilies

Abstract

Several proteins that have substantially diverged during evolution retain similar three-dimensional structures and biological function in spite of poor sequence identity. The database on Genomic Distribution of protein structural domain Superfamilies (GenDiS) provides a record of the distribution of 4001 protein domains organized as 1194 structural superfamilies across 18 997 genomes at various levels of the taxonomic hierarchy. The GenDiS database provides a survey of protein domains listed in sequence databases, employing a 3-fold sequence search approach. Lineage-specific literature is obtained from the taxonomy database for individual protein members to provide a platform for performing genomic and phyletic studies across organisms. The database documents residual properties and provides alignments for the various superfamily members in genomes, offering insights for the rational design of experiments and a better understanding of a superfamily. The GenDiS database can be accessed at http://www.ncbs.res.in/~faculty/mini/gendis/home.html.

INTRODUCTION

High-throughput large-scale sequencing efforts have illustrated the enormous diversity embedded within genomes owing to the varied composition of the proteome. Fortunately, structural and sequence analyses suggest strong convergence, indicating that many proteins will share a limited number of folds (1). Curation of protein structural entries in a hierarchy (2,3), compilation of sequence families (4,5) and superfamilies (6,7), establishing relationships between protein sequence and structural databases (8,9) and the analysis of genomic patterns (10,11) form representative approaches to understanding the process of this strong convergence.
Reliable association of unannotated protein sequences with pre-existing families of well-characterized structure and function allows the mapping of functionally important residues on sequence alignments, which can provide important insights into functional mechanisms. However, similarity and inheritance of function among homologues related in the twilight zone have to be considered after careful validation (12).

Genomes are classified into taxons on the basis of morphology and genetic content under the taxonomy database (13). Classification of the organism at various taxonomic strata elaborates diversity among the organisms along with their proteomic content (14). Genome content and distribution of proteins provide a better understanding of species phylogeny (15). Exploring the distribution of structural superfamilies across varied strata of taxons provides an addendum to our understanding of proteins and the phylogeny of the organism.

The database of Genomic Distribution of protein structural domain Superfamilies (GenDiS) aims to provide structural assignments to genes listed within the non-redundant protein sequence database at the superfamily level. Structural superfamily definitions are in correspondence with the SCOP 1.63 (16) and PASS2 (17) databases. Searches for homologues within the sequence databases have been performed using multiple approaches (see Methods). Assignments have been subsequently validated before inducting a member. The genomic lineage for every individual entry was obtained from the taxonomy database and the corresponding taxon records were assigned. The database offers a platform for understanding and comparing the distribution of protein superfamilies across the different taxonomic strata.

METHODS

Searching for potential superfamily members in sequence databases

Potential members of the superfamilies have been searched using a 3-fold approach.
Members of the PASS2 database (17) were queried against the April 2003 release of the non-redundant sequence database (13) employing PSI-BLAST (18), with an expectation value of 10^-3 for 20 iterations. The profile-to-sequence searches were complemented employing the HMMsearch tool of the HMMER suite (19). Hidden Markov models (HMMs) were derived for domain superfamilies starting from structure-based sequence alignments of PASS2 members (17), and an expectation threshold of 0.1 was used during the searches. In addition, motif-constrained PHI-BLAST (20) searches were also carried out as reported previously (21,22) for a single iteration and an expectation value of 1.0. A composite set of domain assignments was obtained for individual superfamilies from these three approaches. The alignment lengths were compared with the query to ensure that they correspond to the full length of PASS2 domains (23) (Figure 1). Redundant proteins were removed employing CD-HIT (24) at a stringent sequence identity cut-off of 100%. Domains assigned to a superfamily belonging to a genome were aligned using CLUSTALW (25). The alignments have been colour-coded by examining the conservation and similarity at the various positions.

Taxonomic annotation of the superfamily members and alignments

Non-redundant sequences, maintained at the NCBI, form a composite resource of several genome databases. GenDiS records the source organism of the assigned proteins and a detailed taxonomic lineage of the species in correspondence with the taxonomy database (13). Taxonomic classifications at the phylum, class, order, family, genus and species levels have been recorded against individual entries. Proteins belonging to similar taxons are clustered together and further sub-grouped at the superfamily level (Figure 1).

TOOLS AND SERVICES AT THE GenDiS SERVER

The GenDiS database can be navigated through a user-friendly search engine to obtain relevant information on taxonomic and superfamily distribution.
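The search-and-filter procedure described under METHODS can be illustrated with a minimal Python sketch. It is not the GenDiS pipeline itself: the `Hit` record, the `composite_assignments` function and the 0.8 coverage threshold are hypothetical names and values introduced here for illustration (the text states only that alignment lengths were compared with the query). Exact-duplicate removal is used as a simplified stand-in for CD-HIT at 100% identity, which can also absorb sequences fully contained in longer ones. Only the three expectation-value cut-offs are taken from the text.

```python
from dataclasses import dataclass

# Expectation-value cut-offs quoted in the text for the three searches.
E_VALUE_CUTOFF = {"psiblast": 1e-3, "hmmsearch": 0.1, "phiblast": 1.0}

# Assumption: the text gives no numeric coverage threshold; 0.8 is illustrative.
MIN_QUERY_COVERAGE = 0.8


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one domain hit parsed from a search report."""
    method: str          # "psiblast", "hmmsearch" or "phiblast"
    superfamily: str
    protein_id: str
    sequence: str        # matched domain sequence
    e_value: float
    aligned_length: int
    query_length: int


def composite_assignments(hits):
    """Pool hits from all three searches, keep near full-length ones,
    and collapse identical sequences within a superfamily."""
    kept = {}
    for hit in hits:
        # Reject hits weaker than the cut-off of the method that found them.
        if hit.e_value > E_VALUE_CUTOFF[hit.method]:
            continue
        # Reject alignments that cover too little of the query domain.
        if hit.aligned_length / hit.query_length < MIN_QUERY_COVERAGE:
            continue
        # Stand-in for CD-HIT at 100% identity: exact duplicates share a key.
        key = (hit.superfamily, hit.sequence)
        if key not in kept or hit.e_value < kept[key].e_value:
            kept[key] = hit
    return sorted(kept.values(), key=lambda h: (h.superfamily, h.protein_id))


# Made-up example hits, not real GenDiS data.
example_hits = [
    Hit("psiblast", "sf_A", "p1", "MKVLA", 1e-5, 95, 100),
    Hit("hmmsearch", "sf_A", "p2", "MKVLA", 1e-8, 98, 100),  # duplicate of p1
    Hit("phiblast", "sf_A", "p3", "MKTLG", 0.5, 90, 100),
    Hit("psiblast", "sf_A", "p4", "MRRLA", 0.01, 99, 100),   # above PSI-BLAST cut-off
    Hit("hmmsearch", "sf_B", "p5", "GGSLA", 0.05, 40, 100),  # covers 40% of query
]

print([h.protein_id for h in composite_assignments(example_hits)])
```

Keeping the per-method cut-offs in one dictionary makes it explicit that a hit is judged against the threshold of the search that produced it, before the three result sets are merged.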
The database has been linked to taxonomy and other protein databases. The GenDiS server provides several useful tools for performing genome and cross-genome analysis.

Information about superfamily members

The presence of superfamily members at the different taxonomic levels is summarized. Domains of the various superfamilies before and following the validation (pruned set) are downloadable. Domain architecture was identified for validated members of GenDiS employing IMPALA (26) against PASS2 profiles of structural domains. Average domain length and sequence diversity within genomes and at the superfamily level are listed. HMMs can be obtained for the various superfamilies.

Genome and taxonomic information

The full list of the diverse superfamilies residing at the various taxonomic hierarchies can be retrieved from the database. Information about the occurrences of the various descending taxons within a particular hierarchy level of taxonomy is provided. Completely sequenced genomes have been separately listed and can be browsed through the complete genome list. The number of superfamilies and homologous sequences present in the various genomes can be obtained. Alignments of the members of particular superfamilies within genomes and conserved regions of the alignment are provided. For multi-membered superfamilies, a diversity score evaluated by the Makowski and Soares (27) method and the phylogenetic tree obtained on the basis of protein dissimilarity are presented. Domain architectures can also be retrieved at the phylum, class, order and genus levels of the taxonomic hierarchy.

Overlap score within genomes

Distinction among organisms results from the composite proteome encoded by the genome. Comprehensive structural domain assignments at the proteome level provide opportunities to study the distribution of the common and unique superfamilies among the completely sequenced genomes.
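As a rough illustration of comparing the superfamily content of two completely sequenced genomes, the sketch below lists common and unique superfamilies and reports a single similarity value. The formula used by GenDiS for its overlap score is not stated in this text, so the Jaccard index (shared superfamilies divided by all superfamilies seen in either genome) is used here purely as an assumed stand-in. The function name `compare_genomes` and the superfamily labels are hypothetical.

```python
def compare_genomes(superfamilies_a, superfamilies_b):
    """Compare the superfamily content of two genomes.

    Returns the common and unique superfamilies plus a Jaccard index,
    which is an assumed stand-in for the GenDiS overlap score.
    """
    a, b = set(superfamilies_a), set(superfamilies_b)
    common = a & b
    union = a | b
    # Guard against two empty genomes to avoid division by zero.
    overlap = len(common) / len(union) if union else 0.0
    return {
        "common": sorted(common),
        "unique_a": sorted(a - b),
        "unique_b": sorted(b - a),
        "overlap": overlap,
    }


# Made-up superfamily labels, not real GenDiS assignments.
genome_x = ["sf_A", "sf_B", "sf_C"]
genome_y = ["sf_B", "sf_C", "sf_D", "sf_E"]

report = compare_genomes(genome_x, genome_y)
print(report["common"], report["unique_a"], report["unique_b"], report["overlap"])
```

A set-based comparison ignores how many members each superfamily has in a genome; a score weighted by member counts would be an equally plausible reading of the text.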
The overlap score for a pair of completed genomes, along with the listing of common and unique superfamilies, demonstrates similarity among the organisms at a more holistic level.

Alignments of desired query to superfamilies

Options are provided for aligning query sequences to superfamily members within a genome or for performing genome-wide alignments for specific superfamilies. The alignments are performed employing CLUSTALW (25).

Assigning structural domain architectures

Domain architectural assignments of unannotated sequences elucidate the combination of structural domains embedded within the polypeptide, aiding its detailed characterization (28). Structural domains can be assigned to a query sequence by probing against sequence profiles of PASS2 members employing IMPALA (26).

CONCLUSION

GenDiS is a compendium of sequence domains of evolutionarily related proteins grouped at the superfamily level in direct correspondence with the SCOP (16) and PASS2 (17) databases. Furthermore, it is possible to obtain links between structural hierarchy and taxonomic levels at GenDiS. Availability of alignments for sequence domains in the various genomes over the World Wide Web facilitates the study and design of experiments on specific superfamilies. The database creates a framework for a systematic survey and analysis of various structural superfamilies. The database may be accessed and downloaded across the World Wide Web (http://caps.ncbs.res.in/gendis/download.html). Associating different proteins with structurally similar and evolutionarily related proteins enhances our functional understanding of a protein superfamily. Complete taxonomic information corresponding to individual sequences in the GenDiS database provides a platform for performing cross-genomic or phyletic analysis at various levels of the taxonomic hierarchy.
A World Wide Web interface would provide an understanding of the various sequence relatives across the various genomes, their conservation and sequence diversity, enhancing our comprehension of a protein superfamily or an organism.
