PMC:539984

{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/539984","sourcedb":"PMC","sourceid":"539984","source_url":"http://www.ncbi.nlm.nih.gov/pmc/539984","text":"The SYSTERS Protein Family Database in 2005 \n\nAbstract\nThe SYSTERS project aims to provide a meaningful partitioning of the whole protein sequence space by a fully automatic procedure. A refined two-step algorithm assigns each protein to a family and a superfamily. The sequence data underlying SYSTERS release 4 now comprise several protein sequence databases derived from completely sequenced genomes (ENSEMBL, TAIR, SGD and GeneDB), in addition to the comprehensive Swiss-Prot/TrEMBL databases. The SYSTERS web server (http://systers.molgen.mpg.de) provides access to 158 153 SYSTERS protein families. To augment the automatically derived results, information from external databases like Pfam and Gene Ontology are added to the web server. Furthermore, users can retrieve pre-processed analyses of families like multiple alignments and phylogenetic trees. New query options comprise a batch retrieval tool for functional inference about families based on automatic keyword extraction from sequence annotations. A new access point, PhyloMatrix, allows the retrieval of phylogenetic profiles of SYSTERS families across organisms with completely sequenced genomes.\n\nINTRODUCTION\nThe principal goal of the SYSTERS project is to automatically partition all the available protein space. Because the fully automated classification scheme does not rely on interventions and updates by experts, the SYSTERS approach is complementary to expert-curated protein domain or protein family classification schemes like Pfam (1), SMART (2) or PROSITE (3). The SYSTERS database is derived from rigorous all-against-all Smith–Waterman searches (4). The resulting pairwise sequence similarities are used in a refined two-step clustering approach that assigns each protein to a family and a superfamily (A. Krause, J. Stoye and M. 
Vingron, submitted for publication).\nThe SYSTERS web resource comprises a multitude of query access points, data retrieval options, pre-processed sequence analyses of individual families and comprehensive views on multiple families (Figure 1). The automatically derived protein families are augmented with expert-curated biological information from various resources. For the functional characterization of each cluster, keywords are extracted from annotations of source sequence databases and are assigned to each family. In SYSTERS release 4, Pfam domain assignments to sequences of Swiss-Prot/TrEMBL (5) help to visualize the domain architecture of a protein and to identify differences in domain composition within a protein family. A special focus of SYSTERS is to support phylogenetic studies of protein families. Sequences of SYSTERS families can be selected and downloaded in multiple ways. The users are offered pre-calculated multiple alignments and phylogenetic trees that can serve as a starting point for their own focused analyses.\nIn this paper, we will describe the differences of SYSTERS release 4 compared to previous releases and highlight the recent developments of tools to access and view information on SYSTERS protein families and superfamilies.\n\nINPUT DATA AND CLUSTERING RESULTS OF SYSTERS RELEASE 4\nThe underlying protein data for SYSTERS release 4 comprise more than 1.1 million sequences. The Swiss-Prot/TrEMBL database content was extended by several protein data sources with information from completely sequenced genomes (Figure 1): Saccharomyces cerevisiae (6), Schizosaccharomyces pombe (7), Arabidopsis thaliana (8), Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Caenorhabditis briggsae, Takifugu rubripes, Mus musculus and Homo sapiens (9). 
After removal of redundant sequences, the results of more than 10¹¹ pairwise Smith–Waterman comparisons were fed into the clustering procedure (Table 1).\nThe resulting numbers of SYSTERS superfamilies and protein families are presented in Table 2. Only 11.8% of sequences remained as singletons. The majority (74%) of multi-sequence families are ‘perfect’, meaning that all sequences in a family match with each other. Only 6.5% of the families are classified as ‘overlapping’: these families might harbour protein pairs that do not share homologous regions, but are linked indirectly via an intermediate protein that has distinct homologous regions in common with both. The protein family size is power-law-like distributed (10). There are few families with many sequences and many families with only a few sequences. This result complements earlier findings (11,12) on the mode of protein evolution.\n\nNEW FEATURES AND SERVICES\n\nInformation characterizing a SYSTERS family\nFor each protein family, SYSTERS provides a comprehensive overview of its member proteins and their annotations. On the entry page, users have access to more detailed information on protein annotations, sequences, multiple alignments, phylogenetic analyses, protein domains, taxonomic distribution and gene structure-related data (Figure 1).\nIn addition to pre-calculated multiple alignments by MView (13), the SYSTERS web server now offers multiple alignments and UPGMA trees generated using DIALIGN (14). The DIALIGN alignment incorporates all sequences in full length, colour-coded information on alignment quality and Pfam domain positions. From MView alignments we derived consensus sequences for each family. The database of consensus sequences can be queried by the user via BLAST (15) interface. SYSTERS provides a new wizard-like tool that allows a flexible selection of user-defined sequences. In this way, users can compile sequences of different SYSTERS families or user-supplied sequences. 
Subsequently, multiple alignment and UPGMA trees can be constructed using DIALIGN and viewed online.\nWe extracted frequently occurring keywords from all original protein annotations of a SYSTERS family. The keyword list represents a succinct functional description of a family, thus helping to infer functions of hypothetical proteins. We integrated further Swiss-Prot/TrEMBL annotations such as Gene Ontology (16) terms, InterPro (17) terms and Enzyme Commission (EC) numbers (18) that support function inference. A new batch-retrieval tool allows fast annotation of large protein sets. Users can supply a list of sequence database identifiers, e.g. from SWISS-PROT, and are offered to download a list of associated SYSTERS protein family IDs, extracted keywords and GO terms.\nProtein domain positions of all Swiss-Prot/TrEMBL proteins as annotated in the Pfam database are now integrated into SYSTERS. Domain architectures of all proteins in a SYSTERS family are visualized and can easily be compared. This allows to pinpoint differences in domain architectures within the family that might indicate lineage-specific domain acquisitions or losses.\n\nTaxonomy and phylogenetic profiling\nWe have integrated the taxonomic system as maintained by the NCBI (19) into SYSTERS and offer to visualize the distribution of protein family members over the taxonomic tree. This now allows users to select sequences of a subfamily specified by internal nodes of the taxonomic tree for further analysis. Additionally, it is possible to select all SYSTERS protein families that have (at least one/exclusively) member protein(s) within a user-defined taxonomic range.\nA special taxonomic view of a protein family focuses on the presence/absence patterns of member proteins across organisms, also known as phylogenetic profiles (20). Similar profiles often point to similar cellular function or a physical interaction. We set up PhyloMatrix, an extension of SYSTERS to phylogenetic profiling. 
PhyloMatrix profiles are based on the representation of 106 completely sequenced organisms in SYSTERS protein families, 78 bacteria, 12 eukaryota and 16 archaea. We found 7563 different profiles for 19 374 protein families under the constraint that at least three organisms be present in a family. Users can define a list of protein family IDs to retrieve a set of profiles. Alternatively, PhyloMatrix can be queried with a specific organism pattern to display profiles of matching families. PhyloMatrix is a helpful tool for the exploration of evolutionary events. For example, Figure 2 shows profiles of ribosomal protein families. These are complementary for mitochondrial and cytosolic forms reflecting the endosymbiotic origin of mitochondria.\n\nCross-references to external databases\nThe SYSTERS web server augments information on sequences and protein families by links to a multitude of data resources. We reference all protein source databases (Figure 1). In addition, SYSTERS can be queried with gene names, with accessions from the EMBL nucleotide database (21) or with identifiers of the specialized structure databases, such as PDB (22), MSD (23) and IMB (24). SYSTERS is embedded in the network of genomic database resources in the Computational Molecular Biology Department of the Max Planck Institute for Molecular Genetics, Berlin, including GeneNest, SpliceNest (25) and CORG (26). 
","divisions":[{"label":"Title","span":{"begin":0,"end":44}},{"label":"Abstract","span":{"begin":55,"end":1166}},{"label":"Body","span":{"begin":1167,"end":8967}},{"label":"Section","span":{"begin":1167,"end":3083}},{"label":"Title","span":{"begin":1167,"end":1179}},{"label":"Section","span":{"begin":3085,"end":4514}},{"label":"Title","span":{"begin":3085,"end":3139}},{"label":"Section","span":{"begin":4516,"end":8966}},{"label":"Title","span":{"begin":4516,"end":4541}},{"label":"Section","span":{"begin":4543,"end":6739}},{"label":"Title","span":{"begin":4543,"end":4586}},{"label":"Section","span":{"begin":6741,"end":8315}},{"label":"Title","span":{"begin":6741,"end":6776}},{"label":"Section","span":{"begin":8317,"end":8965}},{"label":"Title","span":{"begin":8317,"end":8355}}],"tracks":[{"project":"2_test","denotations":[{"id":"15608183-14681378-76782995","span":{"begin":1513,"end":1514},"obj":"14681378"},{"id":"15608183-14681379-76782996","span":{"begin":1524,"end":1525},"obj":"14681379"},{"id":"15608183-14681377-76782997","span":{"begin":1539,"end":1540},"obj":"14681377"},{"id":"15608183-7265238-76782998","span":{"begin":1630,"end":1631},"obj":"7265238"},{"id":"15608183-12520024-76782999","span":{"begin":2418,"end":2419},"obj":"12520024"},{"id":"15608183-14681421-76783000","span":{"begin":3405,"end":3406},"obj":"14681421"},{"id":"15608183-14681429-76783001","span":{"begin":3436,"end":3437},"obj":"14681429"},{"id":"15608183-12519987-76783002","span":{"begin":3462,"end":3463},"obj":"12519987"},{"id":"15608183-14681459-76783003","span":{"begin":3609,"end":3610},"obj":"14681459"},{"id":"15608183-12432406-76783004","span":{"begin":4474,"end":4476},"obj":"12432406"},{"id":"15608183-9580988-76783005","span":{"begin":4477,"end":4479},"obj":"9580988"},{"id":"15608183-9632837-76783006","span":{"begin":4989,"end":4991},"obj":"9632837"},{"id":"15608183-10222408-76783007","span":{"begin":5089,"end":5091},"obj":"10222408"},{"id":"15608183-9254694-76783008","span":{"begin":5376,
"end":5378},"obj":"9254694"},{"id":"15608183-14681407-76783009","span":{"begin":6001,"end":6003},"obj":"14681407"},{"id":"15608183-12520011-76783010","span":{"begin":6022,"end":6024},"obj":"12520011"},{"id":"15608183-10592255-76783011","span":{"begin":6068,"end":6070},"obj":"10592255"},{"id":"15608183-14681353-76783012","span":{"begin":6844,"end":6846},"obj":"14681353"},{"id":"15608183-10200254-76783013","span":{"begin":7403,"end":7405},"obj":"10200254"},{"id":"15608183-14681351-76783014","span":{"begin":8635,"end":8637},"obj":"14681351"},{"id":"15608183-10592235-76783015","span":{"begin":8712,"end":8714},"obj":"10592235"},{"id":"15608183-14681397-76783016","span":{"begin":8722,"end":8724},"obj":"14681397"},{"id":"15608183-11752308-76783017","span":{"begin":8735,"end":8737},"obj":"11752308"},{"id":"15608183-11752319-76783018","span":{"begin":8947,"end":8949},"obj":"11752319"},{"id":"15608183-12519946-76783019","span":{"begin":8961,"end":8963},"obj":"12519946"}],"attributes":[{"subj":"15608183-14681378-76782995","pred":"source","obj":"2_test"},{"subj":"15608183-14681379-76782996","pred":"source","obj":"2_test"},{"subj":"15608183-14681377-76782997","pred":"source","obj":"2_test"},{"subj":"15608183-7265238-76782998","pred":"source","obj":"2_test"},{"subj":"15608183-12520024-76782999","pred":"source","obj":"2_test"},{"subj":"15608183-14681421-76783000","pred":"source","obj":"2_test"},{"subj":"15608183-14681429-76783001","pred":"source","obj":"2_test"},{"subj":"15608183-12519987-76783002","pred":"source","obj":"2_test"},{"subj":"15608183-14681459-76783003","pred":"source","obj":"2_test"},{"subj":"15608183-12432406-76783004","pred":"source","obj":"2_test"},{"subj":"15608183-9580988-76783005","pred":"source","obj":"2_test"},{"subj":"15608183-9632837-76783006","pred":"source","obj":"2_test"},{"subj":"15608183-10222408-76783007","pred":"source","obj":"2_test"},{"subj":"15608183-9254694-76783008","pred":"source","obj":"2_test"},{"subj":"15608183-14681407-76783009","pred":"sou
rce","obj":"2_test"},{"subj":"15608183-12520011-76783010","pred":"source","obj":"2_test"},{"subj":"15608183-10592255-76783011","pred":"source","obj":"2_test"},{"subj":"15608183-14681353-76783012","pred":"source","obj":"2_test"},{"subj":"15608183-10200254-76783013","pred":"source","obj":"2_test"},{"subj":"15608183-14681351-76783014","pred":"source","obj":"2_test"},{"subj":"15608183-10592235-76783015","pred":"source","obj":"2_test"},{"subj":"15608183-14681397-76783016","pred":"source","obj":"2_test"},{"subj":"15608183-11752308-76783017","pred":"source","obj":"2_test"},{"subj":"15608183-11752319-76783018","pred":"source","obj":"2_test"},{"subj":"15608183-12519946-76783019","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#93bbec","default":true}]}]}}