PMC:15027 / 2576-8317
Annotations
2_test
{"project":"2_test","denotations":[{"id":"11178258-10521334-44607466","span":{"begin":480,"end":481},"obj":"10521334"},{"id":"11178258-9790834-44607467","span":{"begin":482,"end":483},"obj":"9790834"},{"id":"11178258-10835597-44607468","span":{"begin":484,"end":485},"obj":"10835597"},{"id":"11178258-10366660-44607469","span":{"begin":1399,"end":1400},"obj":"10366660"},{"id":"11178258-8851977-44607470","span":{"begin":1401,"end":1402},"obj":"8851977"},{"id":"11178258-9206831-44607471","span":{"begin":1661,"end":1662},"obj":"9206831"},{"id":"11178258-9537411-44607472","span":{"begin":1663,"end":1664},"obj":"9537411"},{"id":"11178258-10592173-44607473","span":{"begin":1933,"end":1934},"obj":"10592173"},{"id":"11178258-10592199-44607474","span":{"begin":1945,"end":1947},"obj":"10592199"},{"id":"11178258-10592234-44607475","span":{"begin":2230,"end":2232},"obj":"10592234"},{"id":"11178258-9381173-44607476","span":{"begin":2873,"end":2875},"obj":"9381173"},{"id":"11178258-5449325-44607477","span":{"begin":3018,"end":3020},"obj":"5449325"},{"id":"11178258-8748022-44607478","span":{"begin":3021,"end":3023},"obj":"8748022"},{"id":"11178258-10592175-44607479","span":{"begin":4435,"end":4437},"obj":"10592175"},{"id":"11178258-10382966-44607480","span":{"begin":4672,"end":4674},"obj":"10382966"}],"text":"Background\nFunctional annotation of genomes is a critical aspect of the genomics enterprise. Without reliable assignment of gene function at the appropriate level of specificity, new genome sequences are plainly useless. The primary methodology used for genome annotation is the sequence database search, the results of which allow transfer of functional information from experimentally characterized genes (proteins) to their uncharacterized homologs in newly sequenced genomes [1,2,3]. However, general-purpose, archival sequence databases are not particularly suited for the purpose of genome annotation. 
The quality of the annotation of a new genome produced using a particular database critically depends on the reliability and completeness of the annotations in the database itself. As far as annotation is concerned, the purpose of primary sequence databases is to faithfully preserve the description attached to each sequence by its submitter. In their capacity as sequence archives, such databases include no detailed documentation in support of the functional annotations. Furthermore, primary sequence databases are not explicitly structured by either evolutionary or functional criteria. These features, which are inevitable in archival databases, seriously impede their utility as resources for genome annotation, particularly when an automated or semi-automated approach is attempted [4,5]. At its worst, this situation results in a notorious vicious circle of error amplification - an inadequately annotated database is used to produce an error-ridden and incomplete annotation of a new genome, which in turn makes the database even less useful [6,7,8].\nOne way out of this 'Catch-22' situation is to use a different type of database for genome annotation, namely databases in which sequence information is organized by structural, functional or phylogenetic criteria, or a combination thereof. For example, the KEGG [9] and WIT [10] databases are primarily function-oriented and organize protein sequences from completely and partially sequenced genomes according to their known or predicted roles in biochemical pathways, although WIT also provides a phylogenetic classification. In contrast, the SMART database [11] is organized on a structural principle and provides a searchable collection of common protein domains. 
All these databases share a fundamental common feature - they encapsulate carefully verified knowledge on protein structure, function and/or evolutionary relationships, and therefore, at least in principle, provide for a more robust mode of genome annotation than general-purpose databases and may serve as a stronger foundation for partially automated approaches to genome analysis.\nThe database of Clusters of Orthologous Groups of proteins (COGs) is a phylogenetic classification of proteins encoded in completely sequenced genomes [12]. An attempt has been made to organize these proteins into groups of orthologs, direct evolutionary counterparts related by vertical descent [13,14]. Because of lineage-specific duplications, orthologous relationships in many cases exist between gene (protein) families, rather than between individual proteins, hence 'orthologous groups' (including only lineage-specific duplications in a COG is the principle of this analysis; in practice, because of insufficient resolution of sequence comparisons, certain COGs may include ancestral duplications). The principal phylogenetic classification in the COG database is overlaid with functional classification and annotation based on detailed sequence and structure analysis and published experimental data. The COG system has been designed as a platform for evolutionary analyses and for phylogenetic and functional annotation of genomes. The COGNITOR program associated with the COGs allows one to fit new proteins into existing COGs. The central tenet of this analysis is that, if it can be shown that the protein under analysis is an ortholog of functionally characterized proteins from other genomes, this functional information can be transferred to the analyzed protein with considerable confidence. In addition to COGNITOR, the COG system includes certain higher-level functionalities, such as analysis of phylogenetic patterns and co-occurrence of genomes in COGs. 
The current (as of 1 June, 2000) system consists of 2,112 COGs that encompass about 27,000 proteins from 21 completely sequenced genomes [15].\nHere we describe the application of the COGs to the systematic annotation and evolutionary analysis of two recently sequenced archaeal genomes, those of the euryarchaeon Pyrococcus abyssi [16] and the crenarchaeon Aeropyrum pernix [17]. These genomes were selected to compare the utility of the COGs for the annotation of two types of genomes - one that is closely related to another genome already included in the system, as Pyrococcus abyssi is to P. horikoshii, and one that represents a group previously not covered by the COGs, the Crenarchaeota. We show here the relatively low error rate of the COG-assisted analysis and its contribution to a significant number of new functional predictions. Emphasis is on using the COG approach to identify features of the A. pernix genome that are shared among all Archaea and those that distinguish Crenarchaeota from Euryarchaeota. Thus this work had a dual focus: first, to explore the potential of the COG system for genome annotation; and second, to use the COG approach to reveal important trends in archaeal genome evolution. It should not be construed as a comprehensive analysis of any particular genome or a comprehensive comparative and evolutionary study; addressing each of these tasks would require the use of several additional methodologies."}
Colil
{"project":"Colil","denotations":[{"id":"T1","span":{"begin":1933,"end":1934},"obj":"10592173"},{"id":"T2","span":{"begin":1945,"end":1947},"obj":"10592199"},{"id":"T3","span":{"begin":1663,"end":1664},"obj":"9537411"},{"id":"T4","span":{"begin":2873,"end":2875},"obj":"9381173"},{"id":"T5","span":{"begin":4672,"end":4674},"obj":"10382966"},{"id":"T6","span":{"begin":2230,"end":2232},"obj":"10592234"},{"id":"T7","span":{"begin":3021,"end":3023},"obj":"8748022"},{"id":"T8","span":{"begin":480,"end":481},"obj":"10521334"},{"id":"T9","span":{"begin":482,"end":483},"obj":"9790834"},{"id":"T10","span":{"begin":1399,"end":1400},"obj":"10366660"},{"id":"T11","span":{"begin":1401,"end":1402},"obj":"8851977"},{"id":"T12","span":{"begin":484,"end":485},"obj":"10835597"},{"id":"T13","span":{"begin":3018,"end":3020},"obj":"5449325"},{"id":"T14","span":{"begin":4435,"end":4437},"obj":"10592175"},{"id":"T15","span":{"begin":1661,"end":1662},"obj":"9206831"}],"namespaces":[{"prefix":"_base","uri":"http://pubannotation.org/docs/sourcedb/PubMed/sourceid/"}],"text":"Background\nFunctional annotation of genomes is a critical aspect of the genomics enterprise. Without reliable assignment of gene function at the appropriate level of specificity, new genome sequences are plainly useless. The primary methodology used for genome annotation is the sequence database search, the results of which allow transfer of functional information from experimentally characterized genes (proteins) to their uncharacterized homologs in newly sequenced genomes [1,2,3]. However, general-purpose, archival sequence databases are not particularly suited for the purpose of genome annotation. The quality of the annotation of a new genome produced using a particular database critically depends on the reliability and completeness of the annotations in the database itself. 
As far as annotation is concerned, the purpose of primary sequence databases is to faithfully preserve the description attached to each sequence by its submitter. In their capacity as sequence archives, such databases include no detailed documentation in support of the functional annotations. Furthermore, primary sequence databases are not explicitly structured by either evolutionary or functional criteria. These features, which are inevitable in archival databases, seriously impede their utility as resources for genome annotation, particularly when an automated or semi-automated approach is attempted [4,5]. At its worst, this situation results in a notorious vicious circle of error amplification - an inadequately annotated database is used to produce an error-ridden and incomplete annotation of a new genome, which in turn makes the database even less useful [6,7,8].\nOne way out of this 'Catch-22' situation is to use a different type of database for genome annotation, namely databases in which sequence information is organized by structural, functional or phylogenetic criteria, or a combination thereof. For example, the KEGG [9] and WIT [10] databases are primarily function-oriented and organize protein sequences from completely and partially sequenced genomes according to their known or predicted roles in biochemical pathways, although WIT also provides a phylogenetic classification. In contrast, the SMART database [11] is organized on a structural principle and provides a searchable collection of common protein domains. 
All these databases share a fundamental common feature - they encapsulate carefully verified knowledge on protein structure, function and/or evolutionary relationships, and therefore, at least in principle, provide for a more robust mode of genome annotation than general-purpose databases and may serve as a stronger foundation for partially automated approaches to genome analysis.\nThe database of Clusters of Orthologous Groups of proteins (COGs) is a phylogenetic classification of proteins encoded in completely sequenced genomes [12]. An attempt has been made to organize these proteins into groups of orthologs, direct evolutionary counterparts related by vertical descent [13,14]. Because of lineage-specific duplications, orthologous relationships in many cases exist between gene (protein) families, rather than between individual proteins, hence 'orthologous groups' (including only lineage-specific duplications in a COG is the principle of this analysis; in practice, because of insufficient resolution of sequence comparisons, certain COGs may include ancestral duplications). The principal phylogenetic classification in the COG database is overlaid with functional classification and annotation based on detailed sequence and structure analysis and published experimental data. The COG system has been designed as a platform for evolutionary analyses and for phylogenetic and functional annotation of genomes. The COGNITOR program associated with the COGs allows one to fit new proteins into existing COGs. The central tenet of this analysis is that, if it can be shown that the protein under analysis is an ortholog of functionally characterized proteins from other genomes, this functional information can be transferred to the analyzed protein with considerable confidence. In addition to COGNITOR, the COG system includes certain higher-level functionalities, such as analysis of phylogenetic patterns and co-occurrence of genomes in COGs. 
The current (as of 1 June, 2000) system consists of 2,112 COGs that encompass about 27,000 proteins from 21 completely sequenced genomes [15].\nHere we describe the application of the COGs to the systematic annotation and evolutionary analysis of two recently sequenced archaeal genomes, those of the euryarchaeon Pyrococcus abyssi [16] and the crenarchaeon Aeropyrum pernix [17]. These genomes were selected to compare the utility of the COGs for the annotation of two types of genomes - one that is closely related to another genome already included in the system, as Pyrococcus abyssi is to P. horikoshii, and one that represents a group previously not covered by the COGs, the Crenarchaeota. We show here the relatively low error rate of the COG-assisted analysis and its contribution to a significant number of new functional predictions. Emphasis is on using the COG approach to identify features of the A. pernix genome that are shared among all Archaea and those that distinguish Crenarchaeota from Euryarchaeota. Thus this work had a dual focus: first, to explore the potential of the COG system for genome annotation; and second, to use the COG approach to reveal important trends in archaeal genome evolution. It should not be construed as a comprehensive analysis of any particular genome or a comprehensive comparative and evolutionary study; addressing each of these tasks would require the use of several additional methodologies."}