CORD-19:03476153a6c18a85c98721c557f35b1a09755d14

Annotations

Id Subject Object Predicate Lexical cue
T1 213-462 Epistemic_statement denotes Discussions will be illustrated with a limited number of virus databases, including our own, the Viral Orthologous Clusters (VOCs) database, which forms the core of Virology.ca that supports researchers working with a variety of large dsDNA viruses.
T2 463-730 Epistemic_statement denotes Depending where you are in the world, GenBank, (maintained by the National Center for Biotechnology Information, NCBI, USA), the EMBL Data Library (maintained in the UK) or the DNA Databank of Japan may first come to mind when thinking about bioinformatics databases.
T3 731-854 Epistemic_statement denotes However, these large repositories of genomic, and other, sequence information have a relatively simple flat-file structure.
T4 855-1034 Epistemic_statement denotes When submitting a viral genome sequence to one of these databases, the sequence data must be annotated with various information including the positions of coding DNA sequences (CDS).
T5 1035-1133 Epistemic_statement denotes Flat-file databases can only be queried for information that is explicitly stated within the file.
T6 1134-1255 Epistemic_statement denotes For example, queries can be made by keywords in the protein name, but not derivative characteristics such as gene length.
T7 1659-1816 Epistemic_statement denotes When building such a relational database, the designer must define the structure of all of the data records to be stored and their association to each other.
T8 1817-2016 Epistemic_statement denotes The database schema illustrates these relationships and acts as a map to the organization of the data storage locations, that is specific containers (dataset tables) for storing each type of dataset.
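The contrast between flat files and a relational schema can be made concrete with a small sketch. The two tables, their column names and the example rows below are hypothetical illustrations, not the actual VOCs schema; the point is only that a derived characteristic such as gene length, which is not stated explicitly in a flat file, becomes a simple query once CDS positions are stored in their own columns.

```python
import sqlite3

# Hypothetical two-table schema: one genome has many genes (one-to-many).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (
    genome_id  INTEGER PRIMARY KEY,
    virus_name TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    genome_id INTEGER NOT NULL REFERENCES genome(genome_id),
    product   TEXT,
    start_pos INTEGER NOT NULL,  -- 1-based, inclusive
    end_pos   INTEGER NOT NULL   -- 1-based, inclusive
);
""")

# Invented example data for illustration only.
conn.execute("INSERT INTO genome VALUES (1, 'Example virus A')")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "DNA polymerase", 100, 3120),
        (2, 1, "hypothetical protein", 3200, 3499),
    ],
)

# Gene length is derived from the stored positions at query time.
rows = conn.execute("""
    SELECT g.virus_name, n.product, n.end_pos - n.start_pos + 1 AS length
    FROM gene AS n
    JOIN genome AS g ON g.genome_id = n.genome_id
    WHERE n.end_pos - n.start_pos + 1 > 1000
""").fetchall()
print(rows)
```

A real schema would need further tables (for example, for ortholog groups and annotation evidence), but the same join pattern applies.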
T9 2320-2441 Epistemic_statement denotes Various file formats aim to capture this viral sequence data and associated knowledge, including GenBank and XML formats.
T10 2553-2651 Epistemic_statement denotes Similarly, these files can be worked with using standard text editors or word-processing programs.
T11 2652-2842 Epistemic_statement denotes A downside is that minor changes to the format of GenBank files can cause software tools that read these files (parsers) to fail if they have not been programmed to handle these differences.
T12 3153-3307 Epistemic_statement denotes Although XML files are excellent for representing relationships in a form that can be read by a computer, they are more difficult for users to understand.
T13 3596-3823 Epistemic_statement denotes However, if a database is to be useful to the virology research community, then there must be at least one additional component in this system: an easy-to-use interface to enable a virologist to interact with the database, i.e.
T14 3903-3974 Epistemic_statement denotes Furthermore, the database itself may form part of an analysis pipeline.
T15 4145-4386 Epistemic_statement denotes Rather than the database returning a file of the sequences to be formatted before submission to a program such as ClustalO or MUSCLE for alignment, it might be desirable to have the database send the sequences directly to the alignment tool.
T16 4387-4535 Epistemic_statement denotes The tool can then run the desired function before returning the processed data within a visualization tool that simplifies editing of the final MSA.
T17 4536-4690 Epistemic_statement denotes Although simplification of this process with default parameters is clearly valuable to the user, over-simplification can lead to restricted function, i.e.
T18 4739-4966 Epistemic_statement denotes Therefore, it is valuable to provide a variety of export options allowing users to obtain formats that can be used directly with multiple analysis tools and permit full range of parameter access, as discussed in later sections.
T19 4967-5099 Epistemic_statement denotes Design of the user interface with potential queries and flexibility of data retrieval in mind is key to the utility of the database.
T20 5238-5461 Epistemic_statement denotes Generally, the databases are optimized for two main functionalities: 1) storing and updating structured data with high integrity and 2) providing tools for searching, retrieving, summarizing and possibly analyzing the data.
T21 5753-5863 Epistemic_statement denotes The relationships between these tables are explicitly defined and are not limited to one-to-one relationships.
T22 6010-6125 Epistemic_statement denotes An additional aspect of any discussion on database design is maintenance, which can be viewed from several aspects.
T23 6481-6620 Epistemic_statement denotes A second view of maintenance, which is more integral to database design, concerns the data: 1) how will data be imported into the database?
T24 6621-6672 Epistemic_statement denotes 2) will data need to be edited (updated/corrected)?
T25 6673-6757 Epistemic_statement denotes Again, these questions have a natural division along the lines of virus genome type.
T26 6758-7003 Epistemic_statement denotes For example, some form of automated process is essential to import the thousands of genomes required for an influenza or HCV (small genomes, fixed gene complement) resource, but these small sequences are unlikely to need subsequent modification.
T27 7004-7269 Epistemic_statement denotes In contrast, a resource such as Virology.ca, which deals with large poxviruses, needs to contend with far fewer genomes, but must handle changes to genome annotation both before and after import into the database because these large dsDNA genomes are less well characterized.
T28 7270-7502 Epistemic_statement denotes Therefore, while a software script to automatically collect new genome GenBank files and insert them into a database might be feasible for an influenza virus database, this process cannot be used for poxvirus genomes in Virology.ca.
T29 7503-7606 Epistemic_statement denotes The complexity of the latter frequently leads to variations (sometimes errors) in annotation protocols.
T30 7607-7715 Epistemic_statement denotes Furthermore, new discoveries surrounding gene function mean that old genomes must sometimes be re-annotated.
T31 7716-7907 Epistemic_statement denotes Therefore, key to the design of database systems like Virology.ca is the inclusion of a manual, user-friendly database-editing tool that will allow virologists to enter and maintain the data.
T32 7908-8022 Epistemic_statement denotes Another frequent need for database editing in Virology.ca results from the assignment of genes to ortholog groups.
T33 8353-8595 Epistemic_statement denotes Although orthology is simply the prediction of a common ancestor between genes of different species, the extreme diversity among viruses in a single taxonomic family creates difficulties in the accurate assignment of genes to ortholog groups.
T34 9098-9203 Epistemic_statement denotes Unfortunately, this work is sometimes viewed as a non-discovery based science and funding is problematic.
T35 9380-9561 Epistemic_statement denotes Although many bioinformatics databases may be in some way relevant to viruses, the majority of virologists tend to think of, and also use most frequently, genome sequence databases.
T36 9562-9668 Epistemic_statement denotes As a result, it would be hard to find a virologist who hasn't performed a BLAST search of these databases.
T37 9669-9892 Epistemic_statement denotes Similarly, any virologist who has partially or completely sequenced a viral genome is familiar with the process of annotating the sequence for submission to GenBank (or local equivalent) before their paper can be published.
T38 9893-10189 Epistemic_statement denotes Three levels of annotated information exist: basic information within a GenBank file that can be directly transferred into databases; values calculated or predicted from the basic data; and curated information, beyond that provided when first developing the GenBank file, added manually by users.
T39 10190-10429 Epistemic_statement denotes The GenBank sequence file usually contains basic virus identification information, affiliations of the annotator, CDS locations within the genome, metadata, gene and product type/names, as well as mRNA splicing information, if appropriate.
T40 10549-10747 Epistemic_statement denotes However, the sequence data itself can be used to calculate various values that might be required by the end-user such as A + T content, protein pI, molecular mass of proteins and amino acid content.
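As a minimal sketch of such calculated values, the functions below compute A + T content for a nucleotide sequence and amino acid composition for a protein. The function names are illustrative assumptions; protein pI and molecular mass need residue-specific constants and are better taken from an established library than re-derived here.

```python
from collections import Counter


def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases in sequence")
    return (counts["A"] + counts["T"]) / total


def aa_composition(protein: str) -> dict:
    """Fraction of each amino acid present in a protein sequence."""
    protein = protein.upper()
    length = len(protein)
    return {aa: count / length for aa, count in Counter(protein).items()}


print(at_content("atgc"))
print(aa_composition("MKKA"))
```

Ambiguity codes such as N are ignored in the denominator here; a database would need to state which convention it uses so that stored values are comparable.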
T41 10748-10928 Epistemic_statement denotes Additionally, the data may be used as input for predictive tools, such as those that look for functional motifs that are associated with enzymatic function or protein localization.
T42 10929-10985 Epistemic_statement denotes Manual curation can also provide additional information.
T43 10986-11207 Epistemic_statement denotes For example, there is generally no information describing the likely reliability of the annotations or the source of the information (experimental data, software prediction or scientific literature) within a GenBank file.
T44 11277-11431 Epistemic_statement denotes Unfortunately, it is rare that this kind of supplementary information is incorporated into databases in a manner that can be searched in a meaningful way.
T45 11432-11610 Epistemic_statement denotes Keeping record of the reliability of annotations is especially important when investigating the genes of unknown function, frequently found within the varied large dsDNA viruses.
T46 11611-11729 Epistemic_statement denotes Functional assignment of the associated proteins relies heavily on the transfer of function from orthologous proteins.
T47 11730-11951 Epistemic_statement denotes Orthologous proteins, which may only be 20-30% identical, often do not match the same set of functional motifs, and vary in their ability to match distantly-related proteins that might give important clues as to function.
T48 12074-12175 Epistemic_statement denotes As noted above, accurate genome annotation is critical to providing a useful bioinformatics database.
T49 12176-12278 Epistemic_statement denotes However, this must be built upon a foundation of accurate, and preferably complete, genome sequencing.
T50 12279-12373 Epistemic_statement denotes The complexity of the annotation problem increases exponentially with genome size and novelty.
T51 12558-12968 Epistemic_statement denotes Large viruses, such as these, bring two additional complications to the annotation problem: 1) only a subset of the viral genes are essential, so some orthologs may be functional in one species but fragmented in another and, 2) some of the more diverse viruses have > 100 genes currently annotated as "hypothetical", and even the DNA polymerase protein may share only 32% aa identity with its closest relative.
T52 13125-13370 Epistemic_statement denotes Without the use of either a closely related genome, or multiple more distant relatives with similar gene sets, for reference genomes in annotation, it is very difficult to predict which ORFs are: true genes, gene fragments, or random small ORFs.
T53 13486-13623 Epistemic_statement denotes Although some groups share common physical virion structures, often only a few genes are found in common between related groups of phage.
T54 13783-13870 Epistemic_statement denotes Thus, the complexity of annotation relates directly to viral genome size and diversity.
T55 13993-14121 Epistemic_statement denotes Additionally, this process can often be fully automated as part of a pipeline to import newly sequenced genomes into a database.
T56 14279-14542 Epistemic_statement denotes The IRD pipeline aligns input nucleotide sequences against a consensus sequence profile to "identify possible sequencing errors, determine the influenza type, segment number, and for segments 4 and 6 of type A, the subtype, and translate the nucleotide sequence".
T57 15226-15430 Epistemic_statement denotes Both partial matches and potential novel genes are flagged in the GATU interface for the annotator to investigate further and make a final decision; tools are included in GATU to make this process easier.
T58 15550-15722 Epistemic_statement denotes The majority of these tools depend upon similarity searches against previously characterized genes, which can result in underannotation of unique or highly divergent genes.
T59 15723-15832 Epistemic_statement denotes For this reason, the use of a general list of genes may be more useful than using a specific reference virus.
T60 15833-16067 Epistemic_statement denotes Some annotation tools, especially for prokaryotic and eukaryotic genomes, attempt gene prediction by searching for promoter-like sequences and other gene characteristics; however, these are less useful for the annotation of viral genes.
T61 16259-16345 Epistemic_statement denotes This is invaluable in the annotation of large and complex viruses, such as poxviruses.
T62 16682-16795 Epistemic_statement denotes Although this change might alter a protein's function, most annotators would probably annotate this altered gene.
T63 16938-17218 Epistemic_statement denotes In this example the greatly shortened protein is very unlikely to be functional; however, because the second part of the gene is still present and represents 70% of the original gene, this is sometimes annotated (Cii) even though it is very unlikely to be translated from the mRNA.
T64 17349-17444 Epistemic_statement denotes Although the protein product is not changed, the larger ORF is sometimes erroneously annotated.
T65 17445-17595 Epistemic_statement denotes These annotation errors illustrate the need for more complex automated evaluation of annotation results or the continued input of the human annotator.
T66 17596-17771 Epistemic_statement denotes Although the various databases and associated sequence searching/analysis tools comprise a very valuable set of resources, it is also important to recognize their limitations.
T67 17772-17931 Epistemic_statement denotes Researchers need to remain wary of the various types of information because regardless of the source, errors and out-of-date files can fall through the cracks.
T68 18086-18220 Epistemic_statement denotes If available, reference genomes can be very valuable and be used to predict/correct potential errors either automatically or manually.
T69 18424-18599 Epistemic_statement denotes However, many of the genes present in the large DNA viruses such as poxviruses are non-essential; they can be, and often are, truncated in some viruses and complete in others.
T70 18600-18697 Epistemic_statement denotes Thus, researchers must remain aware of potential errors, regardless of the source or genome type.
T71 18941-19059 Epistemic_statement denotes • Errors of ignorance: a simple lack of knowledge, such as ortholog/paralog relationships not understood or viruses mis-named.
T72 19295-19386 Epistemic_statement denotes Another limitation of most databases is the absence of evidence for particular annotations.
T73 19497-19637 Epistemic_statement denotes However, even when evidence notes are present, there may be problems since experiments may eventually be discredited and/or require updates.
T74 19638-19758 Epistemic_statement denotes Curating such a system, essentially with the annotations as a living document, would also be incredibly labor intensive.
T75 19759-19965 Epistemic_statement denotes The lack of an unambiguous and controlled naming standard that is carried across all viruses and databases results in variable descriptions and data that are difficult to query for specific characteristics.
T76 19966-20139 Epistemic_statement denotes For example, software may not be programmed to recognise synonymous terms such as "ssDNA," "single strand DNA," "single-stranded DNA," and "single stranded DNA" as the same.
T77 20140-20288 Epistemic_statement denotes Where possible, import systems should have locks on permitted words, based on a universal, controlled vocabulary, in order to reduce this ambiguity.
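One way such a lock on permitted words might look at import time is sketched below. The synonym table and function name are hypothetical; a production system would draw its terms from a shared controlled vocabulary rather than a hard-coded dictionary.

```python
import re

# Hypothetical synonym table mapping free-text variants to one permitted term.
SYNONYMS = {
    "ssdna": "ssDNA",
    "single strand dna": "ssDNA",
    "single stranded dna": "ssDNA",
}


def normalize_genome_type(term: str) -> str:
    """Map a free-text genome type onto the controlled term, or reject it."""
    # Lower-case, then collapse runs of whitespace and hyphens to one space,
    # so "single-stranded DNA" and "single stranded DNA" share a key.
    key = re.sub(r"[\s\-]+", " ", term.strip().lower())
    try:
        return SYNONYMS[key]
    except KeyError:
        raise ValueError(f"term not in controlled vocabulary: {term!r}") from None


for variant in ["ssDNA", "single strand DNA", "single-stranded DNA", "single stranded DNA"]:
    print(variant, "->", normalize_genome_type(variant))
```

Rejecting unknown terms, rather than storing them as entered, is what makes this a lock: the curator must either correct the entry or extend the vocabulary deliberately.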
T78 20668-20877 Epistemic_statement denotes In addition, although some terms will still be applicable to only particular viruses (eg, the phages, see viral head-tail joining; GO:0098005), GO viral terms are designed to be species-neutral where possible.
T79 20878-21064 Epistemic_statement denotes However, complete implementation of a consistent and accurate vocabulary, such as that presented by GO, will only work if scientists choose to sustain it through their own participation.
T80 21220-21361 Epistemic_statement denotes Although the quality of raw data, annotations, and database structure is clearly important, a database is only as useful as its search tools.
T81 21362-21536 Epistemic_statement denotes The database must be able to execute the questions posed by virologists, which for user convenience is often accomplished through the use of a Graphical User Interface (GUI).
T82 21537-21743 Epistemic_statement denotes Although it is impossible to predict all of the queries that might be requested, the system should be flexible enough to provide a reasonably close search that may require minor post-search data processing.
T83 21878-22134 Epistemic_statement denotes For example, in a study examining the genomic variation of H1N1 influenza viruses obtained from humans between the years of 2000-2010, the researchers would have specific search parameters regarding influenza A subtype, host species, and year of isolation.
T84 22135-22403 Epistemic_statement denotes If the search interface did not permit all of these search parameters, then the researcher would be left with an arduous task of manually sorting the results that, depending upon the volume of information and computer skills of the user, may be far too time consuming.
T85 22404-22614 Epistemic_statement denotes Another aspect of the searching, which is virus-specific, is that for viruses with high sequencing volumes at similar times and locations, such as influenza, there may be many identical genomes in the database.
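The H1N1 example and the identical-genome issue can be sketched together. The records, field names and sequences below are invented for illustration and do not come from any real database; the sketch assumes each record carries subtype, host, year of isolation and sequence as separate fields.

```python
# Invented example records; real entries would hold full genome segments.
records = [
    {"id": "A1", "subtype": "H1N1", "host": "human", "year": 2005, "seq": "ATGGCA"},
    {"id": "A2", "subtype": "H1N1", "host": "human", "year": 2005, "seq": "ATGGCA"},
    {"id": "A3", "subtype": "H3N2", "host": "human", "year": 2006, "seq": "ATGGCT"},
    {"id": "A4", "subtype": "H1N1", "host": "swine", "year": 2008, "seq": "ATGGCC"},
    {"id": "A5", "subtype": "H1N1", "host": "human", "year": 2011, "seq": "ATGACA"},
]


def search(records, subtype, host, first_year, last_year):
    """Return records matching subtype, host and an inclusive year range."""
    return [
        r for r in records
        if r["subtype"] == subtype
        and r["host"] == host
        and first_year <= r["year"] <= last_year
    ]


def collapse_identical(records):
    """Keep the first record seen for each distinct sequence."""
    first_by_seq = {}
    for r in records:
        first_by_seq.setdefault(r["seq"], r)
    return list(first_by_seq.values())


hits = search(records, "H1N1", "human", 2000, 2010)
unique_hits = collapse_identical(hits)
print([r["id"] for r in hits], [r["id"] for r in unique_hits])
```

Whether identical genomes should be collapsed depends on the question: they are redundant for a phylogeny but informative for counting isolates, so the filter is best left optional.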
T86 22840-23106 Epistemic_statement denotes Although it would be preferable to have database resources supported long term so that they can respond to users' requests for new queries etc., it is not necessarily an efficient use of resources to build every feature requested by users into the database interface.
T87 23107-23294 Epistemic_statement denotes Clearly a cost-benefit analysis must be performed on the requests for features to be included in the software so that money and effort can be targeted at the most-used database functions.
T88 23295-23477 Epistemic_statement denotes However, the system must be able to provide users that have one-off analyses some basic filters to work with, while users must accept that it may off-load some data analysis to them.
T89 23664-23846 Epistemic_statement denotes FASTA formatting of nucleotide and protein sequences is a standard because multiple sequences can be incorporated into one file, and they can be read by many bioinformatics programs.
T90 23847-23905 Epistemic_statement denotes However, most annotations are stripped out of these files.
T91 23906-24048 Epistemic_statement denotes In contrast, GenBank files contain gene annotations, but these files can be tricky for software to read due to non-standard formatting errors.
T92 24049-24223 Epistemic_statement denotes The most basic output format is comma-separated values (csv), a tabular output that can often be read by other software, or even spreadsheet programs such as Microsoft Excel.
T93 24224-24295 Epistemic_statement denotes Yet, for many researchers data in these formats is very tedious to use.
T94 24296-24461 Epistemic_statement denotes Therefore, databases are frequently paired with visualization and analysis tools, or permit the export of data in a format that can be accepted by external programs.
T95 24462-24605 Epistemic_statement denotes For example, the Virology.ca database, VOCs, is linked to the Base-By-Base (BBB) visualization tool, which displays and allows editing of MSAs.
T96 24918-25011 Epistemic_statement denotes Databases are commonly divided by data type; however, this is not as simple as it might seem.
T97 25012-25163 Epistemic_statement denotes For example, a series of databases, each dedicated to a family of viruses, might all need to support many different types of molecular biological data.
T98 25164-25317 Epistemic_statement denotes Alternatively, databases managing a particular data type (eg, sequence or virion structure) would need to deal with many taxonomically different viruses.
T99 25378-25581 Epistemic_statement denotes Although this article aims to provide an updated overview of the biological databases relevant to viruses, as a print publication it is important to note that this resource will become quickly out-dated.
T100 25944-26097 Epistemic_statement denotes Alternatively, Google may be your friend, but it will help if the name of the database is not easily confused with a variety of other Internet resources.
T101 26098-26280 Epistemic_statement denotes The Virus Pathogen Resource (ViPR, pronounced viper) is particularly tricky to find if you don't know it's a Bioinformatics Resource Center and therefore you should look for viprbrc.
T102 26281-26429 Epistemic_statement denotes The search is further complicated by the existence of VIPERdb, a separate database of virus capsid structures and the VIPRE Antivirus software tool.
T103 26430-26560 Epistemic_statement denotes Although this review focuses on virus databases, most of these are reliant on other generic databases as the source of their data.
T104 27017-27141 Epistemic_statement denotes This provides up-to-date public access to nucleotide sequence data that can be accessed through any of the three interfaces.
T105 27542-27751 Epistemic_statement denotes This information helps in the reproducibility of genome assemblies in cases requiring review, and allows the user to analyse whether unexpected results are characteristic of the virus or a systematic artifact.
T106 27752-27957 Epistemic_statement denotes However, advances in sequencing technology raise the question of the value of saving raw sequencing data; re-sequencing of samples (given their availability) is fast, easy, cheap and increasingly accurate.
T107 28056-28222 Epistemic_statement denotes Therefore, if sequence errors are detected or new genes discovered within a genome by other research groups, it might be impossible to update the original submission.
T108 28223-28407 Epistemic_statement denotes To deal with this problem, NCBI has created the Reference Sequences resource (RefSeq), which offers up-to-date reference genomes for taxonomically diverse organisms, including viruses.
T109 28692-28798 Epistemic_statement denotes RefSeq files are not limited to nucleotide sequences, but also offer transcript and protein sequence data.
T110 29079-29731 Epistemic_statement denotes The database is divided into several branches: UniProtKB, the protein knowledgebase with two subsections, TrEMBL (translated EMBL Nucleotide sequence data library) that stores automatically annotated proteins prior to review and Swiss-Prot, containing proteins that have been manually annotated and reviewed, and often have associated literature; UniParc functions as an archive, sorting new, revised and obsolete sequences with a non-redundant numbering scheme allowing outdated UniProt references from past literature to be traceable; and UniRef100, 90 and 50 branches that cluster proteins into groups of 100%, 90% and 50% aa identity, respectively.
T111 29732-29937 Epistemic_statement denotes To ensure up-to-date public access, new protein sequences must be submitted to UniProt prior to publication; the new protein sequences associated with a genome submitted to GenBank are automatically added.
T112 29938-30129 Epistemic_statement denotes Knowledge of a protein's 3D structure can assist in the prediction of its functions and interactions, which are important aspects of understanding viral processes and drug and vaccine-design.
T113 30130-30245 Epistemic_statement denotes As a result, 3D structures have been determined for many viral proteins, and in some cases, of the complete virion.
T114 30246-30375 Epistemic_statement denotes The Protein Data Bank (PDB) collects all biochemical structures, but searches can be limited to the structures of viral proteins.
T115 30749-31325 Epistemic_statement denotes These include: VIPERdb, which is maintained at the Scripps Research Institute and is a database for icosahedral virus capsid structures; The Big Picture Book of Viruses, a large catalog of virus pictures with associated information; Virus World, a summary of pictures and links available from PDB and VIPERdb organized by virus name; and Viral Protein Structure Resource (ViPs), a database that aims to provide a central source for all viral protein structures and provides a genome map feature allowing the user to determine which genes have structural information available.
T116 31629-31836 Epistemic_statement denotes Several of these will be discussed below; however, as an aside, there are also aspects of the standard databases that provide useful compartmentalization of data to make working with specific viruses easier.
T117 32503-32655 Epistemic_statement denotes Understanding these differences can help categorize the various virus-specific databases, and address which types of information can be drawn from each.
T118 32736-32869 Epistemic_statement denotes At the most basic level, a website can be used to present a collection of links to sequences in files or more traditional databases.
T119 33005-33155 Epistemic_statement denotes Although these resources usually deal with small data sets, they can help offer a more visually appealing and manageable access point for researchers.
T120 33156-33327 Epistemic_statement denotes Although the same files can be easily stored on a local desktop computer, there is an important benefit to accessing them from a website: the data should be the same/unaltered.
T121 33328-33506 Epistemic_statement denotes Since most researchers tend to collect dozens of versions of sequences, often with various edits, accessing data from a website ensures the sequence is what it is supposed to be.
T122 34030-34165 Epistemic_statement denotes For example, when NCBI BLAST searches the sequence databases, it is possible to filter the results by keywords and taxonomy categories.
T123 34783-34980 Epistemic_statement denotes Examples of this genome-related data could include: G + C% content of genes and genomes, codon composition of genes, pI of proteins, predicted MW of proteins and amino acid composition of proteins.
T124 35326-35663 Epistemic_statement denotes The remainder of this section will compare and contrast three types of virus database resources: the Influenza Virus Resource at NCBI; the Virus Pathogen Resource Bioinformatics Resource Center (ViPRbrc) supported by the NIH, which supports a variety of viruses; and the author's Virology.ca resource, which supports large dsDNA viruses.
T125 35879-36050 Epistemic_statement denotes The search interface supports the query of sequences (protein, CDS or nucleotide) by influenza type, genome segment, serotype, collection date, host and country of origin.
T126 36051-36209 Epistemic_statement denotes Additional filters can remove sequences that are not full length (of segment), that are not part of a full genome set, or that are identical to another virus.
T127 36210-36346 Epistemic_statement denotes In addition, pandemic H1N1 sequences can be included or excluded, as can sequences from lab strains, vaccines or certain other projects.
T128 36347-36508 Epistemic_statement denotes Once selected, sequences can be exported to a local computer or aligned with subsequent generation of a phylogenetic tree; the interface is completely web-based.
T129 36794-36952 Epistemic_statement denotes ViPRbrc is funded by NIH to support research on viruses on the NIAID Category A-C Priority Pathogen lists, and those causing (re)emerging infectious diseases.
T130 37695-37814 Epistemic_statement denotes Although the database selection, retrieval and analysis tools are still web-based, they are comprehensive and advanced.
T131 37815-37936 Epistemic_statement denotes Some of the analyses provide graphical output (eg, genome maps), but have limited comparative tools for further analysis.
T132 38105-38287 Epistemic_statement denotes This allows the user to store a variety of retrieved sequence sets, eliminating the need to repeat the tedious selection process, which in turn likely increases accuracy of the work.
T133 38515-38664 Epistemic_statement denotes Data can also be exported from the database in a variety of formats, including GFF3, which is a tab-delimited format for describing genomic features.
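For readers unfamiliar with GFF3, each feature occupies one line of nine tab-separated columns: seqid, source, type, start, end, score, strand, phase and attributes. The helper below is a minimal sketch of writing one such line; the function name and the example values are hypothetical, and it does not perform the character escaping that the GFF3 specification requires for attribute values containing reserved characters.

```python
def gff3_line(seqid, source, ftype, start, end, strand,
              phase=".", score=".", attributes=None):
    """Format one feature as a nine-column, tab-delimited GFF3 line."""
    # Attributes are written as key=value pairs separated by semicolons;
    # a lone "." marks an empty column.
    attrs = ";".join(f"{k}={v}" for k, v in (attributes or {}).items()) or "."
    columns = [seqid, source, ftype, str(start), str(end),
               score, strand, phase, attrs]
    return "\t".join(columns)


# Invented example feature: a CDS on the forward strand, phase 0.
line = gff3_line("example_genome", "example_source", "CDS", 100, 3120, "+",
                 phase="0", attributes={"ID": "cds1", "product": "DNA polymerase"})
print("##gff-version 3")
print(line)
```

Because the columns are positional, the output can be read back with any tab-splitting parser, which is part of what makes the format convenient for exchange between tools.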
T134 38942-39256 Epistemic_statement denotes To this aim, ViPRbrc tools allow BLAST searches of custom (virus-specific) databases and the creation and visualization of MSAs and phylogenetic trees, and one of the more valuable services provided is the tool for Analysis of Sequence Variation, that is SNPs (also amino acid variation) along sequences of an MSA.
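The idea behind analyzing sequence variation along an MSA can be illustrated with a short sketch. This is not the ViPRbrc implementation; it is a simplified stand-in that reports alignment columns containing more than one residue, ignoring gap characters, on an invented toy alignment.

```python
def variable_columns(alignment):
    """Return (1-based column, sorted residues) for each variable column.

    Gap characters ('-') are ignored when deciding whether a column varies.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    variable = []
    for i in range(length):
        residues = {seq[i].upper() for seq in alignment} - {"-"}
        if len(residues) > 1:
            variable.append((i + 1, sorted(residues)))
    return variable


# Toy alignment: column 4 differs (C vs T); column 2 has only a gap difference.
msa = ["ATGCA", "ATGTA", "A-GCA"]
print(variable_columns(msa))
```

The same function works unchanged on aligned protein sequences, which corresponds to the amino acid variation mentioned above.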
T135 39257-39457 Epistemic_statement denotes In contrast to ViPRbrc, Virology.ca is dedicated to the support of large dsDNA viruses and as noted previously, this involves the provision of different tools to perform comparative genomics analyses.
T136 39458-39668 Epistemic_statement denotes Although the genomes are 20-30 times the size of the RNA viruses, the current poxvirus database at Virology.ca contains fewer than 270 genomes, with the most studied virus species having 30-50 representatives.
T137 39819-40000 Epistemic_statement denotes However, only 40 are present in all poxvirus genomes, but this conserved set of core genes increases to about 80 if poxviruses that infect insects are excluded from the calculation.
T138 40001-40096 Epistemic_statement denotes The remaining non-essential genes are often associated with host range and virulence functions.
T139 40097-40325 Epistemic_statement denotes In the Orthopoxvirus genus, the biggest genomes tend to be found in viruses with the widest host range; it is thought that the restriction of viruses to a limited host range is associated with the loss of genes or gene function.
T140 41763-41843 Epistemic_statement denotes The program can also be used to view RNA-Seq data and analyze for recombination.
T141 41844-42184 Epistemic_statement denotes BBB offers its own format of data storage, an XML file (.bbb), that allows for additional features such as the storage of user comments, primer annotations, and genome MSAs with gene annotations. JDotter: A program for generating dotplots; suitable for whole genomes, sub-genomes or protein sequences. Genome Annotation Transfer Utility (GATU):
T142 42185-42261 Epistemic_statement denotes A tool used to annotate genomes based on a closely related reference genome.
T143 42420-42544 Epistemic_statement denotes GATU also suggests novel genes to the human annotator, who has the last word on the annotation process. Sequence Searcher (SSeq):
T144 42545-42645 Epistemic_statement denotes An easy-to-use Java tool for searching protein and DNA sequences for user-specified sequence motifs.
T145 42714-43016 Epistemic_statement denotes These features are helpful when working with the large DNA viruses because of the uncertainty associated with the annotation of various genes: genes that are predicted to encode small proteins with a very high or very low pI and unusual amino acid composition are likely to represent annotation errors.
T146 43017-43200 Epistemic_statement denotes Indeed, many of the features present in Base-By-Base and VGO, the tools that display the genome sequences and genome maps, respectively, are devoted to helping solve annotation issues.
T147 43201-43345 Epistemic_statement denotes For the large DNA viruses, accurate annotation is very important because a common investigation asks, Why is virus X more virulent than virus Y?
T148 43550-43662 Epistemic_statement denotes So far, we have divided discussion issues by genome type; however, two additional databases should be mentioned.
T149 45150-45279 Epistemic_statement denotes This information, defining relationships among viruses, aids in the classification and understanding of newly discovered viruses.
T150 45885-46016 Epistemic_statement denotes However, there are 78 families of viruses that are not assigned to an order, including one called Unassigned, which contains 14 genera.
T151 46269-46477 Epistemic_statement denotes Sequences need to be organized to facilitate similarity searching for related sequences, for example with BLAST, and also organized by functional or source (human, rodent, virus) relationships.
T152 46478-46690 Epistemic_statement denotes However, it's neither feasible nor affordable to create independent database resources for every organism; therefore, resources like ViPRbrc that can function with a variety of viruses may become more commonplace.
T153 46691-46862 Epistemic_statement denotes With respect to genome sequencing, one of the greatest problems lies with the huge volume of raw data (sequencing reads) that is associated with any final genome sequence.
T154 46863-47098 Epistemic_statement denotes The authors recently received 3 GB of compressed sequencing data, a mix of host and virus sequences, to assemble a 150 kb poxvirus genome; it is estimated that genomic data will soon become the world's largest consumer of disk storage.
T155 47456-47646 Epistemic_statement denotes Perhaps, in the not too distant future, it will not be unusual to have our own genome sequenced, as well as our various organ microbiomes and the genomes of any pathogens we are infected by.
T156 47647-48053 Epistemic_statement denotes To take advantage of the new sequencing technology's ability to generate massive sequence coverage, which helps to reduce errors in the initial sequencing process and reveals the natural variation among the genomes of a virus population, we must also develop new annotation strategies, such as how to annotate SNPs in virus genomes and gene fragments in the non-essential gene sets of large viruses.
T157 48054-48164 Epistemic_statement denotes The goal of this type of annotation is to provide the virologist with more information in comparative studies.
T158 48299-48443 Epistemic_statement denotes The questioner is really asking about the presence of functional genes, but would likely be interested in knowing the mechanism of the deletion.
T159 48444-48592 Epistemic_statement denotes For example, the questioner may want to know whether the gene was inactivated by a series of deletions or by a single nucleotide change that could be the result of a sequencing error.
T160 48593-48910 Epistemic_statement denotes Not only is the number of genomes requiring annotation increasing exponentially, but the nature of the data is also changing with metagenomics' ability to determine all of the nucleotide sequences from a crude environmental sample without the need to culture microbes in the laboratory or probe for a particular gene.
T161 48911-49126 Epistemic_statement denotes Since many organisms fail to grow in the laboratory, we cannot grow viruses that are unique to them; thus, metagenomics provides a valuable window into the previously unknown diversity of viruses in the environment.
T162 49127-49264 Epistemic_statement denotes Even with large-scale environmental sequencing under way, most metagenomic sequences appear to be unrelated to currently known sequences.
T163 49365-49452 Epistemic_statement denotes However, metagenomics also comes with its own list of special problems and limitations.
T164 49549-49675 Epistemic_statement denotes Genomic assembly can be difficult because the metagenomic short reads often provide less coverage than traditional sequencing.
T165 49881-50027 Epistemic_statement denotes Although assembly against a reference genome is helpful, this approach can't be used with environmental samples full of unknown microbial species.
T166 50168-50312 Epistemic_statement denotes Furthermore, metadata is often left out of annotations, yet it is required for the data to be put into a useful, comparable, biological context.
T167 50483-50606 Epistemic_statement denotes Thus, it is crucial that annotation of metadata be included as a mandatory and standardized component of genome submission.
T168 50607-50913 Epistemic_statement denotes With the accumulation of more and more diverse sequences from metagenomic sequencing of environmental samples, the organization of gene/protein sequences into Clusters of Orthologous Groups (COGs; http://www.ncbi.nlm.nih.gov/COG/) and Viral COGs (VOGs) becomes especially useful for defining relationships.
T169 50914-51260 Epistemic_statement denotes Even when matches between orthologs may be at the limit of detection, the greater the number of sequences in a cluster, the greater the chance of making new matches by virtue of transitive sequence comparisons; that is, SeqA matches SeqB but not SeqC; if SeqB also matches SeqC, then a transitive relationship between SeqA and SeqC is established.
T170 51678-52274 Epistemic_statement denotes Several features of this updated version may help increase the knowledge and use of ortholog groups by the virology community: the new web interface is simplified for less experienced users and offers new browsing and visualization capabilities; the new version has improved consistency and impact through incorporation and linking of Gene Ontology terms, KEGG pathways and UniProt and SMART/Pfam domains to ortholog group assignments; and the algorithms and pipelines for assignment are made freely available on the website, encouraging individuals to apply the techniques to their own datasets.
T171 52275-52582 Epistemic_statement denotes In recent years, the growth in both the volume and types of data that can be considered bioinformatic in nature has forced the scientific community to consider how much of a role that the manual forms of genome annotation, curation and maintenance can continue to play in assigning knowledge to new genomes.
T172 52583-52771 Epistemic_statement denotes A compromise may be found in the maintenance of manually curated reference genomes, and development of programs to aid in increasing the accuracy of automated annotations of new sequences.
T173 52772-52997 Epistemic_statement denotes Finally, the question remains whether funding will be made available for tailoring databases to the needs of virology researchers working on the wide array of bioinformatics challenges that these new data sets are generating.
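
Note on T169: the transitive sequence comparison described there (SeqA matches SeqB, SeqB matches SeqC, so SeqA and SeqC end up in the same cluster) can be illustrated with a small union-find sketch in Python. This is only an illustration under stated assumptions, not the procedure used to build COGs or VOGs: the function name cluster_by_transitive_matches is hypothetical, and the list of pairwise matches is assumed to have been produced beforehand by some similarity search (for example, BLAST hits above a chosen threshold).

```python
def cluster_by_transitive_matches(sequence_ids, matches):
    """Group sequences so that any chain of pairwise matches shares a cluster.

    sequence_ids: iterable of sequence names (hypothetical identifiers).
    matches: iterable of (id_a, id_b) pairs assumed to come from a prior
             similarity search; how a "match" is decided is outside this sketch.
    """
    sequence_ids = list(sequence_ids)
    # Each sequence starts as the representative of its own cluster.
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(seq_id):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    for id_a, id_b in matches:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            # Merge the two clusters joined by this direct match.
            parent[root_b] = root_a

    clusters = {}
    for seq_id in sequence_ids:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    # Return clusters as sorted lists in a deterministic order.
    return sorted(sorted(members) for members in clusters.values())


# SeqA matches SeqB, and SeqB matches SeqC; SeqA and SeqC have no direct match.
example_ids = ["SeqA", "SeqB", "SeqC", "SeqD"]
example_matches = [("SeqA", "SeqB"), ("SeqB", "SeqC")]
example_clusters = cluster_by_transitive_matches(example_ids, example_matches)
print(example_clusters)
```

In this example SeqA and SeqC are placed in one cluster through SeqB, while SeqD, which matches nothing, remains alone. Linking on any single match is the loosest possible rule and can merge unrelated families through one spurious hit, so a real ortholog pipeline would be expected to apply stricter criteria before accepting a link.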