PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T1	213-462	Epistemic_statement	denotes	Discussions will be illustrated with a limited number of virus databases, including our own, the Viral Orthologous Clusters (VOCs) database, which forms the core of Virology.ca that supports researchers working with a variety of large dsDNA viruses.
T2	463-730	Epistemic_statement	denotes	Depending on where you are in the world, GenBank (maintained by the National Center for Biotechnology Information, NCBI, USA), the EMBL Data Library (maintained in the UK) or the DNA Databank of Japan may first come to mind when thinking about bioinformatics databases.
T3	731-854	Epistemic_statement	denotes	However, these large repositories of genomic, and other, sequence information have a relatively simple flat-file structure.
T4	855-1034	Epistemic_statement	denotes	When submitting a viral genome sequence to one of these databases, the sequence data must be annotated with various information including the positions of coding DNA sequences (CDS).
T5	1035-1133	Epistemic_statement	denotes	Flat-file databases can only be queried for information that is explicitly stated within the file.
T6	1134-1255	Epistemic_statement	denotes	For example, queries can be made by keywords in the protein name, but not derivative characteristics such as gene length.
T7	1659-1816	Epistemic_statement	denotes	When building such a relational database, the designer must define the structure of all of the data records to be stored and their association to each other.
T8	1817-2016	Epistemic_statement	denotes	The database schema illustrates these relationships and acts as a map to the organization of the data storage locations, that is specific containers (dataset tables) for storing each type of dataset.
T9	2320-2441	Epistemic_statement	denotes	Various file formats aim to capture this viral sequence data and associated knowledge, including GenBank and XML formats.
T10	2553-2651	Epistemic_statement	denotes	Similarly, these files can be worked with using standard text editors or word-processing programs.
T11	2652-2842	Epistemic_statement	denotes	A downside is that minor changes to the format of GenBank files can cause software tools that read these files (parsers) to fail if they have not been programmed to handle these differences.
T12	3153-3307	Epistemic_statement	denotes	Although XML files are excellent for representing relationships in a form that can be read by a computer, they are more difficult for users to understand.
T13	3596-3823	Epistemic_statement	denotes	However, if a database is to be useful to the virology research community, then there must be at least one additional component in this system: an easy-to-use interface to enable a virologist to interact with the database, i.e.
T14	3903-3974	Epistemic_statement	denotes	Furthermore, the database itself may form part of an analysis pipeline.
T15	4145-4386	Epistemic_statement	denotes	Rather than the database returning a file of the sequences to be formatted before submission to a program such as ClustalO or MUSCLE for alignment, it might be desirable to have the database send the sequences directly to the alignment tool.
T16	4387-4535	Epistemic_statement	denotes	The tool can then run the desired function before returning the processed data within a visualization tool that simplifies editing of the final MSA.
T17	4536-4690	Epistemic_statement	denotes	Although simplification of this process with default parameters is clearly valuable to the user, over-simplification can lead to restricted function, i.e.
T18	4739-4966	Epistemic_statement	denotes	Therefore, it is valuable to provide a variety of export options allowing users to obtain formats that can be used directly with multiple analysis tools and permit full range of parameter access, as discussed in later sections.
T19	4967-5099	Epistemic_statement	denotes	Design of the user interface with potential queries and flexibility of data retrieval in mind is key to the utility of the database.
T20	5238-5461	Epistemic_statement	denotes	Generally, the databases are optimized for two main functionalities: 1) storing and updating structured data with high integrity and 2) providing tools for searching, retrieving, summarizing and possibly analyzing the data.
T21	5753-5863	Epistemic_statement	denotes	The relationships between these tables are explicitly defined and are not limited to one-to-one relationships.
T22	6010-6125	Epistemic_statement	denotes	An additional aspect of any discussion on database design is maintenance, which can be viewed from several aspects.
T23	6481-6620	Epistemic_statement	denotes	A second view of maintenance, which is more integral to database design, concerns the data: 1) how will data be imported into the database?
T24	6621-6672	Epistemic_statement	denotes	2) will data need to be edited (updated/corrected)?
T25	6673-6757	Epistemic_statement	denotes	Again, these questions have a natural division along the lines of virus genome type.
T26	6758-7003	Epistemic_statement	denotes	For example, some form of automated process is essential to import the thousands of genomes required for an influenza or HCV (small genomes, fixed gene complement) resource, but these small sequences are unlikely to need subsequent modification.
T27	7004-7269	Epistemic_statement	denotes	In contrast, a resource such as Virology.ca, which deals with large poxviruses, needs to contend with far fewer genomes, but must handle changes to genome annotation both before and after import into the database due to large, less well characterized, dsDNA genomes.
T28	7270-7502	Epistemic_statement	denotes	Therefore, while a software script to automatically collect new genome GenBank files and insert them into a database might be feasible for an influenza virus database, this process cannot be used for poxvirus genomes in Virology.ca.
T29	7503-7606	Epistemic_statement	denotes	The complexity of the latter frequently leads to variations (sometimes errors) in annotation protocols.
T30	7607-7715	Epistemic_statement	denotes	Furthermore, new discoveries surrounding gene function mean that old genomes must sometimes be re-annotated.
T31	7716-7907	Epistemic_statement	denotes	Therefore, key to the design of database systems like Virology.ca is the inclusion of a manual, user-friendly database-editing tool that will allow virologists to enter and maintain the data.
T32	7908-8022	Epistemic_statement	denotes	Another frequent need for database editing in Virology.ca results from the assignment of genes to ortholog groups.
T33	8353-8595	Epistemic_statement	denotes	Although orthology is simply the prediction of a common ancestor between genes of different species, the extreme diversity among viruses in a single taxonomic family creates difficulties in the accurate assignment of genes to ortholog groups.
T34	9098-9203	Epistemic_statement	denotes	Unfortunately, this work is sometimes viewed as a non-discovery based science and funding is problematic.
T35	9380-9561	Epistemic_statement	denotes	Although many bioinformatics databases may be in some way relevant to viruses, the majority of virologists tend to think of, and also use most frequently, genome sequence databases.
T36	9562-9668	Epistemic_statement	denotes	As a result, it would be hard to find a virologist who hasn't performed a BLAST search of these databases.
T37	9669-9892	Epistemic_statement	denotes	Similarly, any virologist who has partially or completely sequenced a viral genome is familiar with the process of annotating the sequence for submission to GenBank (or local equivalent) before their paper can be published.
T38	9893-10189	Epistemic_statement	denotes	Three levels of annotated information exist: basic information within a GenBank file that can be directly transferred into databases; values calculated or predicted from the basic data; and curated information, beyond that provided when first developing the GenBank file, added manually by users.
T39	10190-10429	Epistemic_statement	denotes	The GenBank sequence file usually contains basic virus identification information, affiliations of the annotator, CDS locations within the genome, metadata, gene and product type/names, as well as mRNA splicing information, if appropriate.
T40	10549-10747	Epistemic_statement	denotes	However, the sequence data itself can be used to calculate various values that might be required by the end-user such as A + T content, protein pI, molecular mass of proteins and amino acid content.
T41	10748-10928	Epistemic_statement	denotes	Additionally, the data may be used as input for predictive tools, such as those that look for functional motifs that are associated with enzymatic function or protein localization.
T42	10929-10985	Epistemic_statement	denotes	Manual curation can also provide additional information.
T43	10986-11207	Epistemic_statement	denotes	For example, there is generally no information describing the likely reliability of the annotations or the source of the information (experimental data, software prediction or scientific literature) within a GenBank file.
T44	11277-11431	Epistemic_statement	denotes	Unfortunately, it is rare that this kind of supplementary information is incorporated into databases in a manner that can be searched in a meaningful way.
T45	11432-11610	Epistemic_statement	denotes	Keeping record of the reliability of annotations is especially important when investigating the genes of unknown function, frequently found within the varied large dsDNA viruses.
T46	11611-11729	Epistemic_statement	denotes	Functional assignment of the associated proteins relies heavily on the transfer of function from orthologous proteins.
T47	11730-11951	Epistemic_statement	denotes	Orthologous proteins, which may only be 20-30% identical, often do not match the same set of functional motifs, and vary in their ability to match distantly-related proteins that might give important clues as to function.
T48	12074-12175	Epistemic_statement	denotes	As noted above, accurate genome annotation is critical to providing a useful bioinformatics database.
T49	12176-12278	Epistemic_statement	denotes	However, this must be built upon a foundation of accurate, and preferably complete, genome sequencing.
T50	12279-12373	Epistemic_statement	denotes	The complexity of the annotation problem increases exponentially with genome size and novelty.
T51	12558-12968	Epistemic_statement	denotes	Large viruses, such as these, bring two additional complications to the annotation problem: 1) only a subset of the viral genes are essential, so some orthologs may be functional in one species but fragmented in another and, 2) some of the more diverse viruses have > 100 genes currently annotated as "hypothetical", and even the DNA polymerase protein may share only 32% aa identity with its closest relative.
T52	13125-13370	Epistemic_statement	denotes	Without the use of either a closely related genome, or multiple more distant relatives with similar gene sets, for reference genomes in annotation, it is very difficult to predict which ORFs are: true genes, gene fragments, or random small ORFs.
T53	13486-13623	Epistemic_statement	denotes	Although some groups share common physical virion structures, often only a few genes are found in common between related groups of phage.
T54	13783-13870	Epistemic_statement	denotes	Thus, the complexity of annotation relates directly to viral genome size and diversity.
T55	13993-14121	Epistemic_statement	denotes	Additionally, this process can often be fully automated as part of a pipeline to import newly sequenced genomes into a database.
T56	14279-14542	Epistemic_statement	denotes	The IRD pipeline aligns input nucleotide sequences against a consensus sequence profile to "identify possible sequencing errors, determine the influenza type, segment number, and for segments 4 and 6 of type A, the subtype, and translate the nucleotide sequence".
T57	15226-15430	Epistemic_statement	denotes	Both partial matches and potential novel genes are flagged in the GATU interface for the annotator to investigate further and make a final decision; tools are included in GATU to make this process easier.
T58	15550-15722	Epistemic_statement	denotes	The majority of these tools depend upon similarity searches against previously characterized genes, which can result in underannotation of unique or highly divergent genes.
T59	15723-15832	Epistemic_statement	denotes	For this reason, the use of a general list of genes may be more useful than using a specific reference virus.
T60	15833-16067	Epistemic_statement	denotes	Some annotation tools, especially for prokaryotic and eukaryotic genomes, attempt gene prediction (searching for promoter-like sequences and other gene characteristics); however, these are less useful for the annotation of viral genes.
T61	16259-16345	Epistemic_statement	denotes	This is invaluable in the annotation of large and complex viruses, such as poxviruses.
T62	16682-16795	Epistemic_statement	denotes	Although this change might alter a protein's function, most annotators would probably annotate this altered gene.
T63	16938-17218	Epistemic_statement	denotes	In this example the greatly shortened protein is very unlikely to be functional; however, because the second part of the gene is still present and represents 70% of the original gene, this is sometimes annotated (Cii) even though it is very unlikely to be translated from the mRNA.
T64	17349-17444	Epistemic_statement	denotes	Although the protein product is not changed, the larger ORF is sometimes erroneously annotated.
T65	17445-17595	Epistemic_statement	denotes	These annotation errors illustrate the need for more complex automated evaluation of annotation results or the continued input of the human annotator.
T66	17596-17771	Epistemic_statement	denotes	Although the various databases and associated sequence searching/analysis tools comprise a very valuable set of resources, it is also important to recognize their limitations.
T67	17772-17931	Epistemic_statement	denotes	Researchers need to remain wary of the various types of information because regardless of the source, errors and out-of-date files can fall through the cracks.
T68	18086-18220	Epistemic_statement	denotes	If available, reference genomes can be very valuable and be used to predict/correct potential errors either automatically or manually.
T69	18424-18599	Epistemic_statement	denotes	However, many of the genes present in the large DNA viruses such as poxviruses are non-essential; they can be, and often are, truncated in some viruses and complete in others.
T70	18600-18697	Epistemic_statement	denotes	Thus, researchers must remain aware of potential errors, regardless of the source or genome type.
T71	18941-19059	Epistemic_statement	denotes	• Errors of ignorance (a simple lack of knowledge): ortholog/paralog relationships not understood or viruses mis-named.
T72	19295-19386	Epistemic_statement	denotes	Another limitation of most databases is the absence of evidence for particular annotations.
T73	19497-19637	Epistemic_statement	denotes	However, even when evidence notes are present, there may be problems since experiments may eventually be discredited and/or require updates.
T74	19638-19758	Epistemic_statement	denotes	Curating such a system, essentially with the annotations as a living document, would also be incredibly labor intensive.
T75	19759-19965	Epistemic_statement	denotes	The lack of an unambiguous and controlled naming standard that is carried across all viruses and databases results in variable descriptions and data that are difficult to query for specific characteristics.
T76	19966-20139	Epistemic_statement	denotes	For example, software may not be programmed to recognise synonymous terms such as "ssDNA," "single strand DNA," "single-stranded DNA," and "single stranded DNA" as the same.
T77	20140-20288	Epistemic_statement	denotes	Where possible, import systems should have locks on permitted words, based on a universal, controlled vocabulary, in order to reduce this ambiguity.
T78	20668-20877	Epistemic_statement	denotes	In addition, although some terms will still be applicable to only particular viruses (eg, the phages, see viral head-tail joining; GO:0098005), GO viral terms are designed to be species-neutral where possible.
T79	20878-21064	Epistemic_statement	denotes	However, complete implementation of a consistent and accurate vocabulary, such as that presented by GO, will only work if scientists choose to sustain it through their own participation.
T80	21220-21361	Epistemic_statement	denotes	Although the quality of raw data, annotations, and database structure is clearly important, a database is only as useful as its search tools.
T81	21362-21536	Epistemic_statement	denotes	The database must be able to execute the questions posed by virologists, which for user convenience is often accomplished through the use of a Graphical User Interface (GUI).
T82	21537-21743	Epistemic_statement	denotes	Although it is impossible to predict all of the queries that might be requested, the system should be flexible enough to provide a reasonably close search that may require minor post-search data processing.
T83	21878-22134	Epistemic_statement	denotes	For example, in a study examining the genomic variation of H1N1 influenza viruses obtained from humans between the years of 2000-2010, the researchers would have specific search parameters regarding influenza A subtype, host species, and year of isolation.
T84	22135-22403	Epistemic_statement	denotes	If the search interface did not permit all of these search parameters, then the researcher would be left with an arduous task of manually sorting the results that, depending upon the volume of information and computer skills of the user, may be far too time consuming.
T85	22404-22614	Epistemic_statement	denotes	Another aspect of the searching, which is virus-specific, is that for viruses with high sequencing volumes at similar times and locations, such as influenza, there may be many identical genomes in the database.
T86	22840-23106	Epistemic_statement	denotes	Although it would be preferable to have database resources supported long term so that they can respond to users' requests for new queries, etc., it is not necessarily an efficient use of resources to build every feature requested by users into the database interface.
T87	23107-23294	Epistemic_statement	denotes	Clearly a cost-benefit analysis must be performed on the requests for features to be included in the software so that money and effort can be targeted at the most-used database functions.
T88	23295-23477	Epistemic_statement	denotes	However, the system must be able to provide users that have one-off analyses some basic filters to work with, while users must accept that it may off-load some data analysis to them.
T89	23664-23846	Epistemic_statement	denotes	FASTA formatting of nucleotide and protein sequences is a standard because multiple sequences can be incorporated into one file, and they can be read by many bioinformatics programs.
T90	23847-23905	Epistemic_statement	denotes	However, most annotations are stripped out of these files.
T91	23906-24048	Epistemic_statement	denotes	In contrast, GenBank files contain gene annotations, but these files can be tricky for software to read due to non-standard formatting errors.
T92	24049-24223	Epistemic_statement	denotes	The most basic output format is comma-separated values (csv), a tabular output that can often be read by other software, or even spreadsheet programs such as Microsoft Excel.
T93	24224-24295	Epistemic_statement	denotes	Yet, for many researchers data in these formats is very tedious to use.
T94	24296-24461	Epistemic_statement	denotes	Therefore, databases are frequently paired with visualization and analysis tools, or permit the export of data in a format that can be accepted by external programs.
T95	24462-24605	Epistemic_statement	denotes	For example, the Virology.ca database, VOCs, is linked to the Base-By-Base (BBB) visualization tool, which displays and allows editing of MSAs.
T96	24918-25011	Epistemic_statement	denotes	Databases are commonly divided by data type; however, this is not as simple as it might seem.
T97	25012-25163	Epistemic_statement	denotes	For example, a series of databases, each dedicated to a family of viruses, might all need to support many different types of molecular biological data.
T98	25164-25317	Epistemic_statement	denotes	Alternatively, databases managing a particular data type (eg, sequence or virion structure) would need to deal with many taxonomically different viruses.
T99	25378-25581	Epistemic_statement	denotes	Although this article aims to provide an updated overview of the biological databases relevant to viruses, it is important to note that, as a print publication, this resource will quickly become outdated.
T100	25944-26097	Epistemic_statement	denotes	Alternatively, Google may be your friend, but it will help if the name of the database is not easily confused with a variety of other Internet resources.
T101	26098-26280	Epistemic_statement	denotes	The Virus Pathogen Resource (ViPR, pronounced viper) is particularly tricky to find if you don't know it's a Bioinformatics Resource Center and therefore you should look for viprbrc.
T102	26281-26429	Epistemic_statement	denotes	The search is further complicated by the existence of VIPERdb, a separate database of virus capsid structures, and the VIPRE Antivirus software tool.
T103	26430-26560	Epistemic_statement	denotes	Although this review focuses on virus databases, most of these are reliant on other generic databases as the source of their data.
T104	27017-27141	Epistemic_statement	denotes	This provides up-to-date public access to nucleotide sequence data that can be accessed through any of the three interfaces.
T105	27542-27751	Epistemic_statement	denotes	This information helps in the reproducibility of genome assemblies in cases requiring review, and allows the user to analyse whether unexpected results are characteristic of the virus or a systematic artifact.
T106	27752-27957	Epistemic_statement	denotes	However, advances in sequencing technology raise the question of the value of saving raw sequencing data; re-sequencing of samples (given their availability) is fast, easy, cheap and increasingly accurate.
T107	28056-28222	Epistemic_statement	denotes	Therefore, if sequence errors are detected or new genes discovered within a genome by other research groups, it might be impossible to update the original submission.
T108	28223-28407	Epistemic_statement	denotes	To deal with this problem, NCBI has created the Reference Sequences resource (RefSeq), which offers up-to-date reference genomes for taxonomically diverse organisms, including viruses.
T109	28692-28798	Epistemic_statement	denotes	RefSeq files are not limited to nucleotide sequences, but also offer transcript and protein sequence data.
T110	29079-29731	Epistemic_statement	denotes	The database is divided into several branches: UniProtKB, the protein knowledgebase with two subsections, TrEMBL (translated EMBL Nucleotide sequence data library) that stores automatically annotated proteins prior to review and Swiss-Prot, containing proteins that have been manually annotated and reviewed, and often have associated literature; UniParc functions as an archive, sorting new, revised and obsolete sequences with a non-redundant numbering scheme allowing outdated UniProt references from past literature to be traceable; and UniRef100, 90 and 50 branches that cluster proteins into groups of 100%, 90% and 50% aa identity, respectively.
T111	29732-29937	Epistemic_statement	denotes	To ensure up-to-date public access, new protein sequences must be submitted to UniProt prior to publication; the new protein sequences associated with a genome submitted to GenBank are automatically added.
T112	29938-30129	Epistemic_statement	denotes	Knowledge of a protein's 3D structure can assist in the prediction of its functions and interactions, which are important aspects of understanding viral processes and drug and vaccine-design.
T113	30130-30245	Epistemic_statement	denotes	As a result, 3D structures have been determined for many viral proteins, and in some cases, of the complete virion.
T114	30246-30375	Epistemic_statement	denotes	The Protein Data Bank (PDB) collects all biochemical structures, but searches can be limited to the structures of viral proteins.
T115	30749-31325	Epistemic_statement	denotes	These include: VIPERdb, which is maintained at the Scripps Research Institute and is a database for icosahedral virus capsid structures; The Big Picture Book of Viruses, a large catalog of virus pictures with associated information; Virus World, a summary of pictures and links available from PDB and VIPERdb organized by virus name; and Viral Protein Structure Resource (ViPs), a database that aims to provide a central source for all viral protein structures and provides a genome map feature allowing the user to determine which genes have structural information available.
T116	31629-31836	Epistemic_statement	denotes	Several of these will be discussed below; however, as an aside, there are also aspects of the standard databases that provide useful compartmentalization of data to make working with specific viruses easier.
T117	32503-32655	Epistemic_statement	denotes	Understanding these differences can help categorize the various virus-specific databases, and address which types of information can be drawn from each.
T118	32736-32869	Epistemic_statement	denotes	At the most basic level, a website can be used to present a collection of links to sequences in files or more traditional databases.
T119	33005-33155	Epistemic_statement	denotes	Although these resources usually deal with small data sets, they can help offer a more visually appealing and manageable access point for researchers.
T120	33156-33327	Epistemic_statement	denotes	Although the same files can be easily stored on a local desktop computer, there is an important benefit to accessing them from a website: the data should be the same/unaltered.
T121	33328-33506	Epistemic_statement	denotes	Since most researchers tend to collect dozens of versions of sequences, often with various edits, accessing data from a website ensures the sequence is what it is supposed to be.
T122	34030-34165	Epistemic_statement	denotes	For example, when NCBI BLAST searches the sequence databases, it is possible to filter the results by keywords and taxonomy categories.
T123	34783-34980	Epistemic_statement	denotes	Examples of this genome-related data could include: G + C% content of genes and genomes, codon composition of genes, pI of proteins, predicted MW of proteins and amino acid composition of proteins.
T124	35326-35663	Epistemic_statement	denotes	The remainder of this section will compare and contrast three types of virus database resources: the Influenza Virus Resource at NCBI; the Virus Pathogen Resource Bioinformatics Resource Center (ViPRbrc) supported by the NIH, which supports a variety of viruses; and the author's Virology.ca resource, which supports large dsDNA viruses.
T125	35879-36050	Epistemic_statement	denotes	The search interface supports the query of sequences (protein, CDS or nucleotide) by influenza type, genome segment, serotype, collection date, host and country of origin.
T126	36051-36209	Epistemic_statement	denotes	Additional filters can remove sequences that are not full length (of segment), that are not part of a full genome set, or that are identical to another virus.
T127	36210-36346	Epistemic_statement	denotes	In addition, pandemic H1N1 sequences can be included or excluded, as can sequences from lab strains, vaccines or certain other projects.
T128	36347-36508	Epistemic_statement	denotes	Once selected, sequences can be exported to a local computer or aligned with subsequent generation of a phylogenetic tree; the interface is completely web-based.
T129	36794-36952	Epistemic_statement	denotes	ViPRbrc is funded by NIH to support research on viruses on the NIAID Category A-C Priority Pathogen lists, and those causing (re)emerging infectious diseases.
T130	37695-37814	Epistemic_statement	denotes	Although the database selection, retrieval and analysis tools are still web-based, they are comprehensive and advanced.
T131	37815-37936	Epistemic_statement	denotes	Some of the analyses provide graphical output (eg, genome maps), but have limited comparative tools for further analysis.
T132	38105-38287	Epistemic_statement	denotes	This allows the user to store a variety of retrieved sequence sets, eliminating the need to repeat the tedious selection process, which in turn likely increases accuracy of the work.
T133	38515-38664	Epistemic_statement	denotes	Data can also be exported from the database in a variety of formats, including GFF3, which is a tab-delimited format for describing genomic features.
T134	38942-39256	Epistemic_statement	denotes	To this aim, ViPRbrc tools allow BLAST searches of custom (virus-specific) databases and the creation and visualization of MSAs and phylogenetic trees, and one of the more valuable services provided is the tool for Analysis of Sequence Variation, that is SNPs (also amino acid variation) along sequences of an MSA.
T135	39257-39457	Epistemic_statement	denotes	In contrast to ViPRbrc, Virology.ca is dedicated to the support of large dsDNA viruses and as noted previously, this involves the provision of different tools to perform comparative genomics analyses.
T136	39458-39668	Epistemic_statement	denotes	Although the genomes are 20-30 times the size of the RNA viruses, the current poxvirus database at Virology.ca contains fewer than 270 genomes, with the most studied virus species having 30-50 representatives.
T137	39819-40000	Epistemic_statement	denotes	However, only 40 are present in all poxvirus genomes, but this conserved set of core genes increases to about 80 if poxviruses that infect insects are excluded from the calculation.
T138	40001-40096	Epistemic_statement	denotes	The remaining non-essential genes are often associated with host range and virulence functions.
T139	40097-40325	Epistemic_statement	denotes	In the Orthopoxvirus genus, the biggest genomes tend to be found in viruses with the widest host range; it is thought that the restriction of viruses to a limited host range is associated with the loss of genes or gene function.
T140	41763-41843	Epistemic_statement	denotes	The program can also be used to view RNA-Seq data and analyze for recombination.
T141	41844-42184	Epistemic_statement	denotes	BBB offers its own format of data storage, an XML file (.bbb), that allows for additional features such as the storage of user comments, primer annotations, and genome MSAs with gene annotations. JDotter: A program for generating dotplots; suitable for whole genomes, sub-genomes or protein sequences. Genome Annotation Transfer Utility (GATU):
T142	42185-42261	Epistemic_statement	denotes	A tool used to annotate genomes based on a closely related reference genome.
T143	42420-42544	Epistemic_statement	denotes	GATU also suggests novel genes to the human annotator, who has the last word on the annotation process. Sequence Searcher (SSeq):
T144	42545-42645	Epistemic_statement	denotes	An easy-to-use Java tool for searching protein and DNA sequences for user-specified sequence motifs.
T145	42714-43016	Epistemic_statement	denotes	These features are helpful when working with the large DNA viruses because of the uncertainty associated with the annotation of various genes: genes that are predicted to encode small proteins with a very high or very low pI and unusual amino acid composition are likely to represent annotation errors.
T146	43017-43200	Epistemic_statement	denotes	Indeed, many of the features present in Base-By-Base and VGO, the tools that display the genome sequences and genome maps, respectively, are devoted to helping solve annotation issues.
T147	43201-43345	Epistemic_statement	denotes	For the large DNA viruses, accurate annotation is very important because a common investigation asks, Why is virus X more virulent than virus Y?
T148	43550-43662	Epistemic_statement	denotes	So far, we have divided discussion issues by genome type; however, two additional databases should be mentioned.
T149	45150-45279	Epistemic_statement	denotes	This information, defining relationships among viruses, aids in the classification and understanding of newly discovered viruses.
T150	45885-46016	Epistemic_statement	denotes	However, there are 78 families of viruses that not assigned to an order, including one called Unassigned, which contains 14 genera.
T151	46269-46477	Epistemic_statement	denotes	Sequences need to be organized to facilitate the process of similarity searching, for example with BLAST, for related sequences and also organized by functional or source (human, rodent, virus) relationships.
T152	46478-46690	Epistemic_statement	denotes	However, it's neither feasible nor affordable to create independent database resources for every organism, therefore resources like ViPRbrc that can function with a variety of viruses may become more commonplace.
T153	46691-46862	Epistemic_statement	denotes	With respect to genome sequencing, one of the greatest problems lies with the huge volume of raw data (sequencing reads) that is associated with any final genome sequence.
T154	46863-47098	Epistemic_statement	denotes	The authors recently received 3 GB of compressed sequencing data, a mix of host and virus sequences, to assemble a 150 kb poxvirus genome; it is estimated that genomic data will soon become the world's largest consumer of disk storage.
T155	47456-47646	Epistemic_statement	denotes	Perhaps, in the not too distant future, it will not be unusual to have our own genome sequenced, as well as our various organ microbiomes and the genomes of any pathogens we are infected by.
T156	47647-48053	Epistemic_statement	denotes	To deal with the benefits of the new sequencing technology's ability to generate massive sequence coverage, which helps to reduce errors in the initial sequencing process and reveals the natural variation among the genomes of a virus population, we must also develop new annotation strategies such as how to annotate SNPs in virus genomes and gene fragments in the non-essential gene sets of large viruses.
T157	48054-48164	Epistemic_statement	denotes	The goal of this type of annotation is to provide the virologist with more information in comparative studies.
T158	48299-48443	Epistemic_statement	denotes	The questioner is really asking about the presence of functional genes, but would likely be interested in knowing the mechanism of the deletion.
T159	48444-48592	Epistemic_statement	denotes	For example, whether the gene was inactivated by a series of deletions or a single nucleotide change that could be the result of a sequencing error.
T160	48593-48910	Epistemic_statement	denotes	Not only is the number of genomes requiring annotation increasing exponentially, but the nature of the data is also changing with metagenomics' ability to determine all of the nucleotide sequences from a crude environmental sample without the need to culture microbes in the laboratory or probe for a particular gene.
T161	48911-49126	Epistemic_statement	denotes	Since many organisms fail to grow in the laboratory, we cannot grow viruses that are unique to them; thus, metagenomics provides a valuable window into the previously unknown diversity of viruses in the environment.
T162	49127-49264	Epistemic_statement	denotes	Even with large-scale environmental sequencing under way, most metagenomic sequences appear to be unrelated to currently known sequences.
T163	49365-49452	Epistemic_statement	denotes	However, metagenomics also comes with its own list of special problems and limitations.
T164	49549-49675	Epistemic_statement	denotes	Genomic assembly can be difficult because the metagenomic short reads often provide less coverage than traditional sequencing.
T165	49881-50027	Epistemic_statement	denotes	Although assembly against a reference genome is helpful, this approach can't be used with environmental samples full of unknown microbial species.
T166	50168-50312	Epistemic_statement	denotes	Furthermore, metadata is often left out of annotations, yet it is required for the data to be put into a useful, comparable, biological context.
T167	50483-50606	Epistemic_statement	denotes	Thus, it is crucial that annotation of metadata be included as a mandatory and standardized component of genome submission.
T168	50607-50913	Epistemic_statement	denotes	With the accumulation of more and more diverse sequences from metagenomic sequencing of environmental samples, the organization of gene/protein sequences into Clusters of Orthologous groups (COGs; http://www.ncbi.nlm.nih.gov/COG/) and Viral COGs (VOGs) becomes especially useful for defining relationships.
T169	50914-51260	Epistemic_statement	denotes	Even when matches between orthologs may be at the limit of detection, the greater the number of sequences in a cluster, the greater the chance of making new matches by virtue of transitive sequence comparisons, that is SeqA matches SeqB, but not SeqC; if SeqB also matches SeqC then a transitive relationship between SeqA and SeqC is established.
T170	51678-52274	Epistemic_statement	denotes	Several features of this updated version may help increase the knowledge and use of ortholog groups by the virology community: the new web interface is simplified for less experienced users and offers new browsing and visualization capabilities; the new version has improved consistency and impact through incorporation and linking of Gene Ontology terms, KEGG pathways and UniProt and SMART/Pfam domains to ortholog group assignments; and the algorithms and pipelines for assignment are made freely available on the website, encouraging individuals to apply the techniques to their own datasets.
T171	52275-52582	Epistemic_statement	denotes	In recent years, the growth in both the volume and types of data that can be considered bioinformatic in nature has forced the scientific community to consider how much of a role that the manual forms of genome annotation, curation and maintenance can continue to play in assigning knowledge to new genomes.
T172	52583-52771	Epistemic_statement	denotes	A compromise may be found in the maintenance of manually curated reference genomes, and development of programs to aid in increasing the accuracy of automated annotations of new sequences.
T173	52772-52997	Epistemic_statement	denotes	Finally, the question remains whether funding will be made available for tailoring databases to the needs of virology researchers working on the wide array of bioinformatics challenges that these new data sets are generating.

T1

Epistemic_statement

denotes

Discussions will be illustrated with a limited number of virus databases, including our own -the Viral Orthologous Clusters (VOCs) database, which forms the core of Virology.ca that supports researchers working with a variety of large dsDNA viruses.

T2

463-730

Epistemic_statement

denotes

Depending where you are in the world, GenBank, (maintained by the National Center for Biotechnology Information, NCBI, USA), the EMBL Data Library (maintained in the UK) or the DNA Databank of Japan may first come to mind when thinking about bioinformatics databases.

T3

731-854

Epistemic_statement

denotes

However, these large repositories of genomic, and other, sequence information have a relatively simple flat-file structure.

T4

855-1034

Epistemic_statement

denotes

When submitting a viral genome sequence to one these databases, the sequence data must be annotated with various information including the positions of coding DNA sequences (CDS).

T5

1035-1133

Epistemic_statement

denotes

Flat-file databases can only be queried for information that is explicitly stated within the file.

T6

1134-1255

Epistemic_statement

denotes

For example, queries can be made by keywords in the protein name, but not derivative characteristics such as gene length.

T7

1659-1816

Epistemic_statement

denotes

When building such a relational database, the designer must define the structure of all of the data records to be stored and their association to each other.

T8

1817-2016

Epistemic_statement

denotes

The database schema illustrates these relationships and acts as a map to the organization of the data storage locations, that is specific containers (dataset tables) for storing each type of dataset.

T9

2320-2441

Epistemic_statement

denotes

Various file formats aim to capture this viral sequence data and associated knowledge, including GenBank and XML formats.

T10

2553-2651

Epistemic_statement

denotes

Similarly, these files can be worked with using standard text editors or word-processing programs.

T11

2652-2842

Epistemic_statement

denotes

A downside is that minor changes to the format of GenBank files can cause software tools that read these files (parsers) to fail if they have not been programmed to handle these differences.

T12

3153-3307

Epistemic_statement

denotes

Although XML files are excellent for representing relationships in a form that can be read by a computer, they are more difficult for users to understand.

T13

3596-3823

Epistemic_statement

denotes

However, if a database is to be useful to the virology research community, then there must be at least one additional component in this system -an easy to use interface to enable a virologist to interact with the database, i.e.

T14

3903-3974

Epistemic_statement

denotes

Furthermore, the database itself may form part of an analysis pipeline.

T15

4145-4386

Epistemic_statement

denotes

Rather than the database returning a file of the sequences to be formatted before submission to a program such as ClustalO or MUSCLE for alignment, it might be desirable to have the database send the sequences directly to the alignment tool.

T16

4387-4535

Epistemic_statement

denotes

The tool can then run the desired function before returning the processed data within a visualization tool that simplifies editing of the final MSA.

T17

4536-4690

Epistemic_statement

denotes

Although simplification of this process with default parameters is clearly valuable to the user, over-simplification can lead to restricted function, i.e.

T18

4739-4966

Epistemic_statement

denotes

Therefore, it is valuable to provide a variety of export options allowing users to obtain formats that can be used directly with multiple analysis tools and permit full range of parameter access, as discussed in later sections.

T19

4967-5099

Epistemic_statement

denotes

Design of the user interface with potential queries and flexibility of data retrieval in mind is key to the utility of the database.

T20

5238-5461

Epistemic_statement

denotes

Generally, the databases are optimized for two main functionalities: 1) storing and updating structured data with high integrity and 2) providing tools for searching, retrieving, summarizing and possibly analyzing the data.

T21

5753-5863

Epistemic_statement

denotes

The relationships between these tables are explicitly defined and are not limited to one-to-one relationships.

T22

6010-6125

Epistemic_statement

denotes

An additional aspect of any discussion on database design is maintenance, which can be viewed from several aspects.

T23

6481-6620

Epistemic_statement

denotes

A second view of maintenance, which is more integral to database design, concerns the data: 1) how will data be imported into the database?

T24

6621-6672

Epistemic_statement

denotes

2) will data need to be edited (updated/corrected)?

T25

6673-6757

Epistemic_statement

denotes

Again, these questions have a natural division along the lines of virus genome type.

T26

6758-7003

Epistemic_statement

denotes

For example, some form of automated process is essential to import the thousands of genomes required for an influenza or HCV (small genomes, fixed gene complement) resource, but these small sequences are unlikely to need subsequent modification.

T27

7004-7269

Epistemic_statement

denotes

In contrast, a resource such as Virology.ca, which deals with large poxviruses, needs to contend with far less genomes, but must handle changes to genome annotation both before and after import into the database due to large, less well characterized, dsDNA genomes.

T28

7270-7502

Epistemic_statement

denotes

Therefore, while a software script to automatically collect new genome GenBank files and insert them into a database might be feasible for an influenza virus database, this process cannot be used for poxvirus genomes in Virology.ca.

T29

7503-7606

Epistemic_statement

denotes

The complexity of the latter frequently leads to variations (sometimes errors) in annotation protocols.

T30

7607-7715

Epistemic_statement

denotes

Furthermore, new discoveries surrounding gene function mean that old genomes must sometimes be re-annotated.

T31

7716-7907

Epistemic_statement

denotes

Therefore, key to the design of database systems like Virology.ca is the inclusion of a manual, user-friendly database-editing tool that will allow virologists to enter and maintain the data.

T32

7908-8022

Epistemic_statement

denotes

Another frequent need for database editing in Virology.ca results from the assignment of genes to ortholog groups.

T33

8353-8595

Epistemic_statement

denotes

Although orthology is simply the prediction of a common ancestor between genes of different species, the extreme diversity among viruses in a single taxonomic family creates difficulties in the accurate assignment of genes to ortholog groups.

T34

9098-9203

Epistemic_statement

denotes

Unfortunately, this work is sometimes viewed as a non-discovery based science and funding is problematic.

T35

9380-9561

Epistemic_statement

denotes

Although many bioinformatics databases may be in some way relevant to viruses, the majority of virologists tend to think of, and also use most frequently, genome sequence databases.

T36

9562-9668

Epistemic_statement

denotes

As a result, it would be hard to find a virologist who hasn't performed a BLAST search of these databases.

T37

9669-9892

Epistemic_statement

denotes

Similarly, any virologist who has partially or completely sequenced a viral genome is familiar with the process of annotating the sequence for submission to GenBank (or local equivalent) before their paper can be published.

T38

9893-10189

Epistemic_statement

denotes

Three levels of annotated information exist: basic information within a GenBank file that can be directly transferred into databases; values calculated or predicted from the basic data; and curated information, beyond that provided when first developing the GenBank file, added manually by users.

T39

10190-10429

Epistemic_statement

denotes

The GenBank sequence file usually contains basic virus identification information, affiliations of the annotator, CDS locations within the genome, metadata, gene and product type/names, as well as mRNA splicing information, if appropriate.

T40

10549-10747

Epistemic_statement

denotes

However, the sequence data itself can be used to calculate various values that might be required by the end-user such as A + T content, protein pI, molecular mass of proteins and amino acid content.

T41

10748-10928

Epistemic_statement

denotes

Additionally, the data may be used as input for predictive tools, such as those that look for functional motifs that are associated with enzymatic function or protein localization.

T42

10929-10985

Epistemic_statement

denotes

Manual curation can also provide additional information.

T43

10986-11207

Epistemic_statement

denotes

For example, there is generally no information describing the likely reliability of the annotations or the source of the information (experimental data, software prediction or scientific literature) within a GenBank file.

T44

11277-11431

Epistemic_statement

denotes

Unfortunately, it is rare that this kind of supplementary information is incorporated into databases in a manner that can be searched in a meaningful way.

T45

11432-11610

Epistemic_statement

denotes

Keeping record of the reliability of annotations is especially important when investigating the genes of unknown function, frequently found within the varied large dsDNA viruses.

T46

11611-11729

Epistemic_statement

denotes

Functional assignment of the associated proteins relies heavily on the transfer of function from orthologous proteins.

T47

11730-11951

Epistemic_statement

denotes

Orthologous proteins, which may only be 20-30% identical, often do not match the same set of functional motifs, and vary in their ability to match distantly-related proteins that might give important clues as to function.

T48

12074-12175

Epistemic_statement

denotes

As noted above, accurate genome annotation is critical to providing a useful bioinformatics database.

T49

12176-12278

Epistemic_statement

denotes

However, this must be built upon a foundation of accurate, and preferably complete, genome sequencing.

T50

12279-12373

Epistemic_statement

denotes

The complexity of the annotation problem increases exponentially with genome size and novelty.

T51

12558-12968

Epistemic_statement

denotes

Large viruses, such as these, bring two additional complications to the annotation problem: 1) only a subset of the viral genes are essential, so some orthologs may be functional in one species but fragmented in another and, 2) some of the more diverse viruses have > 100 genes currently annotated as "hypothetical", and even the DNA polymerase protein may share only 32% aa identity with its closest relative.

T52

13125-13370

Epistemic_statement

denotes

Without the use of either a closely related genome, or multiple more distant relatives with similar gene sets, for reference genomes in annotation, it is very difficult to predict which ORFs are: true genes, gene fragments, or random small ORFs.

T53

13486-13623

Epistemic_statement

denotes

Although some groups share common physical virion structures, often only a few genes are found in common between related groups of phage.

T54

13783-13870

Epistemic_statement

denotes

Thus, the complexity of annotation relates directly to viral genome size and diversity.

T55

13993-14121

Epistemic_statement

denotes

Additionally, this process can often be fully automated as part of a pipeline to import newly sequenced genomes into a database.

T56

14279-14542

Epistemic_statement

denotes

The IRD pipeline aligns input nucleotide sequences against a consensus sequence profile to "identify possible sequencing errors, determine the influenza type, segment number, and for segments 4 and 6 of type A, the subtype, and translate the nucleotide sequence".

T57

15226-15430

Epistemic_statement

denotes

Both partial matches and potential novel genes are flagged in the GATU interface for the annotator to investigate further and make a final decision; tools are included in GATU to make this process easier.

T58

15550-15722

Epistemic_statement

denotes

The majority of these tools depend upon similarity searches against previously characterized genes, which can result in underannotation of unique or highly divergent genes.

T59

15723-15832

Epistemic_statement

denotes

For this reason, the use of a general list of genes may be more useful than using a specific reference virus.

T60

15833-16067

Epistemic_statement

denotes

Some annotation tools, especially for prokaryotic and eukaryotic genomes, attempt gene prediction -searching for promoter-like sequences and other gene characteristics, however, these are less useful for the annotation of viral genes.

T61

16259-16345

Epistemic_statement

denotes

This is invaluable in the annotation of large and complex viruses, such as poxviruses.

T62

16682-16795

Epistemic_statement

denotes

Although this change might alter a protein's function, most annotators would probably annotate this altered gene.

T63

16938-17218

Epistemic_statement

denotes

In this example the greatly shortened protein is very unlikely to be functional, however, because the second part of the gene is still present and represents 70% of the original gene this is sometimes annotated (Cii) even though it is very unlikely to be translated from the mRNA.

T64

17349-17444

Epistemic_statement

denotes

Although the protein product is not changed, the larger ORF is sometimes erroneously annotated.

T65

17445-17595

Epistemic_statement

denotes

These annotation errors illustrate the need for more complex automated evaluation of annotation results or the continued input of the human annotator.

T66

17596-17771

Epistemic_statement

denotes

Although the various databases and associated sequence searching/analysis tools comprise a very valuable set of resources, it is also important to recognize their limitations.

T67

17772-17931

Epistemic_statement

denotes

Researchers need to remain wary of the various types of information because regardless of the source, errors and out-of-date files can fall through the cracks.

T68

18086-18220

Epistemic_statement

denotes

If available, reference genomes can be very valuable and be used to predict/correct potential errors either automatically or manually.

T69

18424-18599

Epistemic_statement

denotes

However, many of the genes present in the large DNA viruses such as poxviruses are non-essential; they can be, and often are, truncated in some viruses and complete in others.

T70

18600-18697

Epistemic_statement

denotes

Thus, researchers must remain aware of potential errors, regardless of the source or genome type.

T71

18941-19059

Epistemic_statement

denotes

• Errors of ignorance -a simple lack of knowledge: ortholog/paralog relationships not understood or viruses mis-named.

T72

19295-19386

Epistemic_statement

denotes

Another limitation of most databases is the absence of evidence for particular annotations.

T73

19497-19637

Epistemic_statement

denotes

However, even when evidence notes are present, there may be problems since experiments may eventually be discredited and/or require updates.

T74

19638-19758

Epistemic_statement

denotes

Curating such a system, essentially with the annotations as a living document, would also be incredibly labor intensive.

T75

19759-19965

Epistemic_statement

denotes

The lack of an unambiguous and controlled naming standard that is carried across all viruses and databases results in variable descriptions and data that are difficult to query for specific characteristics.

T76

19966-20139

Epistemic_statement

denotes

For example, software may not be programmed to recognise synonymous terms such as "ssDNA," "single strand DNA," "single-stranded DNA," and "single stranded DNA" as the same.

T77

20140-20288

Epistemic_statement

denotes

Where possible, import systems should have locks on permitted words, based on a universal, controlled vocabulary, in order to reduce this ambiguity.

T78

20668-20877

Epistemic_statement

denotes

In addition, although some terms will still be applicable to only particular viruses (eg, the phages, see viral head-tail joining; GO:0098005), GO viral terms are designed to be species-neutral where possible.

T79

20878-21064

Epistemic_statement

denotes

However, complete implementation of a consistent and accurate vocabulary, such as that presented by GO, will only work if scientists choose to sustain it through their own participation.

T80

21220-21361

Epistemic_statement

denotes

Although the quality of raw data, annotations, and database structure is clearly important, a database is only as useful as its search tools.

T81

21362-21536

Epistemic_statement

denotes

The database must be able to execute the questions posed by virologists, which for user convenience is often accomplished through the use of a Graphical User Interface (GUI).

T82

21537-21743

Epistemic_statement

denotes

Although it is impossible to predict all of the queries that might be requested, the system should be flexible enough to provide a reasonably close search that may require minor post-search data processing.

T83

21878-22134

Epistemic_statement

denotes

For example, in a study examining the genomic variation of H1N1 influenza viruses obtained from humans between the years of 2000-2010, the researchers would have specific search parameters regarding influenza A subtype, host species, and year of isolation.

T84

22135-22403

Epistemic_statement

denotes

If the search interface did not permit all of these search parameters, then the researcher would be left with an arduous task of manually sorting the results that, depending upon the volume of information and computer skills of the user, may be far too time consuming.

T85

22404-22614

Epistemic_statement

denotes

Another aspect of the searching, which is virus-specific, is that for viruses with high sequencing volumes at similar times and locations, such as influenza, there may be many identical genomes in the database.

T86

22840-23106

Epistemic_statement

denotes

Although it would be preferable to have database resources supported long term so that they can respond to users requests for new queries etc., it is not necessarily an efficient use of resources to build every feature requested by users into the database interface.

T87

23107-23294

Epistemic_statement

denotes

Clearly a cost-benefit analysis must be performed on the requests for features to be included in the software so that money and effort can be targeted at the most-used database functions.

T88

23295-23477

Epistemic_statement

denotes

However, the system must be able to provide users that have one-off analyses some basic filters to work with, while users must accept that it may off-load some data analysis to them.

T89

23664-23846

Epistemic_statement

denotes

FASTA formatting of nucleotide and protein sequences is a standard because multiple sequences can be incorporated into one file, and they can be read by many bioinformatics programs.

T90

23847-23905

Epistemic_statement

denotes

However, most annotations are stripped out of these files.

T91

23906-24048

Epistemic_statement

denotes

In contrast, GenBank files contain gene annotations, but these files can be tricky for software to read due to non-standard formatting errors.

T92

24049-24223

Epistemic_statement

denotes

The most basic output format is comma-separated values (csv), a tabular output that can often be read by other software, or even spreadsheet programs such as Microsoft Excel.

T93

24224-24295

Epistemic_statement

denotes

Yet, for many researchers data in these formats is very tedious to use.

T94

24296-24461

Epistemic_statement

denotes

Therefore, databases are frequently paired with visualization and analysis tools, or permit the export of data in a format that can be accepted by external programs.

T95

24462-24605

Epistemic_statement

denotes

For example, the Virology.ca database, VOCs, is linked to the Base-By-Base (BBB) visualization tool, which displays and allows editing of MSAs.

T96

24918-25011

Epistemic_statement

denotes

Databases are commonly divided by data type; however, this is not as simple as it might seem.

T97

25012-25163

Epistemic_statement

denotes

For example, a series of databases, each dedicated to a family of viruses, might all need to support many different types of molecular biological data.

T98

25164-25317

Epistemic_statement

denotes

Alternatively, databases managing a particular data type (eg, sequence or virion structure) would need to deal with many taxonomically different viruses.

T99

25378-25581

Epistemic_statement

denotes

Although this article aims to provide an updated overview of the biological databases relevant to viruses, as a print publication it is important to note that this resource will become quickly out-dated.

T100

25944-26097

Epistemic_statement

denotes

Alternatively, Google may be your friend, but it will help if the name of the database is not easily confused with a variety of other Internet resources.

T101

26098-26280

Epistemic_statement

denotes

The Virus Pathogen Resource (ViPR, pronounced viper) is particularly tricky to find if you don't know it's a Bioinformatics Resource Center and therefore you should look for viprbrc.

T102

26281-26429

Epistemic_statement

denotes

The search is further complicated by the existence of VIPERdb, a separate database of virus capsid structures and the VIPRE Antivirus software tool.

T103

26430-26560

Epistemic_statement

denotes

Although this review focuses on virus databases, most of these are reliant on other generic databases as the source of their data.

T104

27017-27141

Epistemic_statement

denotes

This provides up-to-date public access to nucleotide sequence data that can be accessed through any of the three interfaces.

T105

27542-27751

Epistemic_statement

denotes

This information helps in the reproducibility of genome assemblies in cases requiring review, and allows the user to analyse whether unexpected results are characteristic of the virus or a systematic artifact.

T106

27752-27957

Epistemic_statement

denotes

However, advances in sequencing technology raise the question of the value of saving raw sequencing data; re-sequencing of samples (given their availability) is fast, easy, cheap and increasingly accurate.

T107

28056-28222

Epistemic_statement

denotes

Therefore, if sequence errors are detected or new genes discovered within a genome by other research groups, it might be impossible to update the original submission.

T108

28223-28407

Epistemic_statement

denotes

To deal with this problem, NCBI has created the Reference Sequences resource (RefSeq), which offers up-to-date reference genomes for taxonomically diverse organisms, including viruses.

T109

28692-28798

Epistemic_statement

denotes

RefSeq files are not limited to nucleotide sequences, but also offer transcript and protein sequence data.

T110

29079-29731

Epistemic_statement

denotes

The database is divided into several branches: UniProtKB, the protein knowledgebase with two subsections, TrEMBL (translated EMBL Nucleotide sequence data library) that stores automatically annotated proteins prior to review and Swiss-Prot, containing proteins that have been manually annotated and reviewed, and often have associated literature; UniParc functions as an archive, sorting new, revised and obsolete sequences with a non-redundant numbering scheme allowing outdated UniProt references from past literature to be traceable; and UniRef100, 90 and 50 branches that cluster proteins into groups of 100%, 90% and 50% aa identity, respectively.

T111

29732-29937

Epistemic_statement

denotes

To ensure up-to-date public access, new protein sequences must be submitted to UniProt prior to publication, the new protein sequences associated with a genome submitted to GenBank are automatically added.

T112

29938-30129

Epistemic_statement

denotes

Knowledge of a protein's 3D structure can assist in the prediction of its functions and interactions, which are important aspects of understanding viral processes and drug and vaccine-design.

T113

30130-30245

Epistemic_statement

denotes

As a result, 3D structures have been determined for many viral proteins, and in some cases, of the complete virion.

T114

30246-30375

Epistemic_statement

denotes

The Protein Data Bank (PDB) collects all biochemical structures, but searches can be limited to the structures of viral proteins.

T115

30749-31325

Epistemic_statement

denotes

These include: VIPERdb, which is maintained at the Scripps Research Institute and is a database for icosahedral virus capsid structures; The Big Picture Book of Viruses, a large catalog of virus pictures with associated information; Virus World, a summary of pictures and links available from PDB and VIPERdb organized by virus name; and Viral Protein Structure Resource (ViPs), a database that aims to provide a central source for all viral protein structures and provides a genome map feature allowing the user to determine which genes have structural information available.

T116

31629-31836

Epistemic_statement

denotes

Several of these will be discussed below; however, as an aside, there are also aspects of the standard databases that provide useful compartmentalization of data to make working with specific viruses easier.

T117

32503-32655

Epistemic_statement

denotes

Understanding these differences can help categorize the various virus-specific databases, and address which types of information can be drawn from each.

T118

32736-32869

Epistemic_statement

denotes

At the most basic level, a website can be used to present a collection of links to sequences in files or more traditional databases.

T119

33005-33155

Epistemic_statement

denotes

Although these resources usually deal with small data sets, they can help offer a more visually appealing and manageable access point for researchers.

T120

33156-33327

Epistemic_statement

denotes

Although the same files can easily be stored on a local desktop computer, there is an important benefit to accessing them from a website: the data should be unaltered.

T121

33328-33506

Epistemic_statement

denotes

Since most researchers tend to collect dozens of versions of sequences, often with various edits, accessing data from a website ensures the sequence is what it is supposed to be.

T122

34030-34165

Epistemic_statement

denotes

For example, when NCBI BLAST searches the sequence databases, it is possible to filter the results by keywords and taxonomy categories.

T123

34783-34980

Epistemic_statement

denotes

Examples of this genome-related data could include: G + C% content of genes and genomes, codon composition of genes, pI of proteins, predicted MW of proteins and amino acid composition of proteins.
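The derived values listed in this cue can be computed directly from sequence strings. The following is a minimal sketch, assuming plain uppercase or lowercase nucleotide and protein strings; the function names `gc_percent` and `aa_composition` are hypothetical and not part of any database described here.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Return G + C content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Ambiguity codes such as N are ignored; an empty sequence yields 0.0.
    return 100.0 * gc / total if total else 0.0


def aa_composition(protein: str) -> dict:
    """Return the fraction of each amino acid letter in a protein sequence."""
    counts = Counter(protein.upper())
    n = sum(counts.values())
    return {aa: c / n for aa, c in counts.items()} if n else {}
```

Storing such values alongside each gene record is what allows a relational database to be queried on them, rather than recomputing them for every search.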

T124

35326-35663

Epistemic_statement

denotes

The remainder of this section will compare and contrast three types of virus database resources: the Influenza Virus Resource at NCBI; the Virus Pathogen Resource Bioinformatics Resource Center (ViPRbrc) supported by the NIH, which supports a variety of viruses; and the author's Virology.ca resource, which supports large dsDNA viruses.

T125

35879-36050

Epistemic_statement

denotes

The search interface supports the query of sequences (protein, CDS or nucleotide) by influenza type, genome segment, serotype, collection date, host and country of origin.

T126

36051-36209

Epistemic_statement

denotes

Additional filters can remove sequences that are not full length (of segment), that are not part of a full genome set, or that are identical to another virus.

T127

36210-36346

Epistemic_statement

denotes

In addition, pandemic H1N1 sequences can be included or excluded, as can sequences from lab strains, vaccines or certain other projects.

T128

36347-36508

Epistemic_statement

denotes

Once selected, sequences can be exported to a local computer or aligned with subsequent generation of a phylogenetic tree; the interface is completely web-based.

T129

36794-36952

Epistemic_statement

denotes

ViPRbrc is funded by NIH to support research on viruses on the NIAID Category A-C Priority Pathogen lists, and those causing (re)emerging infectious diseases.

T130

37695-37814

Epistemic_statement

denotes

Although the database selection, retrieval and analysis tools are still web-based, they are comprehensive and advanced.

T131

37815-37936

Epistemic_statement

denotes

Some of the analyses provide graphical output (e.g., genome maps), but offer limited comparative tools for further analysis.

T132

38105-38287

Epistemic_statement

denotes

This allows the user to store a variety of retrieved sequence sets, eliminating the need to repeat the tedious selection process, which in turn likely increases accuracy of the work.

T133

38515-38664

Epistemic_statement

denotes

Data can also be exported from the database in a variety of formats, including GFF3, which is a tab-delimited format for describing genomic features.
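As a rough illustration of the tab-delimited layout mentioned in this cue, the sketch below parses a single GFF3 feature line into its nine standard columns. The example line, the class name `Gff3Feature` and the function name `parse_gff3_line` are assumptions made for illustration, not output from ViPRbrc.

```python
from typing import Dict, NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one GFF3 feature line (nine tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    # Column 9 holds semicolon-separated tag=value pairs.
    attributes = dict(
        item.split("=", 1) for item in fields[8].split(";") if "=" in item
    )
    return Gff3Feature(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7],
        attributes,
    )


# Hypothetical feature line, for illustration only.
example = "genome1\texample\tCDS\t100\t399\t.\t+\t0\tID=cds1;Name=hypothetical"
feature = parse_gff3_line(example)
```

Because the coordinates are inclusive, the feature length is `feature.end - feature.start + 1`.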

T134

38942-39256

Epistemic_statement

denotes

To this end, ViPRbrc tools allow BLAST searches of custom (virus-specific) databases and the creation and visualization of MSAs and phylogenetic trees. One of the more valuable services provided is the tool for Analysis of Sequence Variation, that is, SNPs (and amino acid variation) along the sequences of an MSA.

T135

39257-39457

Epistemic_statement

denotes

In contrast to ViPRbrc, Virology.ca is dedicated to the support of large dsDNA viruses and as noted previously, this involves the provision of different tools to perform comparative genomics analyses.

T136

39458-39668

Epistemic_statement

denotes

Although the genomes are 20-30 times the size of those of the RNA viruses, the current poxvirus database at Virology.ca contains fewer than 270 genomes, with the most studied virus species having 30-50 representatives.

T137

39819-40000

Epistemic_statement

denotes

However, only 40 are present in all poxvirus genomes; this conserved set of core genes increases to about 80 if poxviruses that infect insects are excluded from the calculation.

T138

40001-40096

Epistemic_statement

denotes

The remaining non-essential genes are often associated with host range and virulence functions.

T139

40097-40325

Epistemic_statement

denotes

In the Orthopoxvirus genus, the biggest genomes tend to be found in viruses with the widest host range; it is thought that the restriction of viruses to a limited host range is associated with the loss of genes or gene function.

T140

41763-41843

Epistemic_statement

denotes

The program can also be used to view RNA-Seq data and analyze for recombination.

T141

41844-42184

Epistemic_statement

denotes

BBB offers its own format of data storage, an XML file (.bbb), that allows for additional features such as the storage of user comments, primer annotations, and genome MSAs with gene annotations. JDotter: A program for generating dotplots; suitable for whole genomes, sub-genomes or protein sequences. Genome Annotation Transfer Utility (GATU):

T142

42185-42261

Epistemic_statement

denotes

A tool used to annotate genomes based on a closely related reference genome.

T143

42420-42544

Epistemic_statement

denotes

GATU also suggests novel genes to the human annotator, who has the last word on the annotation process. Sequence Searcher (SSeq):

T144

42545-42645

Epistemic_statement

denotes

An easy-to-use Java tool for searching protein and DNA sequences for user-specified sequence motifs.

T145

42714-43016

Epistemic_statement

denotes

These features are helpful when working with the large DNA viruses because of the uncertainty associated with the annotation of various genes: genes that are predicted to encode small proteins with a very high or very low pI and unusual amino acid composition are likely to represent annotation errors.

T146

43017-43200

Epistemic_statement

denotes

Indeed, many of the features present in Base-By-Base and VGO, the tools that display the genome sequences and genome maps, respectively, are devoted to helping solve annotation issues.

T147

43201-43345

Epistemic_statement

denotes

For the large DNA viruses, accurate annotation is very important because a common investigation asks, "Why is virus X more virulent than virus Y?"

T148

43550-43662

Epistemic_statement

denotes

So far, we have divided discussion issues by genome type; however, two additional databases should be mentioned.

T149

45150-45279

Epistemic_statement

denotes

This information, defining relationships among viruses, aids in the classification and understanding of newly discovered viruses.

T150

45885-46016

Epistemic_statement

denotes

However, there are 78 families of viruses that are not assigned to an order, including one called Unassigned, which contains 14 genera.

T151

46269-46477

Epistemic_statement

denotes

Sequences need to be organized to facilitate the process of similarity searching, for example with BLAST, for related sequences and also organized by functional or source (human, rodent, virus) relationships.

T152

46478-46690

Epistemic_statement

denotes

However, it is neither feasible nor affordable to create independent database resources for every organism; therefore, resources like ViPRbrc that can function with a variety of viruses may become more commonplace.

T153

46691-46862

Epistemic_statement

denotes

With respect to genome sequencing, one of the greatest problems lies with the huge volume of raw data (sequencing reads) that is associated with any final genome sequence.

T154

46863-47098

Epistemic_statement

denotes

The authors recently received 3 GB of compressed sequencing data, a mix of host and virus sequences, to assemble a 150 kb poxvirus genome; it is estimated that genomic data will soon become the world's largest consumer of disk storage.

T155

47456-47646

Epistemic_statement

denotes

Perhaps, in the not too distant future, it will not be unusual to have our own genome sequenced, as well as our various organ microbiomes and the genomes of any pathogens we are infected by.

T156

47647-48053

Epistemic_statement

denotes

To take full advantage of the new sequencing technology's ability to generate massive sequence coverage, which helps to reduce errors in the initial sequencing process and reveals the natural variation among the genomes of a virus population, we must also develop new annotation strategies, such as how to annotate SNPs in virus genomes and gene fragments in the non-essential gene sets of large viruses.

T157

48054-48164

Epistemic_statement

denotes

The goal of this type of annotation is to provide the virologist with more information in comparative studies.

T158

48299-48443

Epistemic_statement

denotes

The questioner is really asking about the presence of functional genes, but would likely be interested in knowing the mechanism of the deletion.

T159

48444-48592

Epistemic_statement

denotes

For example, whether the gene was inactivated by a series of deletions or a single nucleotide change that could be the result of a sequencing error.

T160

48593-48910

Epistemic_statement

denotes

Not only is the number of genomes requiring annotation increasing exponentially, but the nature of the data is also changing with metagenomics' ability to determine all of the nucleotide sequences from a crude environmental sample without the need to culture microbes in the laboratory or probe for a particular gene.

T161

48911-49126

Epistemic_statement

denotes

Since many organisms fail to grow in the laboratory, we cannot grow viruses that are unique to them; thus, metagenomics provides a valuable window into the previously unknown diversity of viruses in the environment.

T162

49127-49264

Epistemic_statement

denotes

Even with large-scale environmental sequencing under way, most metagenomic sequences appear to be unrelated to currently known sequences.

T163

49365-49452

Epistemic_statement

denotes

However, metagenomics also comes with its own list of special problems and limitations.

T164

49549-49675

Epistemic_statement

denotes

Genomic assembly can be difficult because the metagenomic short reads often provide less coverage than traditional sequencing.

T165

49881-50027

Epistemic_statement

denotes

Although assembly against a reference genome is helpful, this approach can't be used with environmental samples full of unknown microbial species.

T166

50168-50312

Epistemic_statement

denotes

Furthermore, metadata is often left out of annotations, yet it is required for the data to be put into a useful, comparable, biological context.

T167

50483-50606

Epistemic_statement

denotes

Thus, it is crucial that annotation of metadata be included as a mandatory and standardized component of genome submission.

T168

50607-50913

Epistemic_statement

denotes

With the accumulation of more and more diverse sequences from metagenomic sequencing of environmental samples, the organization of gene/protein sequences into Clusters of Orthologous Groups (COGs; http://www.ncbi.nlm.nih.gov/COG/) and Viral COGs (VOGs) becomes especially useful for defining relationships.

T169

50914-51260

Epistemic_statement

denotes

Even when matches between orthologs may be at the limit of detection, the greater the number of sequences in a cluster, the greater the chance of making new matches by virtue of transitive sequence comparisons; that is, if SeqA matches SeqB but not SeqC, and SeqB also matches SeqC, then a transitive relationship between SeqA and SeqC is established.
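The transitive relationship described in this cue can be sketched as a grouping of sequences linked by pairwise matches. The example below is a minimal illustration using a union-find structure; the function name `cluster_by_matches` is hypothetical, the list of matching pairs is assumed to come from an earlier similarity search, and real ortholog-group pipelines apply stricter criteria than this simple linkage.

```python
def cluster_by_matches(pairs):
    """Group sequence IDs into clusters connected by pairwise matches."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for seq_id in list(parent):
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return list(clusters.values())


# SeqA matches SeqB and SeqB matches SeqC, so SeqA and SeqC share a cluster
# even though they were never matched directly.
clusters = cluster_by_matches(
    [("SeqA", "SeqB"), ("SeqB", "SeqC"), ("SeqD", "SeqE")]
)
```

Linking of this kind can also join unrelated sequences through a single spurious match, which is one reason curated clusters are preferable to purely automatic ones.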

T170

51678-52274

Epistemic_statement

denotes

Several features of this updated version may help increase the knowledge and use of ortholog groups by the virology community: the new web interface is simplified for less experienced users and offers new browsing and visualization capabilities; the new version has improved consistency and impact through incorporation and linking of Gene Ontology terms, KEGG pathways and UniProt and SMART/Pfam domains to ortholog group assignments; and the algorithms and pipelines for assignment are made freely available on the website, encouraging individuals to apply the techniques to their own datasets.

T171

52275-52582

Epistemic_statement

denotes

In recent years, the growth in both the volume and types of data that can be considered bioinformatic in nature has forced the scientific community to consider how much of a role manual forms of genome annotation, curation and maintenance can continue to play in assigning knowledge to new genomes.

T172

52583-52771

Epistemic_statement

denotes

A compromise may be found in the maintenance of manually curated reference genomes, and development of programs to aid in increasing the accuracy of automated annotations of new sequences.

T173

52772-52997

Epistemic_statement

denotes

Finally, the question remains whether funding will be made available for tailoring databases to the needs of virology researchers working on the wide array of bioinformatics challenges that these new data sets are generating.

CORD-19:03476153a6c18a85c98721c557f35b1a09755d14 JSON TXT 9 Projects
