Id |
Subject |
Object |
Predicate |
Lexical cue |
T1 |
213-462 |
Epistemic_statement |
denotes |
Discussions will be illustrated with a limited number of virus databases, including our own -the Viral Orthologous Clusters (VOCs) database, which forms the core of Virology.ca that supports researchers working with a variety of large dsDNA viruses. |
T2 |
463-730 |
Epistemic_statement |
denotes |
Depending where you are in the world, GenBank, (maintained by the National Center for Biotechnology Information, NCBI, USA), the EMBL Data Library (maintained in the UK) or the DNA Databank of Japan may first come to mind when thinking about bioinformatics databases. |
T3 |
731-854 |
Epistemic_statement |
denotes |
However, these large repositories of genomic, and other, sequence information have a relatively simple flat-file structure. |
T4 |
855-1034 |
Epistemic_statement |
denotes |
When submitting a viral genome sequence to one these databases, the sequence data must be annotated with various information including the positions of coding DNA sequences (CDS). |
T5 |
1035-1133 |
Epistemic_statement |
denotes |
Flat-file databases can only be queried for information that is explicitly stated within the file. |
T6 |
1134-1255 |
Epistemic_statement |
denotes |
For example, queries can be made by keywords in the protein name, but not derivative characteristics such as gene length. |
T7 |
1659-1816 |
Epistemic_statement |
denotes |
When building such a relational database, the designer must define the structure of all of the data records to be stored and their association to each other. |
T8 |
1817-2016 |
Epistemic_statement |
denotes |
The database schema illustrates these relationships and acts as a map to the organization of the data storage locations, that is specific containers (dataset tables) for storing each type of dataset. |
T9 |
2320-2441 |
Epistemic_statement |
denotes |
Various file formats aim to capture this viral sequence data and associated knowledge, including GenBank and XML formats. |
T10 |
2553-2651 |
Epistemic_statement |
denotes |
Similarly, these files can be worked with using standard text editors or word-processing programs. |
T11 |
2652-2842 |
Epistemic_statement |
denotes |
A downside is that minor changes to the format of GenBank files can cause software tools that read these files (parsers) to fail if they have not been programmed to handle these differences. |
T12 |
3153-3307 |
Epistemic_statement |
denotes |
Although XML files are excellent for representing relationships in a form that can be read by a computer, they are more difficult for users to understand. |
T13 |
3596-3823 |
Epistemic_statement |
denotes |
However, if a database is to be useful to the virology research community, then there must be at least one additional component in this system -an easy to use interface to enable a virologist to interact with the database, i.e. |
T14 |
3903-3974 |
Epistemic_statement |
denotes |
Furthermore, the database itself may form part of an analysis pipeline. |
T15 |
4145-4386 |
Epistemic_statement |
denotes |
Rather than the database returning a file of the sequences to be formatted before submission to a program such as ClustalO or MUSCLE for alignment, it might be desirable to have the database send the sequences directly to the alignment tool. |
T16 |
4387-4535 |
Epistemic_statement |
denotes |
The tool can then run the desired function before returning the processed data within a visualization tool that simplifies editing of the final MSA. |
T17 |
4536-4690 |
Epistemic_statement |
denotes |
Although simplification of this process with default parameters is clearly valuable to the user, over-simplification can lead to restricted function, i.e. |
T18 |
4739-4966 |
Epistemic_statement |
denotes |
Therefore, it is valuable to provide a variety of export options allowing users to obtain formats that can be used directly with multiple analysis tools and permit full range of parameter access, as discussed in later sections. |
T19 |
4967-5099 |
Epistemic_statement |
denotes |
Design of the user interface with potential queries and flexibility of data retrieval in mind is key to the utility of the database. |
T20 |
5238-5461 |
Epistemic_statement |
denotes |
Generally, the databases are optimized for two main functionalities: 1) storing and updating structured data with high integrity and 2) providing tools for searching, retrieving, summarizing and possibly analyzing the data. |
T21 |
5753-5863 |
Epistemic_statement |
denotes |
The relationships between these tables are explicitly defined and are not limited to one-to-one relationships. |
T22 |
6010-6125 |
Epistemic_statement |
denotes |
An additional aspect of any discussion on database design is maintenance, which can be viewed from several aspects. |
T23 |
6481-6620 |
Epistemic_statement |
denotes |
A second view of maintenance, which is more integral to database design, concerns the data: 1) how will data be imported into the database? |
T24 |
6621-6672 |
Epistemic_statement |
denotes |
2) will data need to be edited (updated/corrected)? |
T25 |
6673-6757 |
Epistemic_statement |
denotes |
Again, these questions have a natural division along the lines of virus genome type. |
T26 |
6758-7003 |
Epistemic_statement |
denotes |
For example, some form of automated process is essential to import the thousands of genomes required for an influenza or HCV (small genomes, fixed gene complement) resource, but these small sequences are unlikely to need subsequent modification. |
T27 |
7004-7269 |
Epistemic_statement |
denotes |
In contrast, a resource such as Virology.ca, which deals with large poxviruses, needs to contend with far less genomes, but must handle changes to genome annotation both before and after import into the database due to large, less well characterized, dsDNA genomes. |
T28 |
7270-7502 |
Epistemic_statement |
denotes |
Therefore, while a software script to automatically collect new genome GenBank files and insert them into a database might be feasible for an influenza virus database, this process cannot be used for poxvirus genomes in Virology.ca. |
T29 |
7503-7606 |
Epistemic_statement |
denotes |
The complexity of the latter frequently leads to variations (sometimes errors) in annotation protocols. |
T30 |
7607-7715 |
Epistemic_statement |
denotes |
Furthermore, new discoveries surrounding gene function mean that old genomes must sometimes be re-annotated. |
T31 |
7716-7907 |
Epistemic_statement |
denotes |
Therefore, key to the design of database systems like Virology.ca is the inclusion of a manual, user-friendly database-editing tool that will allow virologists to enter and maintain the data. |
T32 |
7908-8022 |
Epistemic_statement |
denotes |
Another frequent need for database editing in Virology.ca results from the assignment of genes to ortholog groups. |
T33 |
8353-8595 |
Epistemic_statement |
denotes |
Although orthology is simply the prediction of a common ancestor between genes of different species, the extreme diversity among viruses in a single taxonomic family creates difficulties in the accurate assignment of genes to ortholog groups. |
T34 |
9098-9203 |
Epistemic_statement |
denotes |
Unfortunately, this work is sometimes viewed as a non-discovery based science and funding is problematic. |
T35 |
9380-9561 |
Epistemic_statement |
denotes |
Although many bioinformatics databases may be in some way relevant to viruses, the majority of virologists tend to think of, and also use most frequently, genome sequence databases. |
T36 |
9562-9668 |
Epistemic_statement |
denotes |
As a result, it would be hard to find a virologist who hasn't performed a BLAST search of these databases. |
T37 |
9669-9892 |
Epistemic_statement |
denotes |
Similarly, any virologist who has partially or completely sequenced a viral genome is familiar with the process of annotating the sequence for submission to GenBank (or local equivalent) before their paper can be published. |
T38 |
9893-10189 |
Epistemic_statement |
denotes |
Three levels of annotated information exist: basic information within a GenBank file that can be directly transferred into databases; values calculated or predicted from the basic data; and curated information, beyond that provided when first developing the GenBank file, added manually by users. |
T39 |
10190-10429 |
Epistemic_statement |
denotes |
The GenBank sequence file usually contains basic virus identification information, affiliations of the annotator, CDS locations within the genome, metadata, gene and product type/names, as well as mRNA splicing information, if appropriate. |
T40 |
10549-10747 |
Epistemic_statement |
denotes |
However, the sequence data itself can be used to calculate various values that might be required by the end-user such as A + T content, protein pI, molecular mass of proteins and amino acid content. |
T41 |
10748-10928 |
Epistemic_statement |
denotes |
Additionally, the data may be used as input for predictive tools, such as those that look for functional motifs that are associated with enzymatic function or protein localization. |
T42 |
10929-10985 |
Epistemic_statement |
denotes |
Manual curation can also provide additional information. |
T43 |
10986-11207 |
Epistemic_statement |
denotes |
For example, there is generally no information describing the likely reliability of the annotations or the source of the information (experimental data, software prediction or scientific literature) within a GenBank file. |
T44 |
11277-11431 |
Epistemic_statement |
denotes |
Unfortunately, it is rare that this kind of supplementary information is incorporated into databases in a manner that can be searched in a meaningful way. |
T45 |
11432-11610 |
Epistemic_statement |
denotes |
Keeping record of the reliability of annotations is especially important when investigating the genes of unknown function, frequently found within the varied large dsDNA viruses. |
T46 |
11611-11729 |
Epistemic_statement |
denotes |
Functional assignment of the associated proteins relies heavily on the transfer of function from orthologous proteins. |
T47 |
11730-11951 |
Epistemic_statement |
denotes |
Orthologous proteins, which may only be 20-30% identical, often do not match the same set of functional motifs, and vary in their ability to match distantly-related proteins that might give important clues as to function. |
T48 |
12074-12175 |
Epistemic_statement |
denotes |
As noted above, accurate genome annotation is critical to providing a useful bioinformatics database. |
T49 |
12176-12278 |
Epistemic_statement |
denotes |
However, this must be built upon a foundation of accurate, and preferably complete, genome sequencing. |
T50 |
12279-12373 |
Epistemic_statement |
denotes |
The complexity of the annotation problem increases exponentially with genome size and novelty. |
T51 |
12558-12968 |
Epistemic_statement |
denotes |
Large viruses, such as these, bring two additional complications to the annotation problem: 1) only a subset of the viral genes are essential, so some orthologs may be functional in one species but fragmented in another and, 2) some of the more diverse viruses have > 100 genes currently annotated as "hypothetical", and even the DNA polymerase protein may share only 32% aa identity with its closest relative. |
T52 |
13125-13370 |
Epistemic_statement |
denotes |
Without the use of either a closely related genome, or multiple more distant relatives with similar gene sets, for reference genomes in annotation, it is very difficult to predict which ORFs are: true genes, gene fragments, or random small ORFs. |
T53 |
13486-13623 |
Epistemic_statement |
denotes |
Although some groups share common physical virion structures, often only a few genes are found in common between related groups of phage. |
T54 |
13783-13870 |
Epistemic_statement |
denotes |
Thus, the complexity of annotation relates directly to viral genome size and diversity. |
T55 |
13993-14121 |
Epistemic_statement |
denotes |
Additionally, this process can often be fully automated as part of a pipeline to import newly sequenced genomes into a database. |
T56 |
14279-14542 |
Epistemic_statement |
denotes |
The IRD pipeline aligns input nucleotide sequences against a consensus sequence profile to "identify possible sequencing errors, determine the influenza type, segment number, and for segments 4 and 6 of type A, the subtype, and translate the nucleotide sequence". |
T57 |
15226-15430 |
Epistemic_statement |
denotes |
Both partial matches and potential novel genes are flagged in the GATU interface for the annotator to investigate further and make a final decision; tools are included in GATU to make this process easier. |
T58 |
15550-15722 |
Epistemic_statement |
denotes |
The majority of these tools depend upon similarity searches against previously characterized genes, which can result in underannotation of unique or highly divergent genes. |
T59 |
15723-15832 |
Epistemic_statement |
denotes |
For this reason, the use of a general list of genes may be more useful than using a specific reference virus. |
T60 |
15833-16067 |
Epistemic_statement |
denotes |
Some annotation tools, especially for prokaryotic and eukaryotic genomes, attempt gene prediction -searching for promoter-like sequences and other gene characteristics, however, these are less useful for the annotation of viral genes. |
T61 |
16259-16345 |
Epistemic_statement |
denotes |
This is invaluable in the annotation of large and complex viruses, such as poxviruses. |
T62 |
16682-16795 |
Epistemic_statement |
denotes |
Although this change might alter a protein's function, most annotators would probably annotate this altered gene. |
T63 |
16938-17218 |
Epistemic_statement |
denotes |
In this example the greatly shortened protein is very unlikely to be functional, however, because the second part of the gene is still present and represents 70% of the original gene this is sometimes annotated (Cii) even though it is very unlikely to be translated from the mRNA. |
T64 |
17349-17444 |
Epistemic_statement |
denotes |
Although the protein product is not changed, the larger ORF is sometimes erroneously annotated. |
T65 |
17445-17595 |
Epistemic_statement |
denotes |
These annotation errors illustrate the need for more complex automated evaluation of annotation results or the continued input of the human annotator. |
T66 |
17596-17771 |
Epistemic_statement |
denotes |
Although the various databases and associated sequence searching/analysis tools comprise a very valuable set of resources, it is also important to recognize their limitations. |
T67 |
17772-17931 |
Epistemic_statement |
denotes |
Researchers need to remain wary of the various types of information because regardless of the source, errors and out-of-date files can fall through the cracks. |
T68 |
18086-18220 |
Epistemic_statement |
denotes |
If available, reference genomes can be very valuable and be used to predict/correct potential errors either automatically or manually. |
T69 |
18424-18599 |
Epistemic_statement |
denotes |
However, many of the genes present in the large DNA viruses such as poxviruses are non-essential; they can be, and often are, truncated in some viruses and complete in others. |
T70 |
18600-18697 |
Epistemic_statement |
denotes |
Thus, researchers must remain aware of potential errors, regardless of the source or genome type. |
T71 |
18941-19059 |
Epistemic_statement |
denotes |
• Errors of ignorance -a simple lack of knowledge: ortholog/paralog relationships not understood or viruses mis-named. |
T72 |
19295-19386 |
Epistemic_statement |
denotes |
Another limitation of most databases is the absence of evidence for particular annotations. |
T73 |
19497-19637 |
Epistemic_statement |
denotes |
However, even when evidence notes are present, there may be problems since experiments may eventually be discredited and/or require updates. |
T74 |
19638-19758 |
Epistemic_statement |
denotes |
Curating such a system, essentially with the annotations as a living document, would also be incredibly labor intensive. |
T75 |
19759-19965 |
Epistemic_statement |
denotes |
The lack of an unambiguous and controlled naming standard that is carried across all viruses and databases results in variable descriptions and data that are difficult to query for specific characteristics. |
T76 |
19966-20139 |
Epistemic_statement |
denotes |
For example, software may not be programmed to recognise synonymous terms such as "ssDNA," "single strand DNA," "single-stranded DNA," and "single stranded DNA" as the same. |
T77 |
20140-20288 |
Epistemic_statement |
denotes |
Where possible, import systems should have locks on permitted words, based on a universal, controlled vocabulary, in order to reduce this ambiguity. |
T78 |
20668-20877 |
Epistemic_statement |
denotes |
In addition, although some terms will still be applicable to only particular viruses (eg, the phages, see viral head-tail joining; GO:0098005), GO viral terms are designed to be species-neutral where possible. |
T79 |
20878-21064 |
Epistemic_statement |
denotes |
However, complete implementation of a consistent and accurate vocabulary, such as that presented by GO, will only work if scientists choose to sustain it through their own participation. |
T80 |
21220-21361 |
Epistemic_statement |
denotes |
Although the quality of raw data, annotations, and database structure is clearly important, a database is only as useful as its search tools. |
T81 |
21362-21536 |
Epistemic_statement |
denotes |
The database must be able to execute the questions posed by virologists, which for user convenience is often accomplished through the use of a Graphical User Interface (GUI). |
T82 |
21537-21743 |
Epistemic_statement |
denotes |
Although it is impossible to predict all of the queries that might be requested, the system should be flexible enough to provide a reasonably close search that may require minor post-search data processing. |
T83 |
21878-22134 |
Epistemic_statement |
denotes |
For example, in a study examining the genomic variation of H1N1 influenza viruses obtained from humans between the years of 2000-2010, the researchers would have specific search parameters regarding influenza A subtype, host species, and year of isolation. |
T84 |
22135-22403 |
Epistemic_statement |
denotes |
If the search interface did not permit all of these search parameters, then the researcher would be left with an arduous task of manually sorting the results that, depending upon the volume of information and computer skills of the user, may be far too time consuming. |
T85 |
22404-22614 |
Epistemic_statement |
denotes |
Another aspect of the searching, which is virus-specific, is that for viruses with high sequencing volumes at similar times and locations, such as influenza, there may be many identical genomes in the database. |
T86 |
22840-23106 |
Epistemic_statement |
denotes |
Although it would be preferable to have database resources supported long term so that they can respond to users requests for new queries etc., it is not necessarily an efficient use of resources to build every feature requested by users into the database interface. |
T87 |
23107-23294 |
Epistemic_statement |
denotes |
Clearly a cost-benefit analysis must be performed on the requests for features to be included in the software so that money and effort can be targeted at the most-used database functions. |
T88 |
23295-23477 |
Epistemic_statement |
denotes |
However, the system must be able to provide users that have one-off analyses some basic filters to work with, while users must accept that it may off-load some data analysis to them. |
T89 |
23664-23846 |
Epistemic_statement |
denotes |
FASTA formatting of nucleotide and protein sequences is a standard because multiple sequences can be incorporated into one file, and they can be read by many bioinformatics programs. |
T90 |
23847-23905 |
Epistemic_statement |
denotes |
However, most annotations are stripped out of these files. |
T91 |
23906-24048 |
Epistemic_statement |
denotes |
In contrast, GenBank files contain gene annotations, but these files can be tricky for software to read due to non-standard formatting errors. |
T92 |
24049-24223 |
Epistemic_statement |
denotes |
The most basic output format is comma-separated values (csv), a tabular output that can often be read by other software, or even spreadsheet programs such as Microsoft Excel. |
T93 |
24224-24295 |
Epistemic_statement |
denotes |
Yet, for many researchers data in these formats is very tedious to use. |
T94 |
24296-24461 |
Epistemic_statement |
denotes |
Therefore, databases are frequently paired with visualization and analysis tools, or permit the export of data in a format that can be accepted by external programs. |
T95 |
24462-24605 |
Epistemic_statement |
denotes |
For example, the Virology.ca database, VOCs, is linked to the Base-By-Base (BBB) visualization tool, which displays and allows editing of MSAs. |
T96 |
24918-25011 |
Epistemic_statement |
denotes |
Databases are commonly divided by data type; however, this is not as simple as it might seem. |
T97 |
25012-25163 |
Epistemic_statement |
denotes |
For example, a series of databases, each dedicated to a family of viruses, might all need to support many different types of molecular biological data. |
T98 |
25164-25317 |
Epistemic_statement |
denotes |
Alternatively, databases managing a particular data type (eg, sequence or virion structure) would need to deal with many taxonomically different viruses. |
T99 |
25378-25581 |
Epistemic_statement |
denotes |
Although this article aims to provide an updated overview of the biological databases relevant to viruses, as a print publication it is important to note that this resource will become quickly out-dated. |
T100 |
25944-26097 |
Epistemic_statement |
denotes |
Alternatively, Google may be your friend, but it will help if the name of the database is not easily confused with a variety of other Internet resources. |
T101 |
26098-26280 |
Epistemic_statement |
denotes |
The Virus Pathogen Resource (ViPR, pronounced viper) is particularly tricky to find if you don't know it's a Bioinformatics Resource Center and therefore you should look for viprbrc. |
T102 |
26281-26429 |
Epistemic_statement |
denotes |
The search is further complicated by the existence of VIPERdb, a separate database of virus capsid structures and the VIPRE Antivirus software tool. |
T103 |
26430-26560 |
Epistemic_statement |
denotes |
Although this review focuses on virus databases, most of these are reliant on other generic databases as the source of their data. |
T104 |
27017-27141 |
Epistemic_statement |
denotes |
This provides up-to-date public access to nucleotide sequence data that can be accessed through any of the three interfaces. |
T105 |
27542-27751 |
Epistemic_statement |
denotes |
This information helps in the reproducibility of genome assemblies in cases requiring review, and allows the user to analyse whether unexpected results are characteristic of the virus or a systematic artifact. |
T106 |
27752-27957 |
Epistemic_statement |
denotes |
However, advances in sequencing technology raise the question of the value of saving raw sequencing data; re-sequencing of samples (given their availability) is fast, easy, cheap and increasingly accurate. |
T107 |
28056-28222 |
Epistemic_statement |
denotes |
Therefore, if sequence errors are detected or new genes discovered within a genome by other research groups, it might be impossible to update the original submission. |
T108 |
28223-28407 |
Epistemic_statement |
denotes |
To deal with this problem, NCBI has created the Reference Sequences resource (RefSeq), which offers up-to-date reference genomes for taxonomically diverse organisms, including viruses. |
T109 |
28692-28798 |
Epistemic_statement |
denotes |
RefSeq files are not limited to nucleotide sequences, but also offer transcript and protein sequence data. |
T110 |
29079-29731 |
Epistemic_statement |
denotes |
The database is divided into several branches: UniProtKB, the protein knowledgebase with two subsections, TrEMBL (translated EMBL Nucleotide sequence data library) that stores automatically annotated proteins prior to review and Swiss-Prot, containing proteins that have been manually annotated and reviewed, and often have associated literature; UniParc functions as an archive, sorting new, revised and obsolete sequences with a non-redundant numbering scheme allowing outdated UniProt references from past literature to be traceable; and UniRef100, 90 and 50 branches that cluster proteins into groups of 100%, 90% and 50% aa identity, respectively. |
T111 |
29732-29937 |
Epistemic_statement |
denotes |
To ensure up-to-date public access, new protein sequences must be submitted to UniProt prior to publication, the new protein sequences associated with a genome submitted to GenBank are automatically added. |
T112 |
29938-30129 |
Epistemic_statement |
denotes |
Knowledge of a protein's 3D structure can assist in the prediction of its functions and interactions, which are important aspects of understanding viral processes and drug and vaccine-design. |
T113 |
30130-30245 |
Epistemic_statement |
denotes |
As a result, 3D structures have been determined for many viral proteins, and in some cases, of the complete virion. |
T114 |
30246-30375 |
Epistemic_statement |
denotes |
The Protein Data Bank (PDB) collects all biochemical structures, but searches can be limited to the structures of viral proteins. |
T115 |
30749-31325 |
Epistemic_statement |
denotes |
These include: VIPERdb, which is maintained at the Scripps Research Institute and is a database for icosahedral virus capsid structures; The Big Picture Book of Viruses, a large catalog of virus pictures with associated information; Virus World, a summary of pictures and links available from PDB and VIPERdb organized by virus name; and Viral Protein Structure Resource (ViPs), a database that aims to provide a central source for all viral protein structures and provides a genome map feature allowing the user to determine which genes have structural information available. |
T116 |
31629-31836 |
Epistemic_statement |
denotes |
Several of these will be discussed below; however, as an aside, there are also aspects of the standard databases that provide useful compartmentalization of data to make working with specific viruses easier. |
T117 |
32503-32655 |
Epistemic_statement |
denotes |
Understanding these differences can help categorize the various virus-specific databases, and address which types of information can be drawn from each. |
T118 |
32736-32869 |
Epistemic_statement |
denotes |
At the most basic level, a website, can be used to present a collection of links to sequences in files or more traditional databases. |
T119 |
33005-33155 |
Epistemic_statement |
denotes |
Although these resources usually deal with small data sets, they can help offer a more visually appealing and manageable access point for researchers. |
T120 |
33156-33327 |
Epistemic_statement |
denotes |
Although the same files can be easily stored on a local desktop computer, there is an important benefit to accessing from a website -the data should be the same/unaltered. |
T121 |
33328-33506 |
Epistemic_statement |
denotes |
Since most researchers tend to collect dozens of versions of sequences, often with various edits, accessing data from a website ensures the sequence is what it is supposed to be. |
T122 |
34030-34165 |
Epistemic_statement |
denotes |
For example, when NCBI BLAST searches the sequence databases, it is possible to filter the results by keywords and taxonomy categories. |
T123 |
34783-34980 |
Epistemic_statement |
denotes |
Examples of this genome-related data could include: G + C% content of genes and genomes, codon composition of genes, pI of proteins, predicted MW of proteins and amino acid composition of proteins. |
T124 |
35326-35663 |
Epistemic_statement |
denotes |
The remainder of this section will compare and contrast three types of virus database resources: the Influenza Virus Resource at NCBI; the Virus Pathogen Resource Bioinformatics Resource Center (ViPRbrc) supported by the NIH, which supports a variety of viruses; and the author's Virology.ca resource, which supports large dsDNA viruses. |
T125 |
35879-36050 |
Epistemic_statement |
denotes |
The search interface supports the query of sequences (protein, CDS or nucleotide) by influenza type, genome segment, serotype, collection date, host and country of origin. |
T126 |
36051-36209 |
Epistemic_statement |
denotes |
Additional filters can remove sequences that are not full length (of segment), that are not part of a full genome set, or that are identical to another virus. |
T127 |
36210-36346 |
Epistemic_statement |
denotes |
In addition, pandemic H1N1 sequences can be included or excluded, as can sequences from lab strains, vaccines or certain other projects. |
T128 |
36347-36508 |
Epistemic_statement |
denotes |
Once selected, sequences can be exported to a local computer or aligned with subsequent generation of a phylogenetic tree; the interface is completely web-based. |
T129 |
36794-36952 |
Epistemic_statement |
denotes |
ViPRbrc is funded by NIH to support research on viruses on the NIAID Category A-C Priority Pathogen lists, and those causing (re)emerging infectious diseases. |
T130 |
37695-37814 |
Epistemic_statement |
denotes |
Although the database selection, retrieval and analysis tools are still web-based, they are comprehensive and advanced. |
T131 |
37815-37936 |
Epistemic_statement |
denotes |
Some of the analyses provide graphical output (eg, genome maps), but have limited comparative tools for further analysis. |
T132 |
38105-38287 |
Epistemic_statement |
denotes |
This allows the user to store a variety of retrieved sequence sets, eliminating the need to repeat the tedious selection process, which in turn likely increases accuracy of the work. |
T133 |
38515-38664 |
Epistemic_statement |
denotes |
Data can also be exported from the database in a variety of formats, including GFF3, which is a tab-delimited format for describing genomic features. |
T134 |
38942-39256 |
Epistemic_statement |
denotes |
To this aim, ViPRbrc tools allow BLAST searches of custom (virus-specific) databases and the creation and visualization of MSAs and phylogenetic trees, and one of the more valuable services provided is the tool for Analysis of Sequence Variation, that is SNPs (also amino acid variation) along sequences of an MSA. |
T135 |
39257-39457 |
Epistemic_statement |
denotes |
In contrast to ViPRbrc, Virology.ca is dedicated to the support of large dsDNA viruses and as noted previously, this involves the provision of different tools to perform comparative genomics analyses. |
T136 |
39458-39668 |
Epistemic_statement |
denotes |
Although the genomes are 20 -30 times the size of the RNA viruses, the current poxvirus database at Virology.ca contains less than 270 genomes, with the most studied virus species having 30 -50 representatives. |
T137 |
39819-40000 |
Epistemic_statement |
denotes |
However, only 40 are present in all poxvirus genomes, but this conserved set of core genes increases to about 80 if poxviruses that infect insects are excluded from the calculation. |
T138 |
40001-40096 |
Epistemic_statement |
denotes |
The remaining non-essential genes are often associated with host range and virulence functions. |
T139 |
40097-40325 |
Epistemic_statement |
denotes |
In the Orthopoxvirus genus, the biggest genomes tend to be found in viruses with the widest host range; it is thought that the restriction of viruses to a limited host range is associated with the loss of genes or gene function. |
T140 |
41763-41843 |
Epistemic_statement |
denotes |
The program can also be used to view RNA-Seq data and analyze for recombination. |
T141 |
41844-42184 |
Epistemic_statement |
denotes |
BBB offers its own format of data storage, an XML file (.bbb), that allows for additional features such as the storage of user comments, primer annotations, and genome MSAs with gene annotations JDotter A program for generating dotplots; suitable for whole genomes, sub-genomes or protein sequences Genome Annotation Transfer Utility (GATU) |
T142 |
42185-42261 |
Epistemic_statement |
denotes |
A tool used to annotate genomes based on a closely related reference genome. |
T143 |
42420-42544 |
Epistemic_statement |
denotes |
GATU also suggests novel genes to the human annotator, who has last word on the annotation process Sequence Searcher (SSeq): |
T144 |
42545-42645 |
Epistemic_statement |
denotes |
An easy-to-use Java tool for searching protein and DNA sequences for user-specified sequence motifs. |
T145 |
42714-43016 |
Epistemic_statement |
denotes |
These features are helpful when working with the large DNA viruses because of the uncertainty associated with the annotation of various genes -genes that are predicted to encode small proteins with a very high or very low pI and unusual amino acid composition are likely to represent annotation errors. |
T146 |
43017-43200 |
Epistemic_statement |
denotes |
Indeed, many of the features present in Base-By-Base and VGO, the tools that display the genome sequences and genome maps, respectively are devoted to helping solve annotation issues. |
T147 |
43201-43345 |
Epistemic_statement |
denotes |
For the large DNA viruses, accurate annotation is very important because a common investigation asks, Why is virus X more virulent than virus Y? |
T148 |
43550-43662 |
Epistemic_statement |
denotes |
So far, we have divided discussion issues by genome type; however, two additional databases should be mentioned. |
T149 |
45150-45279 |
Epistemic_statement |
denotes |
This information, defining relationships among viruses, aids in the classification and understanding of newly discovered viruses. |
T150 |
45885-46016 |
Epistemic_statement |
denotes |
However, there are 78 families of viruses that not assigned to an order, including one called Unassigned, which contains 14 genera. |
T151 |
46269-46477 |
Epistemic_statement |
denotes |
Sequences need to be organized to facilitate the process of similarity searching, for example with BLAST, for related sequences and also organized by functional or source (human, rodent, virus) relationships. |
T152 |
46478-46690 |
Epistemic_statement |
denotes |
However, it's neither feasible nor affordable to create independent database resources for every organism, therefore resources like ViPRbrc that can function with a variety of viruses may become more commonplace. |
T153 |
46691-46862 |
Epistemic_statement |
denotes |
With respect to genome sequencing, one of the greatest problems lies with the huge volume of raw data (sequencing reads) that is associated with any final genome sequence. |
T154 |
46863-47098 |
Epistemic_statement |
denotes |
The authors recently received 3 GB of compressed sequencing data, a mix of host and virus sequences, to assemble a 150 kb poxvirus genome; it is estimated that genomic data will soon become the world's largest consumer of disk storage. |
T155 |
47456-47646 |
Epistemic_statement |
denotes |
Perhaps, in the not too distant future, it will not be unusual to have our own genome sequenced, as well as our various organ microbiomes and the genomes of any pathogens we are infected by. |
T156 |
47647-48053 |
Epistemic_statement |
denotes |
To deal with the benefits of the new sequencing technology's ability to generate massive sequence coverage, which helps to reduce errors in the initial sequencing process and reveals the natural variation among the genomes of a virus population, we must also develop new annotation strategies such as how to annotate SNPs in virus genomes and gene fragments in the non-essential gene sets of large viruses. |
T157 |
48054-48164 |
Epistemic_statement |
denotes |
The goal of this type of annotation is to provide the virologist with more information in comparative studies. |
T158 |
48299-48443 |
Epistemic_statement |
denotes |
The questioner is really asking about the presence of functional genes, but would likely be interested in knowing the mechanism of the deletion. |
T159 |
48444-48592 |
Epistemic_statement |
denotes |
For example, whether the gene was inactivated by a series of deletions or a single nucleotide change that could be the result of a sequencing error. |
T160 |
48593-48910 |
Epistemic_statement |
denotes |
Not only is the number of genomes requiring annotation increasing exponentially, but the nature of the data is also changing with metagenomics' ability to determine all of the nucleotide sequences from a crude environmental sample without the need to culture microbes in the laboratory or probe for a particular gene. |
T161 |
48911-49126 |
Epistemic_statement |
denotes |
Since many organisms fail to grow in the laboratory, we cannot grow viruses that are unique to them; thus, metagenomics provides a valuable window into the previously unknown diversity of viruses in the environment. |
T162 |
49127-49264 |
Epistemic_statement |
denotes |
Even with large-scale environmental sequencing under way, most metagenomic sequences appear to be unrelated to currently known sequences. |
T163 |
49365-49452 |
Epistemic_statement |
denotes |
However, metagenomics also comes with its own list of special problems and limitations. |
T164 |
49549-49675 |
Epistemic_statement |
denotes |
Genomic assembly can be difficult because the metagenomic short reads often provide less coverage than traditional sequencing. |
T165 |
49881-50027 |
Epistemic_statement |
denotes |
Although assembly against a reference genome is helpful, this approach can't be used with environmental samples full of unknown microbial species. |
T166 |
50168-50312 |
Epistemic_statement |
denotes |
Furthermore, metadata is often left out of annotations, yet it is required for the data to be put into a useful, comparable, biological context. |
T167 |
50483-50606 |
Epistemic_statement |
denotes |
Thus, it is crucial that annotation of metadata be included as a mandatory and standardized component of genome submission. |
T168 |
50607-50913 |
Epistemic_statement |
denotes |
With the accumulation of more and more diverse sequences from metagenomic sequencing of environmental samples, the organization of gene/protein sequences into Clusters of Orthologous groups (COGs; http://www.ncbi.nlm.nih.gov/COG/) and Viral COGs (VOGs) becomes especially useful for defining relationships. |
T169 |
50914-51260 |
Epistemic_statement |
denotes |
Even when matches between orthologs may be at the limit of detection, the greater the number of sequences in a cluster, the greater the chance of making new matches by virtue of transitive sequence comparisons, that is SeqA matches SeqB, but not SeqC; if SeqB also matches SeqC then a transitive relationship between SeqA and SeqC is established. |
T170 |
51678-52274 |
Epistemic_statement |
denotes |
Several features of this updated version may help increase the knowledge and use of ortholog groups by the virology community: the new web interface is simplified for less experienced users and offers new browsing and visualization capabilities; the new version has improved consistency and impact through incorporation and linking of Gene Ontology terms, KEGG pathways and UniProt and SMART/Pfam domains to ortholog group assignments; and the algorithms and pipelines for assignment are made freely available on the website, encouraging individuals to apply the techniques to their own datasets. |
T171 |
52275-52582 |
Epistemic_statement |
denotes |
In recent years, the growth in both the volume and types of data that can be considered bioinformatic in nature has forced the scientific community to consider how much of a role that the manual forms of genome annotation, curation and maintenance can continue to play in assigning knowledge to new genomes. |
T172 |
52583-52771 |
Epistemic_statement |
denotes |
A compromise may be found in the maintenance of manually curated reference genomes, and development of programs to aid in increasing the accuracy of automated annotations of new sequences. |
T173 |
52772-52997 |
Epistemic_statement |
denotes |
Finally, the question remains whether funding will be made available for tailoring databases to the needs of virology researchers working on the wide array of bioinformatics challenges that these new data sets are generating. |