CORD-19:d0620683a194e739baca5c13917899ad7e5b2337


Id Subject Object Predicate Lexical cue
TextSentencer_T1 0-87 Sentence denotes VIP: an integrated pipeline for metagenomics of virus identification and discovery OPEN
TextSentencer_T2 89-97 Sentence denotes Abstract
TextSentencer_T3 98-315 Sentence denotes Identification and discovery of viruses using next-generation sequencing technology is a fast-developing area with potential wide application in clinical diagnostics, public health monitoring and novel virus discovery.
TextSentencer_T4 316-450 Sentence denotes However, the tremendous volume of sequence data from NGS studies has posed great challenges in both accuracy and speed for the application of NGS.
TextSentencer_T5 451-607 Sentence denotes Here we describe VIP ("Virus Identification Pipeline"), a one-touch computational pipeline for virus identification and discovery from metagenomic NGS data.
TextSentencer_T6 608-911 Sentence denotes VIP performs the following steps to achieve its goal: (i) map and filter out background-related reads, (ii) extensive classification of reads on the basis of nucleotide and remote amino acid homology, (iii) multiple k-mer based de novo assembly and phylogenetic analysis to provide evolutionary insight.
TextSentencer_T7 912-1052 Sentence denotes We validated the feasibility and veracity of this pipeline with sequencing results of various types of clinical samples and public datasets.
TextSentencer_T8 1053-1262 Sentence denotes VIP has also contributed to timely virus diagnosis (~10 min) in acutely ill patients, demonstrating its potential in unbiased NGS-based clinical studies that demand a short turnaround time.
TextSentencer_T9 1263-1364 Sentence denotes VIP is released under GPLv3 and is available for free download at: https://github.com/keylabivdc/VIP.
TextSentencer_T10 1365-1426 Sentence denotes The world contains a high diversity of human viral pathogens.
TextSentencer_T11 1427-1539 Sentence denotes There are approximately 300 recognized viral pathogen species, and additional species continue to be discovered.
TextSentencer_T12 1540-1653 Sentence denotes The identification of viral pathogens has a tremendous impact on infectious diseases, virology and public health.
TextSentencer_T13 1654-1944 Sentence denotes Nearly all of the outbreaks of public health issues over the last decade have been caused by viruses, including Severe Acute Respiratory Syndrome (SARS) coronavirus 1 , 2009 pandemic influenza H1N1 2 , H7N9 avian influenza viruses 3 and the recently described Ebola virus in West Africa 4 .
TextSentencer_T14 1945-2147 Sentence denotes Traditional diagnostic methods for viruses, such as cell culture, serodiagnosis, or nucleic acid-based testing are narrow in scope and require a priori knowledge of the potential infectious agents 5,6 .
TextSentencer_T15 2148-2304 Sentence denotes Accurate diagnosis and timely treatment for the infection dramatically reduced the risk of continued transmission and mortality in hospitalized patients 7 .
TextSentencer_T16 2305-2479 Sentence denotes Wide interest in comprehensive detection of these newly emerging and re-emerging viruses from clinical samples highlights the need for rapid, broad-spectrum diagnostic assays.
TextSentencer_T17 2480-2607 Sentence denotes Shotgun metagenomic sequencing of clinical samples for viral pathogen identification provides a promising alternative solution.
TextSentencer_T18 2608-2956 Sentence denotes Although metagenomics is typically applied to understanding genomic diversity from environmental samples, this methodology has also revolutionized virology with comprehensive applications, including viral pathogen identification of infectious disease in clinical laboratories 8 and virus discovery in acute and chronic illnesses of unknown origin 9 .
TextSentencer_T19 2957-3179 Sentence denotes Many novel viruses have been discovered using popular next-generation sequencing (NGS) platforms such as pyrosequencing (454 Roche), semiconductor sequencing (Life Technologies) and Illumina dye sequencing (Illumina) 10-12 .
TextSentencer_T20 3180-3503 Sentence denotes Achievements obtained by viral metagenomics show significant advantages over traditional methods of identifying a viral pathogen, including no need for sequence information for that pathogen, identification of multiple pathogens in a single assay and elimination of the need for time-consuming culturing or antibody laboratory tests.
TextSentencer_T21 3504-3557 Sentence denotes A key feature of the latest NGS platforms is their speed.
TextSentencer_T22 3558-3625 Sentence denotes The minimum turnaround time for sequencing is about 8 hours 13 .
TextSentencer_T23 3626-3717 Sentence denotes Thus, it is critical that subsequent computational handling of the large amount of sequence
TextSentencer_T24 3719-3836 Sentence denotes data generated in viral metagenome sequencing must be performed within a timeframe suitable for actionable responses.
TextSentencer_T25 3837-4062 Sentence denotes Most commercial NGS services, however, offer basic bioinformatics support such as de novo sequence assembly or mapping to reference genomes, but will not process further to the specifics of virus identification and discovery.
TextSentencer_T26 4063-4156 Sentence denotes There are many bioinformatics tools developed specifically for virus detection from NGS data.
TextSentencer_T27 4157-4255 Sentence denotes In general, these pipelines take a computational subtraction approach to pathogen detection.
TextSentencer_T28 4256-4445 Sentence denotes Reads corresponding to the host (e.g., human) are first removed, followed by alignment to reference databases (DB) that contain sequences from candidate pathogens 14-18 .
TextSentencer_T29 4446-4592 Sentence denotes The most common approach to virus identification is local alignment against a reference DB, for example with the Basic Local Alignment Search Tool (BLAST) algorithm 19 .
TextSentencer_T30 4593-4811 Sentence denotes Analysis pipelines that use faster algorithms (e.g., Bowtie or Bowtie2) for host computational subtraction, such as VirFinder 15 and VirusFinder 16 rely on traditional BLAST approaches for final pathogen determination.
TextSentencer_T31 4812-5074 Sentence denotes BLAST is generally used in these tools for classification of viral reads at the nucleotide level (BLASTn), followed by less stringent protein alignments using a translated amino acid alignment (BLASTx) for identification of novel viruses with divergent genomes.
TextSentencer_T32 5075-5128 Sentence denotes However, BLAST is too slow for massive data from NGS.
TextSentencer_T33 5129-5247 Sentence denotes For example, end-to-end processing times, even on multicore computational servers, can take several days to weeks 14 .
TextSentencer_T34 5248-5293 Sentence denotes Another issue is related to de novo assembly.
TextSentencer_T35 5294-5564 Sentence denotes The majority of pipelines utilized to assemble metagenomics data were originally developed to assemble single genomes. However, single-genome assemblers were not designed to assemble multiple genomes from metagenomics data with nonuniform sequence coverage 20 .
TextSentencer_T36 5565-5709 Sentence denotes Problems in the assembly results may include chimeric contigs (reads artificially combined during assembly), which are not easy to recognize.
TextSentencer_T37 5710-6022 Sentence denotes Additional limitations of the available bioinformatics software for viral pathogen identification include high hardware requirements (multicore servers), the need for bioinformatics expertise, and the lack of validated and clear results to permit confident identification of viruses from metagenomics NGS data.
TextSentencer_T38 6023-6145 Sentence denotes Biologists often have to rely on professional bioinformaticians to process NGS data, posing a bottleneck in data analysis.
TextSentencer_T39 6146-6332 Sentence denotes Here we present VIP (Virus Identification Pipeline), a one-touch bioinformatic pipeline for virus identification with fairly self-explanatory results in Hypertext Markup Language (HTML).
TextSentencer_T40 6333-6645 Sentence denotes VIP performs extensive classification of reads against DB collected by Virus Pathogen Resource (ViPR) 21 and Influenza Research Database (IRD) 22 nucleotide DB in fast mode and against the virus sequences with NCBI Refseq (http://www.ncbi.nlm.nih.gov/refseq/) and their neighbor genomes in sense mode (Fig. 1) .
TextSentencer_T41 6646-6801 Sentence denotes Novel viruses are first identified in sense mode by amino acid alignment to the NCBI virus Refseq amino acid DB, followed by phylogenetic characterization.
TextSentencer_T42 6802-6979 Sentence denotes Raw NGS reads are first preprocessed by removal of adapter, low-quality, and low-complexity sequences, followed by computational subtraction of host-related reads using Bowtie2.
TextSentencer_T43 6980-7064 Sentence denotes In fast mode, viruses are identified by Bowtie2 alignment to ViPR/IRD nucleotide DB.
TextSentencer_T44 7065-7194 Sentence denotes In sense mode, bacteria and related rRNA (ribosomal RNA) reads are removed and the remaining reads are aligned to the virus database.
TextSentencer_T45 7195-7292 Sentence denotes Unmatched reads are then aligned to a viral protein database from NCBI Refseq DB using RAPSearch.
TextSentencer_T46 7293-7387 Sentence denotes All matched reads are classified under a genus for de novo assembly and phylogenetic analysis.
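The workflow in the preceding rows (T42 to T46) can be summarized as an ordered list of stages per mode. The following is only a minimal sketch: the function name `plan_stages` and the stage labels are hypothetical and are not taken from the VIP code base, and it assumes that genus classification, de novo assembly and phylogenetic analysis follow the alignment steps in both modes, which the text does not state explicitly.

```python
def plan_stages(mode):
    """Return ordered, illustrative stage labels for 'fast' or 'sense' mode.

    The labels paraphrase the workflow description; they are not the
    real step names used by VIP.
    """
    # Preprocessing and host subtraction are described for all raw reads.
    stages = [
        "remove adapter, low-quality and low-complexity sequences",
        "subtract host-related reads (Bowtie2)",
    ]
    if mode == "fast":
        stages.append("align reads to ViPR/IRD nucleotide DB (Bowtie2)")
    elif mode == "sense":
        stages += [
            "remove bacterial and rRNA reads",
            "align remaining reads to virus nucleotide DB",
            "align unmatched reads to viral protein DB (RAPSearch)",
        ]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: these downstream steps apply to matched reads in either mode.
    stages += [
        "classify matched reads by genus",
        "de novo assembly",
        "phylogenetic analysis",
    ]
    return stages
```

Calling `plan_stages("sense")` lists the two-step nucleotide-then-protein classification between host subtraction and assembly, which is the main difference from fast mode.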
TextSentencer_T47 7388-7491 Sentence denotes VIP reports include the read distribution and a summary table of classified reads with taxonomic assignments.
TextSentencer_T48 7492-7576 Sentence denotes In addition, the results of phylogenetic analysis and a genomic coverage map are attached.
TextSentencer_T49 7577-7850 Sentence denotes Notably, VIP generates results in a clinically actionable timeframe of minutes to hours with lower hardware requirements by leveraging two alignment tools, Bowtie2 23 and RAPSearch 24 , which have computational times that are orders of magnitude faster than BLAST packages.
TextSentencer_T50 7851-8025 Sentence denotes Here we evaluate the performance of VIP for virus characterization using various publicly available NGS datasets generated on different platforms (454, Ion Torrent and Illumina).
TextSentencer_T51 8026-8143 Sentence denotes These data sets encompass a variety of clinical infections, detected pathogens, sample types, and depths of coverage.
TextSentencer_T52 8144-8288 Sentence denotes We also demonstrate the use of the pipeline for detection of re-emerging dengue virus 1 (DENV-1) from a case with a febrile status in Guangzhou.
TextSentencer_T53 8289-8351 Sentence denotes Accuracy of the classification strategy using in silico data.
TextSentencer_T54 8352-8511 Sentence denotes In order to evaluate the accuracy of the two-step classification strategy (sense mode), a query data set of 125 base pair (bp) reads was generated in silico.
TextSentencer_T55 8512-8897 Sentence denotes The viral reads were generated at mutation rates ranging from 1% up to 40% from four known representative viruses, namely Dengue Virus 2 (DENV-2, a single positive-stranded RNA), Epstein-Barr virus (EBV, a double-stranded DNA), Norovirus (a genetically diverse group of single-stranded RNA) and H7N9 (segmented negative-sense RNA), using wgsim (https://github.com/lh3/wgsim).
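To illustrate what generating 125 bp reads at a given mutation rate involves, here is a minimal sketch in Python. It is a simplified stand-in and not wgsim itself: it only introduces substitutions (no indels or sequencing errors), and the function name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, read_len=125, mutation_rate=0.01, seed=0):
    """Sample fixed-length reads from a reference and substitute bases.

    Each base is independently replaced by a different base with
    probability `mutation_rate`. Substitutions only; this is a
    simplification of what a real read simulator does.
    """
    if len(reference) < read_len:
        raise ValueError("reference is shorter than the read length")
    rng = random.Random(seed)  # seeded for reproducible output
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        # Pick a random start so the whole read fits inside the reference.
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < mutation_rate:
                # Substitute with one of the other three bases.
                read[i] = rng.choice([b for b in bases if b != base])
        reads.append("".join(read))
    return reads
```

Running this over a range of rates such as 0.01 to 0.40 would give one query set per mutation rate, mirroring the design described above.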
TextSentencer_T56 8898-9195 Sentence denotes Receiver operating characteristic (ROC) curves 25 were generated to assess the sensitivity and specificity of the classification methods at different mutation rates in classifying viral reads (Figure 2; see Supplementary Information). Nearly all the results achieved 100% specificity.
TextSentencer_T57 9196-9303 Sentence denotes The specificity was slightly reduced (99.83%) when classifying the DENV-2 viral reads at 40% mutation rate.
TextSentencer_T58 9304-9370 Sentence denotes The sensitivity decreased as the mutation rate increased.
TextSentencer_T59 9371-9501 Sentence denotes Accurate virus identification, where sensitivity > 90%, was still achieved at a mutation rate of 20% using optimal parameters.
TextSentencer_T60 9502-9638 Sentence denotes Nevertheless, the overall poor performance in classifying the viral reads at high mutation rates (> 20%) requires further investigation.
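The sensitivity and specificity figures quoted in this subsection follow the standard definitions over read-level counts. The sketch below assumes that each simulated read is labeled viral or non-viral and then counted as a true or false positive or negative; the function name and the counts in the usage note are hypothetical and are not data from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from classification counts.

    sensitivity = TP / (TP + FN): fraction of viral reads recovered.
    specificity = TN / (TN + FP): fraction of non-viral reads rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative read")
    return tp / (tp + fn), tn / (tn + fp)
```

For instance, with made-up counts of 90 true positives, 10 false negatives, 9,983 true negatives and 17 false positives, the function returns a sensitivity of 0.90 and a specificity of 0.9983. Sweeping the classifier's score threshold and recomputing the pair at each setting yields the points of an ROC curve.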
TextSentencer_T61 9639-9726 Sentence denotes Performance comparison between assembly software packages and assembly strategy in VIP.
TextSentencer_T62 9727-10001 Sentence denotes In this study, we also compared the de novo assembly performance of VIP, IDBA_UD 26 and the recently described Ensemble Assembler 27 using public datasets, in-house datasets and an in silico dataset at a mutation rate of 3% for DENV-2, EBV, Norovirus and H7N9.
TextSentencer_T63 10002-10069 Sentence denotes In-house Perl scripts were developed to calculate assembly metrics.
TextSentencer_T64 10070-10215 Sentence denotes The assembly statistics calculated for each assembly result to assess its performance are shown in Table 1 .
TextSentencer_T65 10216-10438 Sentence denotes This table describes the overall length and genome coverage estimators per assembly, including the N50, the percent of genome coverage of the largest contig (%largest_contig_coverage) and the percent of genome coverage of all contigs (%contig_coverage).
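For readers unfamiliar with the N50 statistic mentioned above, a minimal Python sketch follows. It uses the common definition (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length). The authors computed their metrics with in-house Perl scripts that are not shown here, so this is an illustration and not their implementation; the function name `n50` is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running sum to half the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is reported alongside the two genome coverage percentages.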
TextSentencer_T66 10439-10532 Sentence denotes In general, VIP prominently improved the N50 and obtained overall better outcomes (Table 1) .
TextSentencer_T67 10533-10665 Sentence denotes For example, in in-house dataset 1, the N50 obtained by VIP was 6058, while those of Ensemble Assembler and IDBA_UD were 207 and 358, respectively.
TextSentencer_T68 10666-10783 Sentence denotes The same trends were also found for the percent of genome coverage of the largest contig and the percent of genome coverage of all contigs.
TextSentencer_T69 10784-10880 Sentence denotes For two datasets, SRR453448 and SRR1106553, IDBA_UD produced slightly better results than VIP.
TextSentencer_T70 10881-10979 Sentence denotes The simulation data set was to evaluate the assembly performance and VIP demonstrated to build the
TextSentencer_T71 10981-11132 Sentence denotes To evaluate VIP, we first tested the ability of VIP to detect the presence of viruses in various samples generated from different sequencing platforms.
TextSentencer_T72 11133-11268 Sentence denotes Table 2 lists the NGS datasets available at Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) for our benchmark experiment.
TextSentencer_T73 11269-11388 Sentence denotes These samples were infected with viruses of diverse types and all of the results were confirmed by independent experiments.
TextSentencer_T74 11389-11485 Sentence denotes As indicated in Table 1 , VIP identified the same virus type or subtype in all the test samples.
TextSentencer_T75 11486-11752 Sentence denotes For example, the sample for data source SRR1106123 was isolated from serum; Hepatitis C virus and Hepatitis G virus were found with the VIP pipeline, with genome coverage of 100% for both identified viruses, which was confirmed by independent experiments.
TextSentencer_T76 11753-11841 Sentence denotes Furthermore, VIP has the power to discover previously undetected viruses in a dataset.
TextSentencer_T77 11842-12129 Sentence denotes In data source SRR1170820 from the Ion Torrent PGM platform, Bovinecircovirus (BoCV) was reported by a traditional test, while with VIP, in addition to Bovinecircovirus, Bovine Viral Diarrhea Virus (BVDV) was also identified, and the genome coverage was 99.90% and 86.13%, respectively.
TextSentencer_T78 12130-12299 Sentence denotes We also evaluated VIP using 3 additional in-house samples from different sequencing platforms, including 2 mixed samples (sequenced by Hiseq 2000 and 454, respectively) .
TextSentencer_T79 12300-12379 Sentence denotes These 3 in-house samples were isolated from stool, serum and swab respectively.
TextSentencer_T80 12380-12792 Sentence denotes For the mixed swab samples, we found Respiratory Syncytial Virus (RSV) and Human coronavirus 229E using our method, and the genome coverage were 98.41% and 17.00% respectively, it is worth noting that the number of reads of Human coronavirus 229E was only 262, which was merely 0.01 percent of its total reads (5,147,814 reads), the presence of Human coronavirus 229E was then confirmed by nested PCR experiment.
TextSentencer_T80 12380-12792 Sentence denotes For the mixed swab samples, we found Respiratory Syncytial Virus (RSV) and Human coronavirus 229E using our method, and the genome coverage were 98.41% and 17.00% respectively, it is worth noting that the number of reads of Human coronavirus 229E was only 262, which was merely 0.01 percent of its total reads (5,147,814 reads), the presence of Human coronavirus 229E was then confirmed by nested PCR experiment.
TextSentencer_T81 12793-12926 Sentence denotes VIP was able to correctly identify the virus types for all of these samples and hence demonstrated its robustness in virus detection.
TextSentencer_T81 12793-12926 Sentence denotes VIP was able to correctly identify the virus types for all of these samples and hence demonstrated its robustness in virus detection.
TextSentencer_T82 12927-13076 Sentence denotes We selected 3 datasets from SRA to evaluate the sensitivity and specificity of VIP in identifying candidate viruses in clinical metagenomics datasets.
TextSentencer_T83 13077-13228 Sentence denotes Note that the results for the 3 datasets have been confirmed by related experiments, and the GenBank accession number of each reference sequence was provided.
TextSentencer_T84 13229-13319 Sentence denotes Sensitivity and specificity were used to evaluate the performance of the classification algorithm.
TextSentencer_T85 13320-13399 Sentence denotes Sensitivity, or true positive rate (TPR), was calculated as TPR = TP/(TP + FN).
TextSentencer_T86 13400-13604 Sentence denotes Specificity, or true negative rate (TNR), was calculated as TNR = TN/(TN + FP), in which TP and TN represent true positives and true negatives, and FN and FP represent false negatives and false positives.
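The two definitions above can be written out as a minimal sketch. The function names and the read counts in the usage note are illustrative assumptions, not values from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TPR = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TNR = TN / (TN + FP)."""
    return tn / (tn + fp)
```

For instance, with hypothetical counts of 98 true positives and 2 false negatives, `sensitivity(98, 2)` gives 0.98.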
TextSentencer_T87 13605-13755 Sentence denotes The gold standard criterion for a correct viral classification was BLASTn alignment against the target viral genome at an E-value cutoff of 10^−8 14 .
TextSentencer_T88 13756-13910 Sentence denotes According to the results, the sensitivity ranged from 97.03% to 99.83%, and the specificity was above 99.89% in all cases (see Supplementary Table S1).
TextSentencer_T89 13911-14014 Sentence denotes We analyzed a clinical sample from the 2014 Guangzhou dengue outbreak to test the speed and feasibility of VIP.
TextSentencer_T90 14015-14289 Sentence denotes As shown in Fig. 3, within a 60-hr sample-to-answer turnaround time and ~10 min of VIP analysis time (fast mode), a nearly complete genome (96.32%) of dengue virus 1 (DENV-1) was obtained, and the fold coverage for coding regions was better than that for non-coding regions.
TextSentencer_T91 14290-14410 Sentence denotes The subsequent phylogenetic tree also placed this sequence on the DENV-1 branch, confirming the percentage coverage result.
TextSentencer_T92 14411-14679 Sentence denotes As no other pathogens were identified by the VIP pipeline, a subsequent confirmatory real-time PCR assay was carried out to verify VIP's result; the real-time PCR correctly amplified the genome of DENV-1, which supported the primary conclusion of DENV-1 infection.
TextSentencer_T93 14680-14742 Sentence denotes The patient recovered spontaneously without any complications.
TextSentencer_T94 14743-14853 Sentence denotes State-of-the-art NGS technology is maturing and is increasingly being introduced in routine laboratories.
TextSentencer_T95 14854-14963 Sentence denotes The price of NGS is falling dramatically, in some cases below the price of traditional identification methods.
TextSentencer_T96 14964-15071 Sentence denotes NGS has emerged as one of the most promising strategies for the detection of pathogens in clinical samples.
TextSentencer_T97 15072-15162 Sentence denotes Viral metagenomics provides comprehensive viral pathogen identification in a single assay.
TextSentencer_T98 15163-15360 Sentence denotes However, bioinformatics analysis for viral metagenomics has been the bottleneck when deploying NGS for virus identification, owing to the absence of a designated bioinformatician in most laboratories.
TextSentencer_T99 15361-15537 Sentence denotes Our goal is to establish a one-touch computational pipeline in routine laboratories for virus identification from metagenomics NGS data generated from complex clinical samples.
TextSentencer_T100 15538-15625 Sentence denotes The recently described SURPI is an ultrafast computational tool for pathogen identification.
TextSentencer_T101 15626-15712 Sentence denotes SURPI and VIP share generally the same strategy, "subtraction to identification".
TextSentencer_T102 15713-15852 Sentence denotes SURPI searches against the entire NCBI nucleotide and amino acid collections (nt/nr) for comprehensive identification of pathogens.
TextSentencer_T103 15853-15978 Sentence denotes The coverage map, a very attractive feature provided in SURPI, is also implemented in VIP with the strategy previously described.
TextSentencer_T104 15979-16042 Sentence denotes The major difference is the way the close reference is selected.
TextSentencer_T105 16043-16153 Sentence denotes SURPI maps to all nucleotide reference sequences corresponding to the genus picked out during alignment.
TextSentencer_T106 16154-16334 Sentence denotes VIP chooses a close reference based on the abundance of the reference genome, following the hypothesis that the genome coverage percentage correlates with the sequencing depth.
TextSentencer_T107 16335-16387 Sentence denotes The approaches for assembly share little similarity.
TextSentencer_T108 16388-16494 Sentence denotes VIP implements "classification to assembly", while SURPI uses the entire dataset to perform assembly.
TextSentencer_T109 16495-16735 Sentence denotes Results suggested that a higher N50 was achieved by VIP (data not shown). Compared to other pathogens, such as bacteria, fungi and parasites, most viral pathogens are RNA viruses, which suggests a potentially high mutation rate or divergent genomes.
TextSentencer_T110 16736-16912 Sentence denotes In some cases it may be impossible to find a suitable reference genome during the alignment procedure, which undermines the coverage map and underscores the importance of the phylogenetic analysis.
TextSentencer_T111 16913-17072 Sentence denotes The construction of a phylogenetic tree allows us to visualize the underlying genealogy between the candidate divergent virus and existing reference sequences.
TextSentencer_T112 17073-17240 Sentence denotes The effective de novo assemblies, which benefit from the classification method in VIP, generate longer and more accurate contigs to help construct the phylogenetic tree.
TextSentencer_T113 17241-17437 Sentence denotes Nevertheless, SURPI provided more comprehensive information for pathogen identification by searching against nt/nr, while VIP focused on viruses (see Supplementary Tables S3 and S4).
TextSentencer_T114 17438-17707 Sentence denotes To our knowledge, VIP is the first pipeline to provide both a genome coverage map and phylogenetic analysis for virus identification from metagenomics data, and it has been rigorously tested across multiple clinical sample types representing a variety of infectious diseases.
TextSentencer_T115 17708-17834 Sentence denotes VIP can process NGS datasets in different formats and with different read lengths, which suggests the universality of VIP.
TextSentencer_T116 17835-18080 Sentence denotes Notably, VIP can efficiently handle NGS data generated from animals and complex metagenomics samples such as stool and respiratory secretions, which are exposed to the environment and contain a large proportion of non-host sequences (Table 1).
TextSentencer_T117 18081-18252 Sentence denotes The two-step classification approach achieved high sensitivity and specificity where the virus mutation rate was less than 20% (Figure 2). Scientific Reports | 6:23774 | DOI:
TextSentencer_T118 18253-18270 Sentence denotes 10.1038/srep23774
TextSentencer_T119 18271-18428 Sentence denotes Finally, VIP combines classification with a multiple k-mer de novo assembly strategy to generate longer contigs for identification of divergent viral sequences.
TextSentencer_T120 18429-18634 Sentence denotes Results of VIP are easily accessible through web browsers, providing intuitive information including the summary table, genome coverage map and phylogenetic tree figure (see Supplementary Figure S1).
TextSentencer_T121 18635-18767 Sentence denotes Practical application indicates that VIP can be performed within a timeframe suitable for actionable responses in clinical settings.
TextSentencer_T122 18768-18907 Sentence denotes In 2014, there was a severe dengue outbreak in Guangzhou, Guangdong province, causing more than 40 thousand laboratory-confirmed cases 28 .
TextSentencer_T123 18908-19430 Sentence denotes We used the Ion Torrent PGM platform to sequence one unknown sample from this outbreak, and the newly developed VIP was then used to analyze the sequencing data. We successfully obtained the nearly complete DENV-1 genome, and the result was confirmed by a traditional detection method, which indicates potential applications in the rapid response to emergency outbreaks (Figure 3A). In addition, we also applied NGS and VIP to analyze the first imported MERS-CoV (Middle East Respiratory Syndrome Coronavirus) case in China.
TextSentencer_T124 19431-19533 Sentence denotes Within 36 hr, the full genome of MERS-CoV was obtained and confirmed by reverse transcription-PCR 29 .
TextSentencer_T125 19534-19595 Sentence denotes VIP also makes significant contributions to virus discovery.
TextSentencer_T126 19596-19724 Sentence denotes We recently analyzed cell culture supernatants and found that a considerable number of reads were classified under the Orbivirus genus.
TextSentencer_T127 19725-19816 Sentence denotes The orbivirus reads, however, shared less than 50% identity with several specific segments.
TextSentencer_T128 19817-19917 Sentence denotes The phylogenetic report showed that one of the contigs was included in the branch of Tibet orbivirus.
TextSentencer_T129 19918-20007 Sentence denotes The de novo assembly results from VIP provided insights for designing virus-specific primers.
TextSentencer_T130 20008-20151 Sentence denotes Following specific PCR to confirm the assembly result and fill the gaps, the nearly full genome of this virus was recovered (manuscript accepted).
TextSentencer_T131 20152-20287 Sentence denotes Another powerful application of NGS for virus identification is single-step high-throughput parallel sequencing of small RNAs 30 .
TextSentencer_T132 20288-20484 Sentence denotes The datasets available online 30 were subjected to VIP, and the results not only corresponded to the literature report but also expanded the candidate virus types (see Supplementary Table S2).
TextSentencer_T133 20485-20691 Sentence denotes Taken together, our results demonstrate that the proposed VIP combined with NGS has the advantages of simplicity, rapidity, universality and feasibility in the applications of virus detection and discovery.
TextSentencer_T134 20692-20866 Sentence denotes In addition to maintaining the software, we are currently upgrading VIP, including development of a user-friendly graphical user interface (GUI) based on a stand-alone web interface.
TextSentencer_T135 20867-21019 Sentence denotes Therefore, VIP shows great potential to be standardized and made readily and freely accessible to a wider audience of scientists in routine laboratories.
TextSentencer_T136 21020-21134 Sentence denotes Bioinformatics analysis is no longer the weak link when applying NGS as a diagnostic tool for infectious diseases.
TextSentencer_T137 21135-21255 Sentence denotes VIP comprises a series of Shell, Python, and Perl scripts for Linux and incorporates several open-source tools.
TextSentencer_T138 21256-21358 Sentence denotes VIP has a set of fixed external software and database dependencies and user-defined custom parameters.
TextSentencer_T139 21359-21526 Sentence denotes The pipeline accepts cross-platform results generated from 454, Ion Torrent and Illumina in a variety of formats such as FastQ, FastA, SAM and BAM.
TextSentencer_T140 21527-21616 Sentence denotes Reads are handled by concatenating the files into a single file for streamlined analysis.
TextSentencer_T141 21617-21649 Sentence denotes Data import and quality control.
TextSentencer_T142 21650-21772 Sentence denotes Raw NGS short read data can be imported by converting them into FastQ format using PICARD (http://picard.sourceforge.net).
TextSentencer_T143 21773-21831 Sentence denotes VIP will determine the encoding version of the input data.
TextSentencer_T144 21832-22021 Sentence denotes This is necessary to make sure that differences in the way quality scores are generated by different sequencing platforms are properly taken into account during preprocessing.
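The text does not say how VIP determines the quality encoding. One common heuristic, shown here purely as an assumed sketch, inspects the lowest quality character: characters below ASCII 59 can only occur in Phred+33 data, so anything else is guessed to be Phred+64. The function name `guess_phred_offset` and the cutoff handling are illustrative, not taken from VIP.

```python
def guess_phred_offset(quality_strings):
    """Guess the FastQ quality offset (33 or 64) from quality strings.

    Heuristic (an assumption, not VIP's documented method): ASCII codes
    below 59 are only valid under Phred+33; otherwise assume Phred+64.
    High-quality Phred+33 data can therefore be misjudged as Phred+64
    when only a few reads are sampled.
    """
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    return 33 if lowest < 59 else 64
```

In practice the guess becomes more reliable as more reads are sampled, because low-quality bases (and hence low ASCII codes) are more likely to appear.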
TextSentencer_T145 22022-22068 Sentence denotes VIP can also accept FastA format raw NGS data.
TextSentencer_T146 22069-22199 Sentence denotes For FastA input, however, the quality control step will only apply sequence-based strategies, with complexity and length as the main factors.
TextSentencer_T147 22200-22415 Sentence denotes Generally, the quality control step comprises trimming low-quality and adapter sequences, removing low-complexity sequences using the DUST algorithm and retaining reads of trimmed length > 20 bp using PRINSEQ 31 .
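The sequence-based part of this filter can be sketched as follows. This is not PRINSEQ's implementation: it computes an unscaled DUST-style score from trinucleotide counts, and the threshold `max_dust=2.0` is a hypothetical value chosen only for illustration. The length rule follows the text (trimmed length > 20 bp).

```python
from collections import Counter


def dust_score(seq):
    """Unscaled DUST-style low-complexity score from triplet counts.

    score = sum over triplets t of c_t * (c_t - 1) / 2, divided by
    (number of triplets - 1). Higher means more repetitive.
    """
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        return 0.0
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n_triplets - 1)


def passes_qc(seq, min_length=21, max_dust=2.0):
    """Keep reads longer than 20 bp whose DUST-style score is low enough.

    max_dust is a hypothetical cutoff, not a PRINSEQ or VIP default.
    """
    return len(seq) >= min_length and dust_score(seq) <= max_dust
```

A homopolymer such as thirty A's scores high and is rejected, while a read with mostly distinct triplets scores near zero and is kept.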
TextSentencer_T148 22416-22537 Sentence denotes In fast mode, Bowtie2 alignments are first performed against the host DB, followed by removal of the host-related sequences.
TextSentencer_T149 22538-22596 Sentence denotes The remaining reads are then aligned against the ViPR/IRD nucleotide DB.
TextSentencer_T150 22597-22709 Sentence denotes In sense mode, the initial alignment against the host DB and bacteria DB is followed by subtraction of the mapped reads.
TextSentencer_T151 22710-22787 Sentence denotes Background-subtracted reads are then aligned against a virus subset of the NCBI nt DB.
TextSentencer_T152 22788-22830 Sentence denotes Extensive classification and coverage map.
TextSentencer_T153 22831-22932 Sentence denotes Previous reports suggested that viruses have the potential to mutate rapidly or jump between species.
TextSentencer_T154 22933-23043 Sentence denotes Reads from mutated regions of these viruses might be left unclassified or classified into other species or strains.
TextSentencer_T155 23044-23198 Sentence denotes In order to avoid the misclassifications caused by mutations, we applied a two-step computational alignment strategy for classification in VIP (Fig. 4).
TextSentencer_T156 23199-23326 Sentence denotes In the first step, all matched reads will be assigned to a specific gene identifier (GI) after nucleotide/amino acid alignment.
TextSentencer_T157 23327-23675 Sentence denotes These reads are then taxonomically classified to the genus level by looking up the matched GI in the NCBI taxonomy database using SQL (Structured Query Language); i.e., according to the GI, the scientific names for each reference record, which comprise the species, genus and family information, are retrieved and appended to the alignment results.
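The GI-to-taxonomy lookup can be illustrated with a small SQLite sketch. The table layout (`gi_taxid`, `names`), the column names, and the GI and taxid numbers below are all hypothetical placeholders; the actual schema used by VIP is not given in the text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory taxonomy DB with a hypothetical schema and made-up IDs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gi_taxid (gi INTEGER PRIMARY KEY, taxid INTEGER);
        CREATE TABLE names (taxid INTEGER PRIMARY KEY,
                            species TEXT, genus TEXT, family TEXT);
        """
    )
    # GI and taxid values are placeholders, not real NCBI identifiers.
    conn.executemany("INSERT INTO gi_taxid VALUES (?, ?)", [(1, 101), (2, 102)])
    conn.executemany(
        "INSERT INTO names VALUES (?, ?, ?, ?)",
        [
            (101, "Dengue virus 1", "Flavivirus", "Flaviviridae"),
            (102, "Human coronavirus 229E", "Alphacoronavirus", "Coronaviridae"),
        ],
    )
    return conn


def annotate_hits(conn, hits):
    """Append (species, genus, family) to each (read_id, gi) alignment hit."""
    query = (
        "SELECT n.species, n.genus, n.family "
        "FROM gi_taxid AS g JOIN names AS n ON n.taxid = g.taxid "
        "WHERE g.gi = ?"
    )
    annotated = []
    for read_id, gi in hits:
        row = conn.execute(query, (gi,)).fetchone()
        # Hits whose GI is missing from the DB keep None for all three ranks.
        annotated.append((read_id, gi) + (row if row else (None, None, None)))
    return annotated
```

Calling `annotate_hits(build_demo_db(), [("read_1", 1)])` attaches the species, genus and family names to the hit, which mirrors the "appended to the alignment results" step described above.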
TextSentencer_T158 23676-23787 Sentence denotes Secondly, reads classified under a genus are automatically mapped to the most likely reference genome as follows.
TextSentencer_T159 23788-23921 Sentence denotes The abundance of the reference sequences of that genus selected during nucleotide alignment is calculated and sorted.
TextSentencer_T160 23922-24047 Sentence denotes Here we hypothesized that the genome coverage percentage correlates with the sequencing depth for specific reference sequences.
TextSentencer_T161 24048-24151 Sentence denotes In other words, a higher abundance of a genome suggests a higher possibility of recovering that genome.
TextSentencer_T162 24152-24285 Sentence denotes All the reference sequences with the following keywords are kept: (1) complete genomes; (2) complete sequences; or (3) complete cds.
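The abundance-based reference selection described above might look like the following sketch. The input shapes (a list of `(read_id, ref_id)` hits and a `ref_id -> title` mapping), the function name, and the reference identifiers in the usage note are assumptions made for illustration.

```python
from collections import Counter

# Keywords from the text; singular forms so that plurals also match.
KEYWORDS = ("complete genome", "complete sequence", "complete cds")


def rank_references(hits, titles):
    """Rank candidate references of one genus by the number of reads hitting them.

    hits   : iterable of (read_id, ref_id) pairs from the nucleotide alignment
    titles : dict mapping ref_id to its description line
    Only references whose title contains one of KEYWORDS are kept.
    Returns a list of (ref_id, read_count), most abundant first.
    """
    counts = Counter(
        ref_id
        for _, ref_id in hits
        if any(k in titles.get(ref_id, "").lower() for k in KEYWORDS)
    )
    return counts.most_common()
```

With hypothetical hits where `refA` ("..., complete genome") receives three reads and `refB` ("..., partial cds") receives five, `rank_references` returns only `refA`, since partial records are filtered out before ranking.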
TextSentencer_T163 24286-24421 Sentence denotes Assigned reads are directly mapped to all nucleotide reference sequences selected using optimal BLASTn at reward/penalty score (1/− 1).
TextSentencer_T163 24286-24421 Sentence denotes Assigned reads are directly mapped to all nucleotide reference sequences selected using optimal BLASTn at reward/penalty score (1/− 1).
TextSentencer_T164 24422-24509 Sentence denotes The optimal score strategy is most suitable for sequences with low conserved ratio 32 .
TextSentencer_T165 24510-24592 Sentence denotes For each genus, coverage map(s) for the reference sequence(s) were then generated.
TextSentencer_T166 24593-24657 Sentence denotes Phylogenetic analysis and multiple k-mer based de novo assembly.
TextSentencer_T167 24658-24803 Sentence denotes The construction of a phylogenetic tree allows us to visualize the underlying genealogy between the contiguous sequences and reference sequences.
TextSentencer_T168 24804-24959 Sentence denotes In order to perform a phylogenetic analysis of candidate viruses in a certain viral genus, a backbone with high quality and wide spectrum is indispensable.
TextSentencer_T169 24960-25191 Sentence denotes For a genus, sequences with RefSeq standards under that genus and the reference sequence used to generate the coverage map are selected for multiple sequence alignment with MAFFT 33 to build a backbone (Fig. 5) .
TextSentencer_T170 25192-25314 Sentence denotes The de novo assembly step benefits from the classification method in VIP for significant reduction of complexity of reads.
TextSentencer_T171 25315-25597 Sentence denotes Still, de novo assembly of reads from virus samples, especially RNA viruses, into a genome sequence is challenging due to the extremely uneven read depth distribution caused by amplification bias in the inevitable reverse transcription and PCR amplification steps during library preparation.
TextSentencer_T172 25598-25726 Sentence denotes We present the Multiple-k method in which various k-mer lengths are used for de novo assembly with Velvet-Oases [34] [35] [36] .
TextSentencer_T173 25727-26040 Sentence denotes In cases where sparse reads do not overlap sufficiently to permit de novo assembly into longer contiguous sequences, assigned reads and contigs are retained if they are more than 1.5× longer (an empirically chosen threshold) than the average length of the candidate reads.
TextSentencer_T175 26041-26265 Sentence denotes Finally, the largest contig after de novo assembly is added into the backbone to generate a phylogenetic tree by the unweighted pair-group method with arithmetic means (UPGMA), which is visualized with the Environment for Tree Exploration 37 .
TextSentencer_T176 26266-26285 Sentence denotes Reference database.
TextSentencer_T177 26286-26543 Sentence denotes A 3.8 gigabase (Gb) human nucleotide database (human DB) was constructed from a combination of human genomic DNA, unlocalized DNA (GRCh38/hg38), ribosomal RNA (rRNA, RefSeq), RNA (RefSeq), and mitochondrial DNA (RefSeq) sequences in NCBI as of July of 2015.
TextSentencer_T178 26544-26668 Sentence denotes The viral nucleotide databases in fast mode were constructed from a combination of sequences in ViPR/IRD as of July of 2015.
TextSentencer_T179 26669-26828 Sentence denotes The viral nucleotide DB in sense mode, consisting of 87071 entries, was constructed by collecting all RefSeq viral complete genomes and their neighbor genomes.
TextSentencer_T180 26829-26954 Sentence denotes The neighbor genomes were the complete non-RefSeq viral nucleotide sequences recorded in DDBJ, EMBL and GenBank.
TextSentencer_T181 26955-27020 Sentence denotes The viral protein databases were constructed from NCBI Refseq DB.
TextSentencer_T182 27021-27158 Sentence denotes The bacterial DB in sense mode was constructed from the collections of unique genome segments at species level within the GOTTCHA 38 package.
TextSentencer_T183 27159-27168 Sentence denotes Hardware.
TextSentencer_T184 27169-27284 Sentence denotes VIP is tested on a common desktop PC with a 3.4 GHz Intel Core i7-4770 and 16 GB of RAM (running Ubuntu 14.04 LTS).
TextSentencer_T185 27285-27394 Sentence denotes Minimum hardware requirements for running VIP include a PC with 4 GB of RAM running Ubuntu 14.04 LTS (preferred).
TextSentencer_T186 27395-27464 Sentence denotes VIP and its external dependencies require about 500 MB of disk space.
TextSentencer_T187 27465-27515 Sentence denotes Reference data requires about 70 GB of disk space.
TextSentencer_T188 27516-27620 Sentence denotes During VIP runtime, up to 10 × the size of the input file may be needed as additional temporary storage.
TextSentencer_T189 27621-27749 Sentence denotes All high-quality reads were aligned against the viral database using nucleotide alignment (Bowtie2) or amino acid alignment (RAPSearch2).
TextSentencer_T190 27750-27825 Sentence denotes Reads would be classified into each viral genus based on alignment results.
TextSentencer_T191 27826-28014 Sentence denotes VIP would choose a close reference based on the abundance of reference genomes, following the proposed hypothesis that the genome coverage percentage correlates with the genome sequencing depth.
TextSentencer_T192 28015-28164 Sentence denotes Candidate reads would be mapped against the selected reference with optimal BLASTn at reward/penalty score (1/−1), followed by coverage map generation.
TextSentencer_T193 28165-28175 Sentence denotes Figure 5 .
TextSentencer_T194 28176-28321 Sentence denotes Sequences with RefSeq standards under that genus and the reference sequence used to generate the coverage map are used to generate the backbone.
TextSentencer_T195 28322-28415 Sentence denotes The Multiple-k method with various k-mer lengths is used for de novo assembly with Velvet-Oases.
TextSentencer_T196 28416-28580 Sentence denotes The best contig is then added into the backbone with the unweighted pair-group method with arithmetic means (UPGMA) and visualized with the Environment for Tree Exploration (ETE).