CORD-19:a22ceac8d5e1cfaf2e79557d3c794458d0e1cdbe

Abstract

In numerous instances, tracking the biological significance of a nucleic acid sequence can be augmented through the identification of environmental niches in which the sequence of interest is present. Many metagenomic datasets are now available, with deep sequencing of samples from diverse biological niches. While any individual metagenomic dataset can be readily queried using web-based tools, meta-searches through all such datasets are less accessible. In this brief communication, we demonstrate such a meta-metagenomic approach, examining close matches to the Wuhan coronavirus 2019-nCoV in all high-throughput sequencing datasets in the NCBI Sequence Read Archive accessible with the keyword "virome". In addition to the homology to bat coronaviruses observed in descriptions of the 2019-nCoV sequence, we note a strong homology to numerous sequence reads in a metavirome dataset generated from the lungs of deceased pangolins reported by Liu et al. (Viruses 11:11, 2019, http://doi.org/10.3390/v11110979). Our observations are relevant to discussions of the derivation of 2019-nCoV and illustrate the utility and limitations of meta-metagenomic search tools in effective and rapid characterization of potentially significant nucleic acid sequences.

[...] Read Archive (SRA) using the SRA-tools package (version 2.9.1). The latter sequence data were downloaded as .sra files using the prefetch tool, with extraction to readable format (.fasta.gz) using the NCBI fastq-dump tool. Each of these manipulations can fail some fraction of the time. Obtaining the sequences can fail due to network issues, while extraction in readable format occasionally fails for unknown reasons. Thus the workflow continually requests .sra files with ncbi-prefetch until at least some type of file is obtained, followed by attempts to unpack into .fasta.gz format until one such file is obtained from each .sra file.
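The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the authors' workflow code: the helper names (run_until, fetch_and_convert), the attempt cap, and the file locations are assumptions. The prefetch and fastq-dump commands belong to SRA-tools, and the flags shown (--fasta, --gzip, --outdir) should be checked against the installed version.

```python
import glob
import os
import subprocess
import time


def run_until(action, done, max_attempts=10, delay_s=0.0):
    """Repeat action() until done() is true; return the number of attempts.

    The text describes retrying "continually"; the attempt cap is an added
    assumption so a permanently failing accession cannot loop forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            action()
        except (subprocess.CalledProcessError, OSError):
            pass  # network or extraction failure: fall through and retry
        if done():
            return attempt
        time.sleep(delay_s)
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def fetch_and_convert(accession, sra_path, out_dir):
    """Hypothetical wrapper: download one .sra file, then unpack to .fasta.gz.

    sra_path is where prefetch is expected to leave the file; that location
    depends on the local SRA-tools configuration and is an assumption here.
    """
    # Step 1: request the .sra file until some file is present.
    run_until(
        lambda: subprocess.run(["prefetch", accession], check=True),
        lambda: os.path.exists(sra_path),
    )
    # Step 2: unpack until a gzipped FASTA file exists for this accession.
    run_until(
        lambda: subprocess.run(
            ["fastq-dump", "--fasta", "0", "--gzip", "--outdir", out_dir, sra_path],
            check=True,
        ),
        lambda: bool(glob.glob(os.path.join(out_dir, accession + "*.fasta.gz"))),
    )
```

Separating the generic retry loop from the two tool invocations keeps the success test explicit (a file exists), which mirrors the description of retrying until "at least some type of file is obtained".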
Metagenomic datasets for analysis were chosen through a keyword search of the SRA [...]

[...] 2020. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infection, Genetics and Evolution 79:104212.

Meta-metagenomic searches allow for high-speed, low-cost identification of potentially significant biological niches for sequences of interest.

Introduction:

In the early years of nucleic acid sequencing, aggregation of the majority of published DNA and RNA sequences into public sequence databases greatly aided biological hypothesis generation and discovery. Search tools capable of interrogating the ever-expanding databases were facilitated by creative algorithm development and software engineering, and by the ever-[...] experiments on a terabyte scale, along with software able to search for similarity to a query sequence. We find that neither of these aspects of meta-metagenomic searches is infeasible with current data transfer and processing speeds. In this communication, we report the results of searching the recently-described 2019-nCoV coronavirus sequence through a set of metagenomic datasets with the tag "virome".

[...] alignments were carried out with standard consumer-grade computers.

Sequence data: All sequence data for this analysis were downloaded from the National Center for Biotechnology Information (NCBI) website, with individual sequences downloaded through a web interface and metagenomic datasets downloaded from the NCBI Sequence [...]

[...] with viral sequences, and likewise not capture every virus in the short sequence read archive. With up to 16 threads running simultaneously, total download time (prefetch) was approximately 2 days. Similar time was required for conversion to gzipped fasta files. A total of 9014 sequence datasets were downloaded and converted to fasta.gz files.
Most files contained large numbers of reads, with a small fraction of files containing very little data (only a few reads or reads of at most a few base pairs). The total dataset consists of 2.5 TB of compressed sequence data corresponding to approximately 10^13 bases.

Search Software: For rapid identification of close matches among large numbers of metagenomic reads, we used a simple dictionary based on the 2019-nCoV sequence (NCBI MN908947.3 Wuhan-Hu-1) and its reverse complement, querying every 8th k-mer along the individual reads for matches to the sequence. As a reference, and to benchmark the workflow further, we included several additional sequences in the query (Vaccinia virus, an arbitrary segment of a flu isolate, the full sequence of bacteriophage P4, and a number of putative polinton sequences from Caenorhabditis briggsae). The relatively small group of k-mers being queried (<10^6) allows a rapid search for homologs. This was implemented in a Python script run using the PyPy accelerated interpreter. We stress that this is by no means the most comprehensive or fastest search for large datasets. However, it is more than sufficient to rapidly find any closely matching sequence (with the downloading and conversion of the data, rather than the search, being rate limiting). Figure S1 [...]

Results:

To identify biological niches that might harbor viruses closely related to 2019-nCoV, we searched through publicly available metaviromic datasets. We were most interested in viruses with highly similar sequences, as these would likely be most useful in forming hypotheses about the origin and pathology of the recent human virus.
We thus set a threshold requiring matching of a perfect 32-nucleotide segment with a granularity of 8 nucleotides in the search (i.e., interrogating the complete database of k-mers from the virus with k-mers starting at nucleotide 1, 9, 17, 25, 33 of each read from the metagenomic data for a perfect match). This would catch any perfect match of 39 nucleotides or greater, with some homologies as short as 32 nucleotides captured depending on the precise phasing of the read.

All metagenomic datasets with the keyword "virome" in NCBI SRA as of January 2020 were selected for analysis in a process that required approximately 2 days each for downloading and conversion to readable file formats and one day for searching by k-mer match on a desktop workstation computer (i7 8-core). Together the datasets included [...] showing no matched 32-mers to 2019-nCoV. Of the datasets with matched k-mers, one was from an apparent synthetic mixture of viral sequences, while the remaining were all from vertebrate animal sources. The matches were from five studies: two bat-focused studies (7, 8), one bird-focused study (9), one small-animal-and-rodent focused study (10), and a study of pangolins (11) [Table 1].

The abundance and homology of viruses within a metagenomic sample are of [...]

[...] this thread and appear to have encountered it through a more targeted search than ours (this study has since been posted online in bioRxiv) (13). As noted by Wong et al. (13) [...]

The availability of numerous paths (both targeted and agnostic) toward identification of natural niches for pathogenic sequences will remain useful to the scientific community and to public health, as will vigorous sharing of ideas, data, and discussion of potential origins and modes of spread for epidemic pathogens.
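The sampled k-mer lookup described in the Results can be sketched as below. This is an illustrative reconstruction rather than the authors' script: the function names are hypothetical, and a Python set stands in for the "simple dictionary". It indexes every 32-mer of the reference and of its reverse complement, then tests only the read k-mers that start at every 8th position. Any 8 consecutive start positions include one sampled position, so a perfect match of 32 + 7 = 39 nucleotides always contains a sampled 32-mer, while a shorter match is found only when its phase happens to line up.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_index(reference, k=32):
    """Collect every k-mer of the reference and of its reverse complement."""
    index = set()
    reference = reference.upper()
    for strand in (reference, reverse_complement(reference)):
        for i in range(len(strand) - k + 1):
            index.add(strand[i:i + k])
    return index


def read_has_match(read, index, k=32, step=8):
    """Test the k-mers starting at read offsets 0, step, 2*step, ...

    These 0-based offsets correspond to nucleotides 1, 9, 17, ... in the
    1-based numbering used in the text. Reads shorter than k never match.
    """
    read = read.upper()
    return any(read[i:i + k] in index
               for i in range(0, len(read) - k + 1, step))
```

Sampling every 8th start position reduces the number of lookups per read by roughly a factor of eight, at the cost of possibly missing perfect matches between 32 and 38 nucleotides long.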
Metagenomic datasets with k=32-mer matches to MN908947.3 2019-nCoV. Details of the search and [...] are described in the legend to Table S1.

[...] SRR10168377. This plot addresses several challenges associated with the limited sequencing data available by attempting to provide the most favorable alignment of that sequence possible. To maximize sensitivity in detecting potential recombination, ambiguities in which two or more reads apparently disagreed (which were rare; approximately 1.2% of assigned bases) were resolved in favor of "no substitution" at any position if one read matches the 2019-nCoV genome. This will provide a lower bound of variation, although regions covered by a single read are still subject to amplification and sequencing error. Near-perfect overlaps between reads from SRR10168377 argue that such error is relatively low as agreement in those regions is 99. [...]

## This program is a fast metasearch that will look for instances of sequence reads matching
## a reference in at least one k-mer.
## Jazz18Heap is intended to look for evidence of matches to a relatively short reference sequence
## e.g., less than 100KB, but probably workable up to several MB in a large number of high throughput sequencing experiments
## Jazz18Heap is intended for finding relatively rare sequences (not common ones)
## Jazz18Heap doesn't substitute for many tools to align and track coverage. Its main goal is rapid identification of potentially homologous sequences ( [...]

bioRxiv preprint, doi: https://doi.org/10.1101/2020.02.08.939660. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
