CORD-19:a7804d77d1a75810af0148d8fea8acbe2941f1b9 JSON TXT

The effect of variant interference on de novo assembly for viral deep sequencing Abstract Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-24 generation sequencing (NGS) approach has surpassed Sanger for generating long viral sequences, yet how 25 variants affect NGS de novo assembly remains largely unexplored. Our results from >15,000 simulated 26 experiments showed that presence of variants can turn an assembly of one genome into tens to thousands of 27 contigs. This "variant interference" (VI) is highly consistent and reproducible by ten most used de novo 28 assemblers, and occurs independent of genome length, read length, and GC content. The main driver of VI is 29 pairwise identities between viral variants. These findings were further supported by in silico simulations, 30 where selective removal of minor variant reads from clinical datasets allow the "rescue" of full viral genomes 31 from fragmented contigs. These results call for careful interpretation of contigs and contig numbers from de 32 novo assembly in viral deep sequencing. Genomic surveillance of viruses is particularly important in light of their rapid rate of evolution. Viruses 44 have higher mutation rates than cellular-based taxa, with RNA viruses having mutation rates as high as 1.5 × 45 10 −3 mutations per nucleotide, per genomic replication cycle. 4 Due to this high mutation rate, it is well 46 established that most RNA viruses exist as a swarm of quasispecies, 5 with each quasispecies containing unique 47 single nucleotide polymorphisms (SNPs). The presence of these variants plays a key role in viral adaptation. 48 49 Due to viruses' rapid evolution, a single clinical sample often contains a mixture of many closely related 50 viruses. Viral quasispecies are mainly derived from intra-host evolution, with RNA viruses such as poliovirus, 51 human immunodeficiency virus (HIV), hepatitis C (HCV), influenza, dengue, and West Nile viruses maintaining 52 diverse quasispecies populations within a host. 6, 7, 8, 9, 10, 11, 12, 13 Conversely, the term "viral strains" often refers 53 to different lineages of viruses found in separate hosts, or a co-infection of viruses in the same host due to 54 multiple infection events. As a result, sequence divergence is usually higher when comparing viral strains 55 compared to quasispecies. In this study, we use the term "variant" to encompass both quasispecies and 56 strains regardless of how the variants originated in the biological samples. 57 58 Since many sequencing technologies produce reads that are significantly shorter than the target 59 genome size, a process to construct contigs, scaffolds, and full-length genomes is needed. Reference-mapping 60 and de novo assembly are the two primary bioinformatic strategies for genome assembly. Reference-mapping 61 requires a closely-related genome as input to align reads, while de novo assembly generates contigs without 62 The rise of NGS and de novo assembler use in GenBank viral sequences 73 74 GenBank viral entries from 1982-2017 were collected and analyzed, with extensive analyses performed 75 to evaluate technologies and bioinformatics programs cited in records deposited between 2011 and 2017. 76 Through 2017, there were over 2.3 million viral entries in GenBank; however, over 70% (1.7 million) do not 77 specify a sequencing technology [Supplement Table S1 ] due to the looser data requirement in earlier years. 78 When looking at recently deposited records (2014) (2015) (2016) (2017) , the Illumina sequencing platform was the most 79 common NGS platform used for viral sequencing, with about a 2-fold increase over the next most popular NGS 80 platform [ Figure 1d & e]. When long sequences (≥2,000 nt) are considered, NGS technologies surpassed 81 Sanger in 2017 as the dominant strategy for sequencing, comprising 53.8% (14,653/27,217) of entries 82 compared to 46.2% of entries (12,564/27,217) for Sanger [ Figure 1f and Supplement Table S2 ]. 83 84 Hybrid sequencing approaches, where researchers use more than one sequencing technology to 85 generate complete viral sequences, have also become more common over the past several years. The most 86 common combination observed was 454 and Sanger (18,002 entries), likely due to the early emergence of the 87 454 technology compared to other NGS platforms [ Figure 1c and Supplement Table S3 ]. However, combining 88 Illumina with various other sequencing platforms is quite commonplace (>10,000 entries). De novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, 91 Trinity, and Velvet) have increased from less than 1% of viral sequence entries in 2012, to 20% of all viral 92 sequence entries in 2017 [ Figure 1h & i]. A similar increase was observed for reference-mapping programs 93 (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 6.5% in 2017. Multifunctional programs (Suppl. Information) 94 that offer both assembly options were the most common programs cited for the years 2013-2017, but since 95 the exact sequence assembly strategy used for these records is unknown, the contributions of de novo 96 assembly are likely underestimated. An expanded summary of the sequencing technologies and assembly 97 approaches used for viral GenBank records is available in Supplement Tables S1-S6. 98 99 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint Effect of variant assembly using popular de novo assemblers 100 101 After establishing the growing use of NGS technologies for viral sequencing, we next focused on 102 understanding how the presence of viral variants may influence de novo assembly output. We generated 247 103 simulated viral NGS datasets representing a continuum of pairwise identity (PID) between two viral variants, 104 from 75% PID (one nucleotide difference every 4 nucleotides), to 99.6% PID (one nucleotide difference every 105 250 nucleotides) [ Figure 2 ]. For Experiment 1, these datasets were assembled using 10 of the most used de 106 novo assembly programs [ Figure 2 and Supplement Figure S1a ] to evaluate their ability to assemble the two 107 variants into their own respective contigs as the PID between the variants increases. One key observation is that the assembly result can change from two (correct) contigs to many 110 (unresolvable) contigs simply by having variant reads; the presence of viral variants affected the contig 111 assembly output of all 10 assemblers tested. The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA 112 assemblers shared a few commonalities, demonstrated by a conceptual model in Figure 3A . First, below a 113 certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de 114 novo assemblers produced two contigs correctly [ Figure 3 ]. We refer to this as "variant distinction" (VD), with 115 the highest pairwise identity where this occurs as the VD threshold. Above this threshold, the assemblers 116 produced tens to thousands of contigs [ Figure 3 ], a phenomenon we define as "variant interference" (VI). As 117 PID between the variants continue to increase, the de novo assemblers can no longer distinguish between the 118 variants and assembled all the reads into a single contig, a phenomenon we define as "variant singularity" 119 (VS). [ Figure 3 ]. The lowest pairwise identity where a single contig is assembled is the VS threshold. 120 121 Slight differences in the variant interference patterns (relative to the canonical variant interference 122 model) were observed for the 10 assemblers investigated. VD was observed for SPAdes, MetaSPAdes, and 123 ABySS assemblers. While it was not observed with Cap3 and IDBA with the current simulated data parameters, 124 we speculate that VD may occur at a lower PID level for these assemblers than tested in this study. The PID 125 range where VI was observed was distinct for each de novo assembler [ Figure 3 ]. During VI, SPAdes produced 126 as many as 134 contigs and ABySS produced 3,076 contigs, while MetaSPAdes, Cap3, and IDBA produced up to 127 10. 128 7 129 A different pattern was observed for Mira, Trinity, and SOAPdenovo2 assemblers. The average number 130 of contigs generated by Mira, Trinity, and SOAPdenovo2 was 5, 36, and 283, respectively across all variant PIDs 131 from 75%-99.96%. Specifically, Mira and Trinity generated fewer contigs at low PID, but produced many 132 contigs when the two variants reach 97.1% PID and 96.0% PID, respectively. For SOAPdenovo2, a larger 133 number of contigs were produced regardless of the PID. This indicates that these assemblers generally have 134 major challenges producing a single genome; this has been observed in previous studies comparing assembly 135 performance. 15 136 137 Finally, Geneious and CLC were the least affected by VI in the simulated datasets tested, returning only PID, but tens to thousands of contigs were generated at a slightly lower PID of 99.21%. This PID threshold, 158 99.21%, marked the drastic transition from VS to VI, whereas the transition from VI to VD (i.e., the VD 159 threshold) occurred at 98.99% PID [ Figure 4b ]. A correlation was observed between genome length and the 160 number of contigs produced during VI, where longer genomes returned proportionally more contigs as 161 expected as total VI occurrence should increase with length [r 2 = 0.967; p <0.0001 Figure 4b and 4c]. The read length of a given NGS dataset will vary depending on the sequencing platform and kits utilized 166 to generate the data. Since read length is an important factor for de novo assembly success, 16 we 167 hypothesized that it may also influence the ability to distinguish viral variants. For Experiment 3, using SPAdes For clinical samples, assembly of viral genomes is affected by multiple factors other than the presence 176 of variants, including sequencing error rate, host background reads, depth of genome coverage, and the 177 distribution (i.e., pattern) of genome coverage. We next utilized viral NGS data generated from four 178 picornavirus-positive clinical samples (one coxsackievirus B5, one enterovirus A71, and two parechovirus A3) 179 to explore VI in datasets representative of data that may be encountered during routine NGS. The NGS data 180 for each sample was partitioned into four bins of read data: (1) total reads after quality control (T); (2) major 181 variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background non-viral 182 reads only (MB) [ Figure 5 ]. These binned datasets were then assembled separately using three assembly 183 programs: SPAdes, Cap3, and Geneious. By comparing these manipulations, we aimed to test the hypothesis 184 that minor variants directly affect the performance of assembly through VI in real clinical NGS data. 185 186 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint Even with an adequate depth of coverage for genome reconstruction, assembly of total reads (T) in 187 11/12 experiments resulted in unresolved genome construction -resulting in numerous fragmented viral 188 contigs [ Figure 6 ]. The only exception was one experiment where one single PeV-A3 (S1) genome was 189 assembled using Cap3. When only reads from the major variant were assembled (M), full genomes were 190 obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious. Conversely, 191 assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of 192 contigs for 9 of the 12 sample and assembly software combinations tested [ Figure 6 ], indicating that VI due to Several experiments using simulated and clinical sample NGS data were performed to evaluate the 212 ability of genome assembly programs to distinguish genome variants. All assemblers investigated generated 213 fragmented assemblies when the data contained reads from two closely related variants due to "variant 214 interference" (VI). Changes in pairwise identity (PID) as small as 0.01% between the two variants triggered an 215 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint assembler to change from producing one or two contigs to producing hundreds of contigs. A quintessential 216 example of this phenomenon was the SPAdes assembly of EV-A71 sequences during the in silico experiments 217 with clinical NGS data. Assembly of major variant reads resulted in one full length contig [ Figure 5 ], whereas 218 assembly of datasets containing the major and minor variant reads (Mm and T) were characterized by a 219 number of contigs, resulting in "cobwebs" of contig fragments when visualized using Bandage [Supplement 220 Figure S2 ]. 22 Even though the de novo assembly graph linked the different contig fragments, the assembly 221 could not differentiate the multiple routes of possible contig construction. We speculate this is the main 222 reason why VI occurs in the context of de Bruijn graph assemblers. The simulated experiments suggested that genome length and read length influence VI; A longer 225 genome length will produce proportionally more contigs during VI, whereas a longer read length decreases 226 the PID range where VI occurs [ Figure 4 ]. While longer read length improves assembly, unfortunately, 227 platforms that produce long reads such as Oxford Nanopore and PacBio have higher error rates. 23 Until long 228 reads can be produced at high fidelity, researchers must continue to rely on combining long-and short-read The large number of contigs generated due to VI may be overwhelming for most researchers, and for 232 viral ecology studies, could lead to over-estimation of species richness for methods that use contig spectra to 233 infer richness, such as PHACCS or CatchAll. 24, 25, 26 This phenomenon may also impact studies differently 234 depending on the overall goal for generating viral sequence data. For example, some researchers may only be 235 concerned with generating a single major consensus genome, even when variants are detected in the data. The effects of VI could potentially be mitigated by running multiple assembly programs. A previous 243 study testing bioinformatics strategies for assembling viral NGS data found that employing sequential use of 244 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint de Bruijn graph and overlap-layout-consensus assemblers produced better assemblies. 15 We speculate that 245 this "ensemble strategy" 15 may perform better because the multiple assemblers complement one another by 246 having different VI PID thresholds. Future assembly approaches could also consider resolving the VI problem 247 by possibly discriminating the major and minor variant reads first (perhaps by coverage or SNP analysis), and 248 then assembling major and minor variant reads separately. This study aimed to understand how variants affect assembly. As an initial investigation, many 259 confounding factors were simplified for experimentation. Simulated variants studied here only depicted 260 periodic mutations, set at regular intervals. However, in real viral data, SNPs are never evenly distributed 261 across the genome, with zones of divergence and similarity. 31, 32 Other important factors which influence 262 genome assembly include sequencing error rates, presence of repetitive regions, and coverage depth. We 263 limited our experiments to keep these factors constant in order to investigate the sole effect of VI. Through 264 this exploration, we demonstrated that reads from related genome variants adversely affect de novo 265 assembly. As NGS and de novo assembly have become essential for generating full-length viral genomes, 266 future studies should investigate the combined effects of the number and relative proportion of minor 267 variants, as well as additional assembly factors (e.g., error rates) to supplement this work. 268 269 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. Simulated genomes were generated using custom scripts that randomly assign each nucleotide over a 284 designated genome length with a weighted distribution dependent on the GC content [Supplement Figure S1 ]. 285 The random genomes were then screened using NCBI BLAST to insure no similarity/identity existed to any 286 classified organism (i.e., no BLAST hits). These simulated genomes served as the initial variant genome (variant 287 1). To generate the mutated variant genomes (variant 2), a custom script was used to systematically introduce 288 evenly distributed random mutations at rates from 1 mutation in every 4 nucleotides (75% PID) to 1 mutation 289 in every 250 nucleotides (99.6% PID), incrementing by 1 nucleotide. 290 291 Following the generation of initial and mutated variant genomes, high-quality fastq reads were 292 generated using ART, 33 simulating Illumina MiSeq paired-end runs at 50X coverage with 250 nt reads, 293 DNA/RNA mean fragments size of 500, and quality score of 93. Fastq reads were combined in equal numbers 294 for the initial and mutated variants, and used as input for subsequent de novo assembly experiments 295 [Supplement Figure S1 ]. The same process was utilized to generate the artificial genomes, initial and mutated 296 variant genomes, and reads for each of the experiments. 297 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint Experiment 1: Analyzing simulated reads from variants using different de novo assembly programs 298 299 The simulated datasets containing reads from two variant genomes with nucleotide pairwise identity 300 ranging from 75%-99.6% were analyzed using 10 different genome assembly programs. Table S6 ]. The simulation settings for the reads were single-end reads, 250 nt read length, and 305 50X coverage. A total of 2,470 assemblies (247 datasets per genome X 10 assemblers) were analyzed 306 [Supplement Figure S1a ]. Cap3, and Geneious. The length of the longest contig produced from each assembly and the performance 353 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All authors contributed to the conceptualization, data analysis, preparation, and review of this manuscript. The authors declare no competing interests. 374 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint was the most apparent. When longer read lengths were used, the variant interference PID range was much 586 narrower than when shorter read lengths were used to build contigs. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint tended to be paired with Illumina more frequently compared to traditional Sanger sequencing. A small 641 number of studies even combined three or four different sequencing technologies (530 and 6 entries, 642 respectively) [Supplement Table S4 ]. Some users employed a combined approach to circumvent the inherent 643 flaws of one sequencing platform, particularly for genome finishing. 44 For example, after NGS has been used to 644 generate the majority of a RNA virus genome, RACE (Rapid amplification of cDNA ends) is typically performed 645 with Sanger to obtain the 5' or 3' termini. 45 reference-mapping assembly, the exact sequence assembly strategy used for these records is unknown, and 661 thus the contributions of both de novo assembly and reference recruitment are likely underestimated. 662 663 All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/815480 doi: bioRxiv preprint

projects that include this document

Unselected / annnotation		Selected / annnotation
CORD-19_bioRxiv_medRxiv_subset (0) CORD-19_All_docs (0) CORD-19-Sentences (280) CORD-PICO (2) CORD-19-PD-HP (2) CORD-19-PD-UBERON (0) CORD-19-PD-MONDO (76)

TAB JSON ListView MergeView

CORD-19:a7804d77d1a75810af0148d8fea8acbe2941f1b9 JSONTXT

projects that include this document

CORD-19:a7804d77d1a75810af0148d8fea8acbe2941f1b9 JSON TXT