PMC:7519301 / 34579-37176

Id Subject Object Predicate Lexical cue
T210 0-34 Sentence denotes Sequence Processing and Filtering.
T211 35-255 Sentence denotes All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212 256-337 Sentence denotes Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213 338-533 Sentence denotes Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214 534-696 Sentence denotes Sequences were then aligned with MAFFT v7.467 using the -addfragments option to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) (50).
T215 697-872 Sentence denotes Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216 873-1177 Sentence denotes Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by De Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
T217 1178-1499 Sentence denotes To identify individual sequences that were much more divergent than expected, given their sampling date, which likely reflected sequencing artifacts rather than evolution, we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51).
T218 1500-1643 Sentence denotes This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53).
T219 1644-1743 Sentence denotes Outliers were defined as sequences that had studentized residuals greater than 3, and were removed.
T220 1744-1873 Sentence denotes Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset.
T221 1874-2175 Sentence denotes To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1).
T222 2176-2375 Sentence denotes A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology.
T223 2376-2498 Sentence denotes These 5,000 sequences were sampled randomly, with weight proportional to the number of UK sequences collected on that day.
T224 2499-2597 Sentence denotes After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences.
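
Two of the filters described in the sentences above (T216 and T219) can be illustrated with a minimal Python sketch. This is not the authors' code: they worked in R with Biostrings and ape, and every function name below is hypothetical. The sketch assumes that both "-" and "N" count as missing positions, that sampling dates are given as decimal numbers, that root-to-tip distances have already been read from a rooted tree, and that "studentized" means internally studentized residuals from a simple linear regression. The threshold is applied one-sidedly, as the text literally states.

```python
import math

# Assumption: alignment gaps and ambiguous N calls both count as missing data.
MISSING = frozenset("-Nn")


def coverage(aligned_seq):
    """Fraction of alignment columns that hold a called base."""
    if not aligned_seq:
        return 0.0
    called = sum(1 for base in aligned_seq if base not in MISSING)
    return called / len(aligned_seq)


def filter_by_coverage(alignment, min_coverage=0.95):
    """Keep sequences with at least min_coverage (drop those with >5% gaps)."""
    return {name: seq for name, seq in alignment.items()
            if coverage(seq) >= min_coverage}


def studentized_residuals(dates, distances):
    """Internally studentized residuals of root-to-tip distance on date."""
    n = len(dates)
    if n < 3 or n != len(distances):
        raise ValueError("need at least three paired observations")
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [y - (intercept + slope * x) for x, y in zip(dates, distances)]
    # Residual standard error with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    result = []
    for x, r in zip(dates, resid):
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        denom = sigma * math.sqrt(max(1.0 - leverage, 0.0))
        # A perfect fit or a leverage of 1 gives a zero denominator.
        result.append(r / denom if denom > 0 else 0.0)
    return result


def outlier_indices(dates, distances, threshold=3.0):
    """Indices of tips whose studentized residual exceeds the threshold."""
    scores = studentized_residuals(dates, distances)
    return [i for i, score in enumerate(scores) if score > threshold]
```

For example, `filter_by_coverage({"s1": "ACGT" * 10, "s2": "ACGT" * 8 + "-" * 8})` keeps only `s1`, because `s2` has 20% gaps. `outlier_indices` returns the positions of tips that sit far above the regression line, which would then be dropped from the alignment.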