Id | Subject | Object | Predicate | Lexical cue
T210 | 0-34 | Sentence | denotes | Sequence Processing and Filtering.
T211 | 35-255 | Sentence | denotes | All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212 | 256-337 | Sentence | denotes | Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213 | 338-533 | Sentence | denotes | Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214 | 534-696 | Sentence | denotes | Sequences were then aligned with Mafft v7.467 using the -addfragments option to align to the reference sequence (Wuhan-Hu1, GISAID accession EPI_ISL_402125) (50).
T215 | 697-872 | Sentence | denotes | Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216 | 873-1177 | Sentence | denotes | Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
T217 | 1178-1499 | Sentence | denotes | To identify individual sequences that were much more divergent than expected, given their sampling date, which likely reflected sequencing artifacts rather than evolution, we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51).
T218 | 1500-1643 | Sentence | denotes | This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53).
T219 | 1644-1743 | Sentence | denotes | Outliers were defined as sequences that had studentized residuals greater than 3, and were removed.
T220 | 1744-1873 | Sentence | denotes | Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset.
T221 | 1874-2175 | Sentence | denotes | To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1).
T222 | 2176-2375 | Sentence | denotes | A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology.
T223 | 2376-2498 | Sentence | denotes | These 5,000 sequences were sampled randomly, with weight proportional to the number of UK sequences collected on that day.
T224 | 2499-2597 | Sentence | denotes | After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences.
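
The two quantitative filters described in the annotated sentences (removal of sequences with less than 95% ORF coverage, and removal of root-to-tip regression outliers with studentized residuals greater than 3) can be illustrated with a minimal Python sketch. This is not the authors' code, which used R with the Biostrings and ape packages. The function names are hypothetical, and several choices are assumptions not stated in the text: only `-` is counted as a gap, sampling dates are supplied as plain numbers (for example, days since the first sample), residuals are externally studentized (the convention of R's `rstudent`), and the threshold is applied one-sided, exactly as worded.

```python
from math import sqrt


def has_sufficient_coverage(aligned_seq, min_coverage=0.95, gap_chars="-"):
    """Return True if at least min_coverage of alignment positions are non-gap.

    Assumption: only '-' counts as a gap; pass gap_chars="-N" to also
    treat ambiguous bases as missing.
    """
    if not aligned_seq:
        return False
    non_gap = sum(1 for base in aligned_seq if base not in gap_chars)
    return non_gap / len(aligned_seq) >= min_coverage


def studentized_residuals(dates, distances):
    """Externally studentized residuals of a simple linear regression.

    dates: numeric sampling dates (x); distances: root-to-tip distances (y).
    Assumes noisy, non-degenerate data (no leverage of exactly 1 and no
    perfect fit once a point is deleted).
    """
    n = len(dates)
    if n != len(distances) or n < 4:
        raise ValueError("need at least 4 paired observations")
    x_mean = sum(dates) / n
    y_mean = sum(distances) / n
    sxx = sum((x - x_mean) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(dates, distances)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(dates, distances)]
    sse = sum(e * e for e in residuals)

    result = []
    for x, e in zip(dates, residuals):
        leverage = 1.0 / n + (x - x_mean) ** 2 / sxx
        # Residual variance re-estimated with this observation left out
        # (n - 3 degrees of freedom: n - 1 points, 2 fitted parameters).
        s2_deleted = (sse - e * e / (1.0 - leverage)) / (n - 3)
        result.append(e / sqrt(s2_deleted * (1.0 - leverage)))
    return result


def find_outliers(dates, distances, threshold=3.0):
    """Indices of tips whose studentized residual exceeds the threshold."""
    return [i for i, t in enumerate(studentized_residuals(dates, distances))
            if t > threshold]


if __name__ == "__main__":
    # Toy data: a clock-like trend with small alternating noise and one tip
    # (index 5) that is far more divergent than its date predicts.
    days = list(range(10))
    dist = [2 * d + (0.1 if d % 2 == 0 else -0.1) for d in days]
    dist[5] += 5.0
    print(find_outliers(days, dist))
    print(has_sufficient_coverage("ACGT" * 24 + "----"))
```

In a real analysis the distances would come from a rooted phylogeny rather than a toy list. The one-sided test only flags tips that are more divergent than expected; a two-sided variant would compare the absolute value of each residual to the threshold instead.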