PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T210	0-34	Sentence	denotes	Sequence Processing and Filtering.
T211	35-255	Sentence	denotes	All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212	256-337	Sentence	denotes	Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213	338-533	Sentence	denotes	Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214	534-696	Sentence	denotes	Sequences were then aligned with Mafft v7.467 using the -addfragments option to align to the reference sequence (Wuhan-Hu1, GISAID accession EPI_ISL_402125) (50).
T215	697-872	Sentence	denotes	Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216	873-1177	Sentence	denotes	Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
T217	1178-1499	Sentence	denotes	To identify individual sequences that were much more divergent than expected, given their sampling date, which likely reflected sequencing artifacts rather than evolution, we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51).
T218	1500-1643	Sentence	denotes	This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53).
T219	1644-1743	Sentence	denotes	Outliers were defined as sequences that had studentized residuals greater than 3, and were removed.
T220	1744-1873	Sentence	denotes	Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset.
T221	1874-2175	Sentence	denotes	To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1).
T222	2176-2375	Sentence	denotes	A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology.
T223	2376-2498	Sentence	denotes	These 5,000 sequences were sampled randomly, with weight proportional to the number of UK sequences collected on that day.
T224	2499-2597	Sentence	denotes	After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences.
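
The coverage filter (T216) and the root-to-tip outlier rule (T218, T219) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code, which used R with the ape package. The function names, the choice to count only "-" as a gap, and the use of internally studentized residuals from an ordinary least-squares fit are all assumptions; the annotated text does not say which variant of studentized residual was used.

```python
import math


def orf_coverage(aligned_seq, gap_chars="-"):
    """Fraction of alignment columns that are not gaps.

    Assumption: only '-' counts as a gap; the source does not say
    whether ambiguous bases such as 'N' were also counted.
    """
    if not aligned_seq:
        return 0.0
    missing = sum(1 for c in aligned_seq if c in gap_chars)
    return 1.0 - missing / len(aligned_seq)


def studentized_residuals(dates, distances):
    """Internally studentized residuals of the fit distance ~ date.

    dates: numeric sampling dates (e.g. days since the reference).
    distances: root-to-tip distances read from the rooted tree.
    """
    n = len(dates)
    if n < 3:
        raise ValueError("need at least three tips to estimate the residual variance")
    x_mean = sum(dates) / n
    y_mean = sum(distances) / n
    sxx = sum((x - x_mean) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(dates, distances)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    result = []
    for x, e in zip(dates, residuals):
        leverage = 1.0 / n + (x - x_mean) ** 2 / sxx
        result.append(e / (s * math.sqrt(1.0 - leverage)))
    return result


def keep_sequence(coverage, residual, min_coverage=0.95, max_residual=3.0):
    """Apply both thresholds quoted in the text: at least 95% coverage
    and a studentized residual no greater than 3."""
    return coverage >= min_coverage and residual <= max_residual
```

The rule is applied to the signed residual, matching the wording "greater than 3", so only sequences more divergent than expected for their date are dropped. Filtering on the absolute value instead would also remove unusually conserved sequences, which the text does not describe.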


PMC:7519301 / 34579-37176 JSON TXT 9 Projects
