PMC:7519301 / 34614-35756

Annotations (Lectin_function)

Id Subject Object Predicate Lexical cue
T211 0-220 Sentence denotes All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212 221-302 Sentence denotes Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213 303-498 Sentence denotes Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214 499-661 Sentence denotes Sequences were then aligned with Mafft v7.467 using the -addfragments option to align to the reference sequence (Wuhan-Hu1, GISAID accession EPI_ISL_402125) (50).
T215 662-837 Sentence denotes Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216 838-1142 Sentence denotes Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
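
The filtering steps described in the sentences above can be illustrated with a minimal sketch. It is written in Python for illustration only; the source states that sequences were processed with the Biostrings package in R, so this is not the authors' code. The function names (has_full_date, earliest_per_cluster, coverage, mask_sites) and the record keys ("date", "cluster") are hypothetical. The sketch also assumes that dates are ISO-style strings, that only "-" characters count as gaps, and that the sites to mask are given as 1-based alignment positions.

```python
import random
import re
from datetime import date

# Assumption: a date is "accurate" only if it records year, month and day.
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def has_full_date(collection_date):
    """Return True for a valid YYYY-MM-DD date, False for month/year-only values."""
    if not FULL_DATE.match(collection_date):
        return False
    try:
        date.fromisoformat(collection_date)
    except ValueError:
        return False
    return True


def earliest_per_cluster(records, rng=None):
    """Keep one record per known transmission cluster.

    The earliest sample is retained; ties on the same day are broken at random.
    Records whose "cluster" is None are not linked to any cluster and are kept.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    kept = []
    clusters = {}
    for rec in records:
        if rec["cluster"] is None:
            kept.append(rec)
        else:
            clusters.setdefault(rec["cluster"], []).append(rec)
    for members in clusters.values():
        first_day = min(date.fromisoformat(m["date"]) for m in members)
        ties = [m for m in members if date.fromisoformat(m["date"]) == first_day]
        kept.append(rng.choice(ties))
    return kept


def coverage(aligned_seq):
    """Fraction of alignment columns that are not gaps (assumption: gap is '-')."""
    if not aligned_seq:
        return 0.0
    return 1.0 - aligned_seq.count("-") / len(aligned_seq)


def mask_sites(aligned_seq, sites_1based, mask_char="N"):
    """Replace the given 1-based alignment positions with mask_char."""
    chars = list(aligned_seq)
    for pos in sites_1based:
        if 1 <= pos <= len(chars):
            chars[pos - 1] = mask_char
    return "".join(chars)
```

Under these assumptions, an aligned sequence would be retained when coverage(seq) >= 0.95, and the positions listed in the problematic-sites VCF would be passed to mask_sites after being read from that file.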