Id |
Subject |
Object |
Predicate |
Lexical cue |
T206 |
0-21 |
Sentence |
denotes |
Materials and Methods |
T207 |
23-37 |
Sentence |
denotes |
Sequence Data. |
T208 |
38-102 |
Sentence |
denotes |
Sequences were downloaded from GISAID (https://www.gisaid.org/). |
T209 |
103-283 |
Sentence |
denotes |
A full list, along with the originating and submitting laboratories (GISAID_acknowledgment_table_20200518.xls), is available at https://www.hivresearch.org/publication-supplements. |
T210 |
285-319 |
Sentence |
denotes |
Sequence Processing and Filtering. |
T211 |
320-540 |
Sentence |
denotes |
All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed. |
T212 |
541-622 |
Sentence |
denotes |
Sequences were processed using the Biostrings package (version 2.48.0) in R (49). |
T213 |
623-818 |
Sentence |
denotes |
Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained. |
T214 |
819-981 |
Sentence |
denotes |
Sequences were then aligned with Mafft v7.467 using the --addfragments option to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) (50). |
T215 |
982-1157 |
Sentence |
denotes |
Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs. |
T216 |
1158-1462 |
Sentence |
denotes |
Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf). |
T217 |
1463-1784 |
Sentence |
denotes |
To identify individual sequences that were much more divergent than expected, given their sampling date, which likely reflected sequencing artifacts rather than evolution, we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51). |
T218 |
1785-1928 |
Sentence |
denotes |
This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53). |
T219 |
1929-2028 |
Sentence |
denotes |
Outliers were defined as sequences that had studentized residuals greater than 3, and were removed. |
T220 |
2029-2158 |
Sentence |
denotes |
Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset. |
T221 |
2159-2460 |
Sentence |
denotes |
To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1). |
T222 |
2461-2660 |
Sentence |
denotes |
A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology. |
T223 |
2661-2783 |
Sentence |
denotes |
These 5,000 sequences were sampled randomly, with weight proportional to the number of UK sequences collected on that day. |
T224 |
2784-2882 |
Sentence |
denotes |
After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences. |
T225 |
2884-2915 |
Sentence |
denotes |
Global Phylogeny and Evolution. |
T226 |
2916-3094 |
Sentence |
denotes |
The global phylogeny was reconstructed in FastTree v2.10.1 compiled with double precision under the GTR model with gamma heterogeneity (51), and rooted at the reference sequence. |
T227 |
3095-3142 |
Sentence |
denotes |
The tree was visualized using ggtree in R (54). |
T228 |
3143-3373 |
Sentence |
denotes |
Lineages were defined using PANGOLIN (Phylogenetic Assignment of Named Global Outbreak LINeages); lineages with >200 taxa as of the May 19 summary were highlighted in the tree (22) (https://github.com/cov-lineages/lineages). |
T229 |
3374-3554 |
Sentence |
denotes |
The number of polymorphic sites was calculated as the number of sites which had at least one mutation relative to the reference sequence, Wuhan-Hu-1, ignoring gaps and ambiguities. |
T230 |
3556-3586 |
Sentence |
denotes |
Pairwise Distance Comparisons. |
T231 |
3587-3742 |
Sentence |
denotes |
For each pair of sequences, we calculated the Hamming distance as the number of sites that are different after removing sites with ambiguities and/or gaps. |
T232 |
3743-3909 |
Sentence |
denotes |
For computational efficiency, given the size of the alignment, this was implemented in parallel in C++, using Bazel (https://bazel.build/) to build on a Linux system. |
T233 |
3910-4010 |
Sentence |
denotes |
This implementation is available to download at https://www.hivresearch.org/publication-supplements. |
T234 |
4012-4040 |
Sentence |
denotes |
Subsampling Gene Alignments. |
T235 |
4041-4121 |
Sentence |
denotes |
Alignments for each gene were subsampled for sequence and phylogenetic analyses. |
T236 |
4122-4226 |
Sentence |
denotes |
Each gene alignment was randomly subsampled 100 times per collection date at 5%, 10%, 20%, 30%, and 40%. |
T237 |
4227-4319 |
Sentence |
denotes |
When fewer than 10 sequences were available for a collection date, all sequences were taken. |
T238 |
4320-4397 |
Sentence |
denotes |
Median Hamming distances were computed for each set of subsampled alignments. |
T239 |
4398-4540 |
Sentence |
denotes |
These were bootstrapped 100,000 times, and 95% CIs were estimated and compared to the median Hamming distance for the fully sampled alignment. |
T240 |
4542-4615 |
Sentence |
denotes |
Global and Site-Specific Nonsynonymous and Synonymous Substitution Rates. |
T241 |
4616-4696 |
Sentence |
denotes |
Alignments subsampled at 10% 100 times were used to estimate substitution rates. |
T242 |
4697-4893 |
Sentence |
denotes |
For the set of subsampled alignments for each gene, a mixed-effect likelihood method was used to estimate nonsynonymous (dN) and synonymous (dS) substitution rates globally and at each codon (29). |
T243 |
4894-5116 |
Sentence |
denotes |
Maximum-likelihood phylogenies were constructed for each alignment using the software IQ-TREE (55) under a best-fit model determined with ModelFinder (56) to prime the dN and dS estimates before branch length optimization. |
T244 |
5117-5171 |
Sentence |
denotes |
This step serves to expedite the optimization process. |
T245 |
5172-5285 |
Sentence |
denotes |
Branch length optimization was done with a MG94 model [which is the only model available for this analysis (29)]. |
T246 |
5286-5489 |
Sentence |
denotes |
The proportion of each phylogeny evolving under neutral (or negative) selection was determined from the mixture density across lineages for each site, assuming different dN and dS along each branch (57). |
T247 |
5490-5703 |
Sentence |
denotes |
On the same set of subsampled alignments and phylogenies, a fixed-effects likelihood method was used on internal branches to identify sites under pervasive diversifying selection and to estimate global dN/dS (58). |
T248 |
5704-6013 |
Sentence |
denotes |
Known biases associated with calculating dN/dS on exponentially growing populations (59) were counterbalanced by subsampling phylogenies, as the typical approach to address this bias, which is to ignore terminal branches, would considerably diminish the power of the analysis to detect any significant result. |
T249 |
6014-6384 |
Sentence |
denotes |
As P values from the fixed-effects likelihood method are uncorrected, results were not averaged over P values; rather, given that P value calculations are conservative for this analysis (58), sites were considered to be under pervasive diversifying selection if their P value was <0.1 in ≥50% of alignments, which would account for a typical 5% false discovery rate (58). |
T250 |
6386-6438 |
Sentence |
denotes |
Global and Gene-Specific Population Differentiation. |
T251 |
6439-6527 |
Sentence |
denotes |
Alignments subsampled at 10% 100 times were used to estimate population differentiation. |
T252 |
6528-6659 |
Sentence |
denotes |
The genetic differentiation of subpopulations within sampled sequences was calculated on each gene separately using Nei’s (30) GST. |
T253 |
6660-6987 |
Sentence |
denotes |
Because comparisons between subpopulations of different sizes can bias genetic differentiation estimates (60), genetic differentiation was also calculated using Jost’s (31) D, which accounts for differences in genetic heterogeneity between subpopulations and is intended to correct for biases in the size of the subpopulations. |
T254 |
6988-7059 |
Sentence |
denotes |
Both statistics were computed with the mmod package (32) in R (v3.6.1). |
T255 |
7060-7162 |
Sentence |
denotes |
For each gene, statistics were calculated over 100 bootstrapped samples for each subsampled alignment. |
T256 |
7163-7203 |
Sentence |
denotes |
Subpopulations were defined in two ways. |
T257 |
7204-7363 |
Sentence |
denotes |
First, sequences originating from the initial outbreak in the Hubei province (30 sequences) were compared to all other sequences within a subsampled alignment. |
T258 |
7364-7559 |
Sentence |
denotes |
Second, a 1-wk sliding window was designed to compare all sequences sampled prior to a collection date (subpopulation 1) to all sequences sampled after the same collection date (subpopulation 2). |
T259 |
7560-7808 |
Sentence |
denotes |
The first collection date for subpopulation 1 was February 14, 2020, the week after the last sequence from the Hubei province was sampled (February 8, 2020). The window was designed to terminate when <30 sequences were available in subpopulation 2. |
T260 |
7810-7867 |
Sentence |
denotes |
Time-Dependent Estimates of Phylogenetic Diversification. |
T261 |
7868-8207 |
Sentence |
denotes |
Time-dependent estimates of phylogenetic diversification were measured by extracting the branches descending from each internal node (above the root) of each phylogeny and calculating the peak height (η) of the spectral density profile of the graph Laplacian of each subtree, which is a measure of the density of branching events (36, 37). |
T262 |
8208-8322 |
Sentence |
denotes |
The code to perform the analysis is available for download at https://www.hivresearch.org/publication-supplements. |
T263 |
8324-8344 |
Sentence |
denotes |
Simulation analyses. |
T264 |
8345-8656 |
Sentence |
denotes |
Phylogenies were simulated using a time-forward branching process under constant birth rates (b(t) = b) and time-dependent birth rates (b(t) = b e^{αt}) for b = 0.01, 0.03, 0.05, 0.07, and 0.09 and α = ±0.01, ±0.11, ±0.21, ±0.31, and ±0.41, for 20, 220, 420, 620, and 820 tips, and for 1, 11, 21, 31, and 41 time units. |
T265 |
8657-8726 |
Sentence |
denotes |
Simulated phylogenies were downsampled at 0%, 10%, 30%, 50%, and 70%. |
T266 |
8727-8777 |
Sentence |
denotes |
For each scenario, 100 phylogenies were simulated. |
T267 |
8778-8899 |
Sentence |
denotes |
Time-dependent diversification (i.e., η across subtrees) was calculated for each phylogeny simulated under each scenario. |
T268 |
8900-9018 |
Sentence |
denotes |
Simulations were conducted using the R packages RPANDA (R Phylogenetic ANalyses of DiversificAtion) (61) and ape (53). |
T269 |
9020-9056 |
Sentence |
denotes |
Comparisons to SARS-CoV-2 phylogeny. |
T270 |
9057-9304 |
Sentence |
denotes |
Phylogenies downsampled at 10% from the full (18,514 tips) SARS-CoV-2 genome phylogeny (following the subsampling strategy described above) were used to calculate the phylogenetic η for each subtree (above the root) for each downsampled phylogeny. |
T271 |
9305-9470 |
Sentence |
denotes |
Neutral phylogenies were simulated under stochastic branching by randomly sampling from the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny. |
T272 |
9471-9535 |
Sentence |
denotes |
This was iterated across all downsampled SARS-CoV-2 phylogenies. |
T273 |
9536-9806 |
Sentence |
denotes |
Positive time-dependent phylogenies were simulated using a time-dependent process (b(t) = 0.01 e^{αt}) for α = 0.001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 1, with branch lengths restricted to the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny. |
T274 |
9807-9882 |
Sentence |
denotes |
This was iterated across all downsampled SARS-CoV-2 phylogenies for each α. |
T275 |
9883-9975 |
Sentence |
denotes |
Neutral and positive time-dependent phylogenies were simulated with a 10% sampling fraction. |
T276 |
9976-10010 |
Sentence |
denotes |
Polytomies were randomly resolved. |
T277 |
10011-10084 |
Sentence |
denotes |
Simulations were conducted using the R packages RPANDA (61) and ape (53). |
T278 |
10086-10130 |
Sentence |
denotes |
Ancestral S Protein Sequence Reconstruction. |
T279 |
10131-10533 |
Sentence |
denotes |
Ancestral S protein sequences were reconstructed from an amino acid alignment of 30 SARS-CoV-2 sequences sampled from the Hubei province, a coronavirus sampled from bat (Yunnan RaTG13), and six SARS-CoV-2-like coronaviruses sampled from pangolins using maximum posterior probability and returning a unique residue at each site assuming a Jones-Taylor-Thornton (JTT) model with gamma heterogeneity (62). |
T280 |
10534-10610 |
Sentence |
denotes |
The JTT model was the most appropriate model available in the software (62). |
T281 |
10611-10715 |
Sentence |
denotes |
The bat sequence was retrieved from GenBank, and the pangolin sequences were retrieved from GISAID (63). |
T282 |
10716-10930 |
Sentence |
denotes |
A sliding window of 10 amino acids (and a step of 1 amino acid) was used to compare the cumulative number of mutations in the human−bat and human−bat−pangolin ancestors with respect to the human ancestral sequence. |
T283 |
10931-11148 |
Sentence |
denotes |
Median values for each window were compared to a null window (computed as a normal distribution of 10 values with a mean equal to the mean value across the entire S protein, 0.046 mutations) using a one-tailed t test. |
T284 |
11149-11268 |
Sentence |
denotes |
An alignment including the reconstructed sequences is available at https://www.hivresearch.org/publication-supplements. |
T285 |
11270-11314 |
Sentence |
denotes |
Prediction of CD4+ and CD8+ T Cell Epitopes. |
T286 |
11315-11400 |
Sentence |
denotes |
CD4+ and CD8+ T cell epitopes were predicted for four SARS-CoV-2 structural proteins: |
T287 |
11401-11516 |
Sentence |
denotes |
S (accession YP_009724390), N (accession YP_009724397), M (accession YP_009724393), and E (accession YP_009724392). |
T288 |
11517-11723 |
Sentence |
denotes |
CD4+ T cell epitopes with a peptide length of 15 were predicted using NetMHCIIpan 4.0 (64), a server that uses artificial neural networks to predict binding of peptides to any MHC molecule of known sequence. |
T289 |
11724-11967 |
Sentence |
denotes |
MHC class II HLA alleles of HLA-DQB1, plus the haplotypes of HLA-DPA1-DPB1 and HLA-DQA1-DPB, were selected for predictions if they had frequencies of >1/1,000 in known allele/haplotype distributions (http://17ihiw.org/17th-ihiw-ngs-hla-data/). |
T290 |
11968-12079 |
Sentence |
denotes |
If multiple peptides had the same core, the peptide with the strongest binding score was selected for analysis. |
T291 |
12080-12168 |
Sentence |
denotes |
CD8+ T cell epitopes were predicted using NetMHCpan 4.1 (64) with a peptide length of 9. |
T292 |
12169-12423 |
Sentence |
denotes |
MHC class I HLA alleles of HLA-A, HLA-B, and HLA-C were selected if they were classified as common (frequency ≥ 1/10,000) in any of the populations in the database CIWD 3.0 (Common, Intermediate and Well-Documented HLA Alleles in World Populations) (65). |
T293 |
12424-12536 |
Sentence |
denotes |
Epitopes predicted as strong binders (with predicted binding affinities below 50 nM) were selected for analyses. |
T294 |
12538-12566 |
Sentence |
denotes |
T Cell Immunogenicity Index. |
T295 |
12567-12807 |
Sentence |
denotes |
For each site in a predicted epitope, the immunogenicity index was defined as the sum of the frequency of the HLA alleles or haplotypes restricting the corresponding epitope (multiple epitopes can be predicted at a given site in a protein). |
T296 |
12808-13148 |
Sentence |
denotes |
Total frequencies from CIWD 3.0 were used as the frequencies of the corresponding MHC class I HLA alleles (HLA-A, HLA-B, and HLA-C), and the global frequencies from http://17ihiw.org/17th-ihiw-ngs-hla-data/ were used as the frequencies of the corresponding MHC class II HLA alleles or haplotypes (HLA-DQB1, HLA-DPA1-DPB1, and HLA-DQA1-DPB). |
T297 |
13149-13298 |
Sentence |
denotes |
This procedure was repeated using the frequencies of MHC alleles or haplotypes in different subpopulations listed in the above HLA frequency dataset. |
T298 |
13300-13321 |
Sentence |
denotes |
Statistical Analyses. |
T299 |
13322-13409 |
Sentence |
denotes |
For comparisons of mean values in normally distributed data, Student’s t test was used. |
T300 |
13410-13470 |
Sentence |
denotes |
When data were not normal, the Mann−Whitney U test was used. |
T301 |
13471-13523 |
Sentence |
denotes |
Shapiro−Wilk tests were used to determine normality. |
T302 |
13524-13607 |
Sentence |
denotes |
Differences in data distributions were estimated using the Kolmogorov−Smirnov test. |
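
The pairwise Hamming distance described under Pairwise Distance Comparisons can be illustrated with a minimal Python sketch. This is not the authors' parallel C++ implementation. The function names, the toy alignment, and the choice to drop ambiguous or gapped sites per pair (rather than across the whole alignment) are assumptions made for illustration, as is treating only A, C, G, and T as unambiguous.

```python
from itertools import combinations

# Assumption: only A, C, G and T count as unambiguous; gaps ("-"), N and
# IUPAC ambiguity codes are treated as missing data.
VALID_BASES = frozenset("ACGT")


def hamming_distance(seq_a, seq_b):
    """Count differing sites, skipping sites where either sequence has
    a gap or an ambiguous character (pairwise deletion, an assumption)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_distances(alignment):
    """Return {(name_1, name_2): distance} for every pair of sequences
    in a dict mapping sequence names to aligned sequences."""
    names = sorted(alignment)
    return {
        (m, n): hamming_distance(alignment[m], alignment[n])
        for m, n in combinations(names, 2)
    }


# Hypothetical toy alignment, not data from the study.
toy_alignment = {
    "ref": "ACGTACGT",
    "s1": "ACGTTCGT",
    "s2": "AC-TNCAA",
}
distances = pairwise_distances(toy_alignment)
mean_distance = sum(distances.values()) / len(distances)
```

In this toy example, the gap and the N in `s2` are skipped in every comparison involving `s2`, so only its last two sites contribute to its distances. For an alignment of thousands of genomes the quadratic number of pairs makes a compiled, parallel implementation such as the one described in the text the more practical choice.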