PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T206	0-21	Sentence	denotes	Materials and Methods
T207	23-37	Sentence	denotes	Sequence Data.
T208	38-102	Sentence	denotes	Sequences were downloaded from GISAID (https://www.gisaid.org/).
T209	103-283	Sentence	denotes	A full list, along with the originating and submitting laboratories (GISAID_acknowledgment_table_20200518.xls), is available at https://www.hivresearch.org/publication-supplements.
T210	285-319	Sentence	denotes	Sequence Processing and Filtering.
T211	320-540	Sentence	denotes	All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212	541-622	Sentence	denotes	Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213	623-818	Sentence	denotes	Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214	819-981	Sentence	denotes	Sequences were then aligned with Mafft v7.467 using the -addfragments option to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) (50).
T215	982-1157	Sentence	denotes	Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216	1158-1462	Sentence	denotes	Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
T217	1463-1784	Sentence	denotes	To identify individual sequences that were much more divergent than expected, given their sampling date, which likely reflected sequencing artifacts rather than evolution, we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51).
T218	1785-1928	Sentence	denotes	This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53).
T219	1929-2028	Sentence	denotes	Outliers were defined as sequences that had studentized residuals greater than 3, and were removed.
T220	2029-2158	Sentence	denotes	Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset.
T221	2159-2460	Sentence	denotes	To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1).
T222	2461-2660	Sentence	denotes	A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology.
T223	2661-2783	Sentence	denotes	These 5,000 sequences were sampled randomly, with weight proportional to the number of UK sequences collected on that day.
T224	2784-2882	Sentence	denotes	After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences.
T225	2884-2915	Sentence	denotes	Global Phylogeny and Evolution.
T226	2916-3094	Sentence	denotes	The global phylogeny was reconstructed in FastTree v2.10.1 compiled with double precision under the GTR model with gamma heterogeneity (51), and rooted at the reference sequence.
T227	3095-3142	Sentence	denotes	The tree was visualized using ggtree in R (54).
T228	3143-3373	Sentence	denotes	Lineages were defined using PANGOLIN (Phylogenetic Assignment of Named Global Outbreak LINeages), with lineages with >200 taxa as of the May 19 summary being highlighted in the tree (22) (https://github.com/cov-lineages/lineages).
T229	3374-3554	Sentence	denotes	The number of polymorphic sites was calculated as the number of sites which had at least one mutation relative to the reference sequence, Wuhan-Hu-1, ignoring gaps and ambiguities.
T230	3556-3586	Sentence	denotes	Pairwise Distance Comparisons.
T231	3587-3742	Sentence	denotes	For each pair of sequences, we calculated the Hamming distance as the number of sites that are different after removing sites with ambiguities and/or gaps.
T232	3743-3909	Sentence	denotes	For computational efficiency, given the size of the alignment, this was implemented in parallel in C++, using Bazel (https://bazel.build/) to build on a Linux system.
T233	3910-4010	Sentence	denotes	This implementation is available to download at https://www.hivresearch.org/publication-supplements.
T234	4012-4040	Sentence	denotes	Subsampling Gene Alignments.
T235	4041-4121	Sentence	denotes	Alignments for each gene were subsampled for sequence and phylogenetic analyses.
T236	4122-4226	Sentence	denotes	Each gene alignment was randomly subsampled 100 times per collection date at 5%, 10%, 20%, 30%, and 40%.
T237	4227-4319	Sentence	denotes	When fewer than 10 sequences were available for a collection date, all sequences were taken.
T238	4320-4397	Sentence	denotes	Median Hamming distances were computed for each set of subsampled alignments.
T239	4398-4540	Sentence	denotes	These were bootstrapped 100,000 times, and 95% CIs were estimated and compared to the median Hamming distance for the fully sampled alignment.
T240	4542-4615	Sentence	denotes	Global and Site-Specific Nonsynonymous and Synonymous Substitution Rates.
T241	4616-4696	Sentence	denotes	Alignments subsampled at 10% 100 times were used to estimate substitution rates.
T242	4697-4893	Sentence	denotes	For the set of subsampled alignments for each gene, a mixed-effect likelihood method was used to estimate nonsynonymous (dN) and synonymous (dS) substitution rates globally and at each codon (29).
T243	4894-5116	Sentence	denotes	Maximum-likelihood phylogenies were constructed for each alignment using the software IQ-TREE (55) under a best-fit model determined with ModelFinder (56) to prime the dN and dS estimates before branch length optimization.
T244	5117-5171	Sentence	denotes	This step serves to expedite the optimization process.
T245	5172-5285	Sentence	denotes	Branch length optimization was done with a MG94 model [which is the only model available for this analysis (29)].
T246	5286-5489	Sentence	denotes	The proportion of each phylogeny evolving under neutral (or negative) selection was determined from the mixture density across lineages for each site, assuming different dN and dS along each branch (57).
T247	5490-5703	Sentence	denotes	On the same set of subsampled alignments and phylogenies, a fixed-effects likelihood method was used on internal branches to identify sites under pervasive diversifying selection and to estimate global dN/dS (58).
T248	5704-6013	Sentence	denotes	Known biases associated with calculating dN/dS on exponentially growing populations (59) were counterbalanced by subsampling phylogenies, as the typical approach to address this bias, which is to ignore terminal branches, would considerably diminish the power of the analysis to detect any significant result.
T249	6014-6384	Sentence	denotes	As P values from the fixed-effect likelihood method are uncorrected, results were not averaged over P values; rather, given that P value calculations are conservative for this analysis (58), sites were considered to be under pervasive diversifying selection if their P value was <0.1 in ≥50% of alignments, which would account for a typical 5% false discovery rate (58).
T250	6386-6438	Sentence	denotes	Global and Gene-Specific Population Differentiation.
T251	6439-6527	Sentence	denotes	Alignments subsampled at 10% 100 times were used to estimate population differentiation.
T252	6528-6659	Sentence	denotes	The genetic differentiation of subpopulations within sampled sequences was calculated on each gene separately using Nei’s (30) GST.
T253	6660-6987	Sentence	denotes	Because comparisons between subpopulations of different sizes can bias genetic differentiation estimates (60), genetic differentiation was also calculated using Jost’s (31) D, which accounts for differences in genetic heterogeneity between subpopulations and is intended to correct for biases in the size of the subpopulations.
T254	6988-7059	Sentence	denotes	Both statistics were computed with the mmod package (32) in R (v3.6.1).
T255	7060-7162	Sentence	denotes	For each gene, statistics were calculated over 100 bootstrapped samples for each subsampled alignment.
T256	7163-7203	Sentence	denotes	Subpopulations were defined in two ways.
T257	7204-7363	Sentence	denotes	First, sequences originating from the initial outbreak in the Hubei province (30 sequences) were compared to all other sequences within a subsampled alignment.
T258	7364-7559	Sentence	denotes	Second, a 1-wk sliding window was designed to compare all sequences sampled prior to a collection date (subpopulation 1) to all sequences sampled after the same collection date (subpopulation 2).
T259	7560-7808	Sentence	denotes	The first collection date for subpopulation 1 was February 14, 2020, the week after the last sequence from the Hubei province was sampled (February 8, 2020). The window was designed to terminate when <30 sequences were available in subpopulation 2.
T260	7810-7867	Sentence	denotes	Time-Dependent Estimates of Phylogenetic Diversification.
T261	7868-8207	Sentence	denotes	Time-dependent estimates of phylogenetic diversification were measured by extracting the branches descending from each internal node (above the root) of each phylogeny and calculating the peak height (η) of the spectral density profile of the graph Laplacian of each subtree, which is a measure of the density of branching events (36, 37).
T262	8208-8322	Sentence	denotes	The code to perform the analysis is available for download at https://www.hivresearch.org/publication-supplements.
T263	8324-8344	Sentence	denotes	Simulation analyses.
T264	8345-8656	Sentence	denotes	Phylogenies were simulated using a time-forward branching process under constant birth rates (b(t) = b) and time-dependent birth rates (b(t) = b·e^(αt)) for b = 0.01, 0.03, 0.05, 0.07, and 0.09 and α = ±0.01, ±0.11, ±0.21, ±0.31, and ±0.41, for 20, 220, 420, 620, and 820 tips, and for 1, 11, 21, 31, and 41 time units.
T265	8657-8726	Sentence	denotes	Simulated phylogenies were downsampled at 0%, 10%, 30%, 50%, and 70%.
T266	8727-8777	Sentence	denotes	For each scenario, 100 phylogenies were simulated.
T267	8778-8899	Sentence	denotes	Time-dependent diversification (i.e., η across subtrees) was calculated for each phylogeny simulated under each scenario.
T268	8900-9018	Sentence	denotes	Simulations were conducted using the R packages RPANDA (R Phylogenetic ANalyses of DiversificAtion) (61) and ape (53).
T269	9020-9056	Sentence	denotes	Comparisons to SARS-CoV-2 phylogeny.
T270	9057-9304	Sentence	denotes	Phylogenies downsampled at 10% from the full (18,514 tips) SARS-CoV-2 genome phylogeny (following the subsampling strategy described above) were used to calculate the phylogenetic η for each subtree (above the root) for each downsampled phylogeny.
T271	9305-9470	Sentence	denotes	Neutral phylogenies were simulated under stochastic branching by randomly sampling from the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny.
T272	9471-9535	Sentence	denotes	This was iterated across all downsampled SARS-CoV-2 phylogenies.
T273	9536-9806	Sentence	denotes	Positive time-dependent phylogenies were simulated using a time-dependent process (b(t) = 0.01·e^(αt)) for α = 0.001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 1, with branch lengths restricted to the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny.
T274	9807-9882	Sentence	denotes	This was iterated across all downsampled SARS-CoV-2 phylogenies for each α.
T275	9883-9975	Sentence	denotes	Neutral and positive time-dependent phylogenies were simulated with a 10% sampling fraction.
T276	9976-10010	Sentence	denotes	Polytomies were randomly resolved.
T277	10011-10084	Sentence	denotes	Simulations were conducted using the R packages RPANDA (61) and ape (53).
T278	10086-10130	Sentence	denotes	Ancestral S Protein Sequence Reconstruction.
T279	10131-10533	Sentence	denotes	Ancestral S protein sequences were reconstructed from an amino acid alignment of 30 SARS-CoV-2 sequences sampled from the Hubei province, a coronavirus sampled from bat (Yunnan RaTG13), and six SARS-CoV-2-like coronaviruses sampled from pangolins using maximum posterior probability and returning a unique residue at each site assuming a Jones-Taylor-Thornton (JTT) model with gamma heterogeneity (62).
T280	10534-10610	Sentence	denotes	The JTT model was the most appropriate model available in the software (62).
T281	10611-10715	Sentence	denotes	The bat sequence was retrieved from GenBank, and the pangolin sequences were retrieved from GISAID (63).
T282	10716-10930	Sentence	denotes	A sliding window of 10 amino acids (and a step of 1 amino acid) was used to compare the cumulative number of mutations in the human−bat and human−bat−pangolin ancestors with respect to the human ancestral sequence.
T283	10931-11148	Sentence	denotes	Median values for each window were compared to a null window (computed as a normal distribution of 10 values with a mean equal to the mean value across the entire S protein, 0.046 mutations) using a one-tailed t test.
T284	11149-11268	Sentence	denotes	An alignment including the reconstructed sequences is available at https://www.hivresearch.org/publication-supplements.
T285	11270-11314	Sentence	denotes	Prediction of CD4+ and CD8+ T Cell Epitopes.
T286	11315-11400	Sentence	denotes	CD4+ and CD8+ T cell epitopes were predicted for four SARS-CoV-2 structural proteins:
T287	11401-11516	Sentence	denotes	S (accession YP_009724390), N (accession YP_009724397), M (accession YP_009724393), and E (accession YP_009724392).
T288	11517-11723	Sentence	denotes	CD4+ T cell epitopes were predicted using a server that predicts binding of peptides to any MHC molecule of known sequence using artificial neural networks, NetMHCIIPan 4.0 (64) with a peptide length of 15.
T289	11724-11967	Sentence	denotes	MHC class II HLA alleles of HLA-DQB1, plus the haplotypes of HLA-DPA1-DPB1 and HLA-DQA1-DPB, were selected for predictions if they had frequencies of >1/1,000 in known allele/haplotype distributions (http://17ihiw.org/17th-ihiw-ngs-hla-data/).
T290	11968-12079	Sentence	denotes	If multiple peptides had the same core, the peptide with the strongest binding score was selected for analysis.
T291	12080-12168	Sentence	denotes	CD8+ T cell epitopes were predicted using NetMHCPan 4.1 (64) with a peptide length of 9.
T292	12169-12423	Sentence	denotes	MHC class I HLA alleles of HLA-A, HLA-B, and HLA-C were selected if they were classified as common (frequency ≥ 1/10,000) in any of the populations in the database CIWD 3.0 (Common, Intermediate and Well-Documented HLA Alleles in World Populations) (65).
T293	12424-12536	Sentence	denotes	Epitopes predicted as strong binders (with predicted binding affinities below 50 nM) were selected for analyses.
T294	12538-12566	Sentence	denotes	T Cell Immunogenicity Index.
T295	12567-12807	Sentence	denotes	For each site in a predicted epitope, the immunogenicity index was defined as the sum of the frequency of the HLA alleles or haplotypes restricting the corresponding epitope (multiple epitopes can be predicted at a given site in a protein).
T296	12808-13148	Sentence	denotes	Total frequencies from CIWD 3.0 were used as the frequencies of the corresponding MHC class I HLA alleles (HLA-A, HLA-B, and HLA-C), and the global frequencies from http://17ihiw.org/17th-ihiw-ngs-hla-data/ were used as the frequencies of the corresponding MHC class II HLA alleles or haplotypes (HLA-DQB1, HLA-DPA1-DPB1, and HLA-DQA1-DPB).
T297	13149-13298	Sentence	denotes	This procedure was repeated using the frequencies of MHC alleles or haplotypes in different subpopulations listed in the above HLA frequency dataset.
T298	13300-13321	Sentence	denotes	Statistical Analyses.
T299	13322-13409	Sentence	denotes	For comparisons of mean values in normally distributed data, Student’s t test was used.
T300	13410-13470	Sentence	denotes	When data were not normal, the Mann−Whitney U test was used.
T301	13471-13523	Sentence	denotes	Shapiro−Wilk tests were used to determine normality.
T302	13524-13607	Sentence	denotes	Differences in data distributions were estimated using the Kolmogorov−Smirnov test.
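
The pairwise Hamming distance described in sentence T231 can be sketched as follows. This is a minimal Python illustration, not the authors' parallel C++ implementation mentioned in T232. It assumes that sites are excluded pair by pair whenever either sequence carries a gap or an ambiguity code (the source does not say whether exclusion is per pair or alignment-wide), and the names `hamming_distance`, `pairwise_distances`, and the toy alignment are hypothetical.

```python
from itertools import combinations

# Assumption: only unambiguous nucleotides count; gaps ("-") and IUPAC
# ambiguity codes such as "N" cause the site to be skipped for that pair.
UNAMBIGUOUS = frozenset("ACGT")


def hamming_distance(seq_a, seq_b):
    """Count differing sites between two aligned sequences, skipping any
    site where either sequence has a gap or an ambiguity code."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    distance = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS and a != b:
            distance += 1
    return distance


def pairwise_distances(alignment):
    """Return {(name_i, name_j): distance} for every unordered pair of
    sequences in a dict mapping sequence name to aligned sequence."""
    return {
        (name_i, name_j): hamming_distance(alignment[name_i], alignment[name_j])
        for name_i, name_j in combinations(sorted(alignment), 2)
    }


# Toy alignment (hypothetical data, not taken from the study).
toy = {
    "ref": "ACGTACGT",
    "s1": "ACGTTCGT",  # one substitution relative to ref
    "s2": "ACNTTC-T",  # the N and the gap are skipped in every comparison
}
distances = pairwise_distances(toy)
```

In this toy case `s1` and `s2` end up at distance 0 because the only sites where they differ hold an ambiguity code or a gap. For an alignment of about 18,000 genomes the number of pairs is on the order of 10^8, which is consistent with the source's choice of a compiled, parallel implementation.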


PMC:7519301 / 34294-47901
