PMC:7519301 / 34294-47901

Id Subject Object Predicate Lexical cue
T206 0-21 Sentence denotes Materials and Methods
T207 23-37 Sentence denotes Sequence Data.
T208 38-102 Sentence denotes Sequences were downloaded from GISAID (https://www.gisaid.org/).
T209 103-283 Sentence denotes A full list, along with the originating and submitting laboratories (GISAID_acknowledgment_table_20200518.xls), is available at https://www.hivresearch.org/publication-supplements.
T210 285-319 Sentence denotes Sequence Processing and Filtering.
T211 320-540 Sentence denotes All SARS-CoV-2 sequences available on GISAID as of May 18, 2020 (n = 27,989) were downloaded and deduplicated where possible, and those missing accurate dates (that is, only recording the month and/or year) were removed.
T212 541-622 Sentence denotes Sequences were processed using the Biostrings package (version 2.48.0) in R (49).
T213 623-818 Sentence denotes Sequences known to be linked through direct transmission were removed, and only the sample with the earliest date (chosen at random when multiple samples were taken on the same day) was retained.
T214 819-981 Sentence denotes Sequences were then aligned with MAFFT v7.467 using the --addfragments option to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) (50).
T215 982-1157 Sentence denotes Insertions relative to Wuhan-Hu-1 were removed, and the 5′ and 3′ ends of sequences (where coverage was low) were excised, resulting in an alignment consisting of the 10 ORFs.
T216 1158-1462 Sentence denotes Any sequences with less than 95% coverage of the ORFs (i.e., >5% gaps) were removed, and 30 homoplasic sites likely due to sequencing artifacts identified by de Maio et al. were masked (https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/archived_vcf/problematic_sites_sarsCov2.2020-05-27.vcf).
T217 1463-1784 Sentence denotes To identify individual sequences that were much more divergent than expected given their sampling date (which likely reflected sequencing artifacts rather than evolution), we obtained a tree using FastTree v2.10.1 compiled with double precision under the general time reversible (GTR) model with gamma heterogeneity (51).
T218 1785-1928 Sentence denotes This tree was rooted at the reference sequence, and root-to-tip regression was performed following TempEst using the ape package in R (52, 53).
T219 1929-2028 Sentence denotes Outliers were defined as sequences that had studentized residuals greater than 3, and were removed.
T220 2029-2158 Sentence denotes Sequences from the United Kingdom corresponded to nearly half of the sequences (n = 12,157/25,671, 47%) of this filtered dataset.
T221 2159-2460 Sentence denotes To avoid overrepresentation of the UK sequences and bias in subsequent analyses, we investigated the effect of downsampling sequences on the mean Hamming distance and identified the minimum number of sequences required to recover the mean corresponding to the full distribution (SI Appendix, Fig. S1).
T222 2461-2660 Sentence denotes A subsample of 5,000 sequences satisfied these criteria, and also ensured that there were fewer sequences from the United Kingdom than from the United States (n = 5,398), reflecting the epidemiology.
T223 2661-2783 Sentence denotes These 5,000 sequences were sampled randomly, with each sequence weighted in proportion to the number of UK sequences collected on its collection day.
T224 2784-2882 Sentence denotes After these filtering steps, the alignment used for subsequent analyses included 18,514 sequences.
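The root-to-tip outlier filter described in this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline (which used TempEst-style regression with the ape package in R). It assumes a simple linear regression of root-to-tip distance on sampling date, internally studentized residuals, and a one-sided cutoff. The function names and the toy data are hypothetical.

```python
import math

def studentized_residuals(x, y):
    """Internally studentized residuals of the simple linear regression y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    # Leverage of each observation in simple linear regression.
    leverage = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [r / (s * math.sqrt(1.0 - h)) for r, h in zip(residuals, leverage)]

def flag_outliers(dates, distances, threshold=3.0):
    """Indices of tips whose studentized residual exceeds the threshold.

    Assumption: only positive residuals (more divergent than expected) are flagged.
    """
    res = studentized_residuals(dates, distances)
    return [i for i, r in enumerate(res) if r > threshold]

# Toy data: 30 tips on a perfect clock, with one tip made far too divergent.
days = list(range(30))
root_to_tip = [0.001 * d for d in days]
root_to_tip[15] += 0.01
print(flag_outliers(days, root_to_tip))
```

Whether the original analysis used internally or externally studentized residuals, or absolute values, is not stated in the text; the choice here is an assumption.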
T225 2884-2915 Sentence denotes Global Phylogeny and Evolution.
T226 2916-3094 Sentence denotes The global phylogeny was reconstructed in FastTree v2.10.1 compiled with double precision under the GTR model with gamma heterogeneity (51), and rooted at the reference sequence.
T227 3095-3142 Sentence denotes The tree was visualized using ggtree in R (54).
T228 3143-3373 Sentence denotes Lineages were defined using PANGOLIN (Phylogenetic Assignment of Named Global Outbreak LINeages); lineages with >200 taxa as of the May 19 summary were highlighted in the tree (22) (https://github.com/cov-lineages/lineages).
T229 3374-3554 Sentence denotes The number of polymorphic sites was calculated as the number of sites which had at least one mutation relative to the reference sequence, Wuhan-Hu-1, ignoring gaps and ambiguities.
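The polymorphic-site count defined above can be illustrated with a short sketch. It assumes an alignment in which every sequence has the same length as the reference, and treats any character other than A, C, G, or T as a gap or ambiguity to be ignored. The function name and toy sequences are hypothetical.

```python
def count_polymorphic_sites(reference, sequences, valid="ACGT"):
    """Count sites where at least one sequence differs from the reference.

    Gaps and ambiguity codes (anything outside `valid`) are ignored.
    """
    count = 0
    for pos, ref_base in enumerate(reference.upper()):
        if ref_base not in valid:
            continue
        for seq in sequences:
            base = seq[pos].upper()
            if base in valid and base != ref_base:
                count += 1
                break  # one mutation is enough to call the site polymorphic
    return count

reference = "ACGTAC"
alignment = ["ACGTAC", "ATGNAC", "AC-TAA"]
print(count_polymorphic_sites(reference, alignment))
```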
T230 3556-3586 Sentence denotes Pairwise Distance Comparisons.
T231 3587-3742 Sentence denotes For each pair of sequences, we calculated the Hamming distance as the number of sites that are different after removing sites with ambiguities and/or gaps.
T232 3743-3909 Sentence denotes For computational efficiency, given the size of the alignment, this was implemented in parallel in C++, using Bazel (https://bazel.build/) to build on a Linux system.
T233 3910-4010 Sentence denotes This implementation is available to download at https://www.hivresearch.org/publication-supplements.
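For readers without access to the parallel C++ implementation, the distance itself can be sketched in a few lines. This is not the authors' code. It assumes that sites with gaps or ambiguities are dropped per pair of sequences (the text does not say whether removal is per pair or across the whole alignment), and the function names are hypothetical.

```python
from itertools import combinations

VALID = frozenset("ACGT")

def hamming_distance(a, b):
    """Number of differing sites, skipping sites where either sequence
    has a gap or an ambiguity code."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in VALID and y in VALID and x != y
    )

def pairwise_distances(sequences):
    """Hamming distance for every unordered pair, keyed by index pair."""
    return {
        (i, j): hamming_distance(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }

print(hamming_distance("ACGTN-A", "ATGTACC"))
```

The number of pairs grows quadratically with the number of sequences, which is why a compiled, parallel implementation is the practical choice for an alignment of this size.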
T234 4012-4040 Sentence denotes Subsampling Gene Alignments.
T235 4041-4121 Sentence denotes Alignments for each gene were subsampled for sequence and phylogenetic analyses.
T236 4122-4226 Sentence denotes Each gene alignment was randomly subsampled 100 times per collection date at 5%, 10%, 20%, 30%, and 40%.
T237 4227-4319 Sentence denotes When fewer than 10 sequences were available for a collection date, all sequences were taken.
T238 4320-4397 Sentence denotes Median Hamming distances were computed for each set of subsampled alignments.
T239 4398-4540 Sentence denotes These were bootstrapped 100,000 times, and 95% CIs were estimated and compared to the median Hamming distance for the fully sampled alignment.
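The subsampling and bootstrap steps in this section can be sketched as below. This is a hedged illustration: the rounding rule for the per-date sample size, the percentile method for the CI, and all function names are assumptions, and the default number of bootstrap replicates is kept small for speed (the text reports 100,000).

```python
import random
import statistics
from collections import defaultdict

def subsample_by_date(records, fraction, min_per_date=10, rng=None):
    """records: list of (sequence_id, collection_date).

    Samples `fraction` of the sequences from each collection date; dates
    with fewer than `min_per_date` sequences are kept whole.
    """
    rng = rng or random.Random()
    by_date = defaultdict(list)
    for seq_id, date in records:
        by_date[date].append(seq_id)
    chosen = []
    for date in sorted(by_date):
        ids = by_date[date]
        if len(ids) < min_per_date:
            chosen.extend(ids)
        else:
            k = max(1, round(fraction * len(ids)))  # rounding is an assumption
            chosen.extend(rng.sample(ids, k))
    return chosen

def bootstrap_median_ci(values, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the median of `values`."""
    rng = rng or random.Random()
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

records = [(f"a{i}", "2020-03-01") for i in range(20)]
records += [(f"b{i}", "2020-03-02") for i in range(3)]
picked = subsample_by_date(records, 0.10, rng=random.Random(1))
print(len(picked))
```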
T240 4542-4615 Sentence denotes Global and Site-Specific Nonsynonymous and Synonymous Substitution Rates.
T241 4616-4696 Sentence denotes Alignments subsampled at 10% 100 times were used to estimate substitution rates.
T242 4697-4893 Sentence denotes For the set of subsampled alignments for each gene, a mixed-effect likelihood method was used to estimate nonsynonymous (dN) and synonymous (dS) substitution rates globally and at each codon (29).
T243 4894-5116 Sentence denotes Maximum-likelihood phylogenies were constructed for each alignment using the software IQ-TREE (55) under a best-fit model determined with ModelFinder (56) to prime the dN and dS estimates before branch length optimization.
T244 5117-5171 Sentence denotes This step serves to expedite the optimization process.
T245 5172-5285 Sentence denotes Branch length optimization was done with a MG94 model [which is the only model available for this analysis (29)].
T246 5286-5489 Sentence denotes The proportion of each phylogeny evolving under neutral (or negative) selection was determined from the mixture density across lineages for each site, assuming different dN and dS along each branch (57).
T247 5490-5703 Sentence denotes On the same set of subsampled alignments and phylogenies, a fixed-effects likelihood method was used on internal branches to identify sites under pervasive diversifying selection and to estimate global dN/dS (58).
T248 5704-6013 Sentence denotes Known biases associated with calculating dN/dS on exponentially growing populations (59) were counterbalanced by subsampling phylogenies, as the typical approach to address this bias, which is to ignore terminal branches, would considerably diminish the power of the analysis to detect any significant result.
T249 6014-6384 Sentence denotes As P values from the fixed-effect likelihood method are uncorrected, results were not averaged over P values; rather, given that P value calculations are conservative for this analysis (58), sites were considered to be under pervasive diversifying selection if their P value was <0.1 in ≥50% of alignments, which would account for a typical 5% false discovery rate (58).
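The site-calling rule stated above (P < 0.1 in at least 50% of the subsampled alignments) can be expressed compactly. The sketch assumes the per-site P values have already been obtained from the fixed-effects likelihood analysis and are supplied as one dictionary per alignment; the data layout, function name, and example values are hypothetical.

```python
def sites_under_selection(p_values_by_alignment, p_threshold=0.1, min_fraction=0.5):
    """p_values_by_alignment: list of {site: p_value}, one dict per alignment.

    Returns the sorted sites whose P value is below `p_threshold` in at
    least `min_fraction` of the alignments.
    """
    n = len(p_values_by_alignment)
    hits = {}
    for p_values in p_values_by_alignment:
        for site, p in p_values.items():
            if p < p_threshold:
                hits[site] = hits.get(site, 0) + 1
    return sorted(site for site, count in hits.items() if count / n >= min_fraction)

# Made-up P values for three codon sites across four subsampled alignments.
example = [
    {5: 0.05, 10: 0.09, 614: 0.01},
    {5: 0.50, 10: 0.09, 614: 0.05},
    {5: 0.50, 10: 0.50, 614: 0.20},
    {5: 0.50, 10: 0.50, 614: 0.09},
]
print(sites_under_selection(example))
```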
T250 6386-6438 Sentence denotes Global and Gene-Specific Population Differentiation.
T251 6439-6527 Sentence denotes Alignments subsampled at 10% 100 times were used to estimate population differentiation.
T252 6528-6659 Sentence denotes The genetic differentiation of subpopulations within sampled sequences was calculated on each gene separately using Nei’s (30) GST.
T253 6660-6987 Sentence denotes Because comparisons between subpopulations of different sizes can bias genetic differentiation estimates (60), genetic differentiation was also calculated using Jost’s (31) D, which accounts for differences in genetic heterogeneity between subpopulations and is intended to correct for biases in the size of the subpopulations.
T254 6988-7059 Sentence denotes Both statistics were computed with the mmod package (32) in R (v3.6.1).
T255 7060-7162 Sentence denotes For each gene, statistics were calculated over 100 bootstrapped samples for each subsampled alignment.
T256 7163-7203 Sentence denotes Subpopulations were defined in two ways.
T257 7204-7363 Sentence denotes First, sequences originating from the initial outbreak in the Hubei province (30 sequences) were compared to all other sequences within a subsampled alignment.
T258 7364-7559 Sentence denotes Second, a 1-wk sliding window was designed to compare all sequences sampled prior to a collection date (subpopulation 1) to all sequences sampled after the same collection date (subpopulation 2).
T259 7560-7808 Sentence denotes The first collection date for subpopulation 1 was February 14, 2020, the week after the last sequence from the Hubei province was sampled (February 8, 2020). The window was designed to terminate when <30 sequences were available in subpopulation 2.
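To make the two differentiation statistics concrete, the sketch below computes them for a single site from the basic definitions: G_ST = (H_T - H_S) / H_T and D = [(H_T - H_S) / (1 - H_S)] * [n / (n - 1)], where H_S is the mean within-subpopulation heterozygosity, H_T the total heterozygosity, and n the number of subpopulations. It weights subpopulations equally and omits the sample-size corrections that estimators in packages such as mmod apply, so it is an illustration of the quantities rather than a reimplementation. Function names are hypothetical.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Relative frequency of each allele in one subpopulation."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}

def gst_and_jost_d(subpops):
    """Return (G_ST, D) for one site, given a list of allele lists."""
    freqs = [allele_frequencies(pop) for pop in subpops]
    n = len(freqs)
    alleles = set().union(*freqs)
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(1.0 - sum(f.get(a, 0.0) ** 2 for a in alleles) for f in freqs) / n
    # Expected heterozygosity of the pooled (equally weighted) population.
    h_t = 1.0 - sum((sum(f.get(a, 0.0) for f in freqs) / n) ** 2 for a in alleles)
    if h_t == 0.0:
        return 0.0, 0.0  # monomorphic site: no differentiation to measure
    gst = (h_t - h_s) / h_t
    d = ((h_t - h_s) / (1.0 - h_s)) * (n / (n - 1))
    return gst, d

print(gst_and_jost_d([["A"] * 4, ["G"] * 4]))  # fully differentiated
print(gst_and_jost_d([["A", "G"], ["A", "G"]]))  # identical subpopulations
```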
T260 7810-7867 Sentence denotes Time-Dependent Estimates of Phylogenetic Diversification.
T261 7868-8207 Sentence denotes Time-dependent estimates of phylogenetic diversification were measured by extracting the branches descending from each internal node (above the root) of each phylogeny and calculating the peak height (η) of the spectral density profile of the graph Laplacian of each subtree, which is a measure of the density of branching events (36, 37).
T262 8208-8322 Sentence denotes The code to perform the analysis is available for download at https://www.hivresearch.org/publication-supplements.
T263 8324-8344 Sentence denotes Simulation analyses.
T264 8345-8656 Sentence denotes Phylogenies were simulated using a time-forward branching process under constant birth rates (b(t) = b) and time-dependent birth rates (b(t) = b·e^(αt)) for b = 0.01, 0.03, 0.05, 0.07, and 0.09 and α = ±0.01, ±0.11, ±0.21, ±0.31, and ±0.41, for 20, 220, 420, 620, and 820 tips, and for 1, 11, 21, 31, and 41 time units.
T265 8657-8726 Sentence denotes Simulated phylogenies were downsampled at 0%, 10%, 30%, 50%, and 70%.
T266 8727-8777 Sentence denotes For each scenario, 100 phylogenies were simulated.
T267 8778-8899 Sentence denotes Time-dependent diversification (i.e., η across subtrees) was calculated for each phylogeny simulated under each scenario.
T268 8900-9018 Sentence denotes Simulations were conducted using the R packages RPANDA (R Phylogenetic ANalyses of DiversificAtion) (61) and ape (53).
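As a small worked example of the two birth-rate models, the sketch below evaluates the constant rate b(t) = b and the time-dependent rate b(t) = b·exp(αt), together with the expected number of lineages at time t. The expectation assumes a pure-birth process with no extinction, in which case E[N(t)] = N0·exp(∫ b(s) ds) with the integral taken from 0 to t. That assumption is ours and is not stated in the text, and the sketch does not reproduce the RPANDA simulations. Function names are hypothetical.

```python
import math

def birth_rate(t, b, alpha=0.0):
    """Per-lineage birth rate b(t) = b * exp(alpha * t); alpha = 0 is constant."""
    return b * math.exp(alpha * t)

def expected_lineages(t, b, alpha=0.0, n0=1):
    """Expected lineage count at time t under a pure-birth process (assumption)."""
    if alpha == 0.0:
        integral = b * t
    else:
        # Integral of b * exp(alpha * s) from 0 to t.
        integral = b * math.expm1(alpha * t) / alpha
    return n0 * math.exp(integral)

for alpha in (-0.11, 0.0, 0.11):
    print(alpha, expected_lineages(21, 0.05, alpha))
```

Positive α gives accelerating branching and negative α decelerating branching relative to the constant-rate case.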
T269 9020-9056 Sentence denotes Comparisons to SARS-CoV-2 phylogeny.
T270 9057-9304 Sentence denotes Phylogenies downsampled at 10% from the full (18,514 tips) SARS-CoV-2 genome phylogeny (following the subsampling strategy described above) were used to calculate the phylogenetic η for each subtree (above the root) for each downsampled phylogeny.
T271 9305-9470 Sentence denotes Neutral phylogenies were simulated under stochastic branching by randomly sampling from the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny.
T272 9471-9535 Sentence denotes This was iterated across all downsampled SARS-CoV-2 phylogenies.
T273 9536-9806 Sentence denotes Positive time-dependent phylogenies were simulated using a time-dependent process (b(t) = 0.01·e^(αt)) for α = 0.001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 1, with branch lengths restricted to the distribution of branch lengths from one downsampled SARS-CoV-2 phylogeny.
T274 9807-9882 Sentence denotes This was iterated across all downsampled SARS-CoV-2 phylogenies for each α.
T275 9883-9975 Sentence denotes Neutral and positive time-dependent phylogenies were simulated with a 10% sampling fraction.
T276 9976-10010 Sentence denotes Polytomies were randomly resolved.
T277 10011-10084 Sentence denotes Simulations were conducted using the R packages RPANDA (61) and ape (53).
T278 10086-10130 Sentence denotes Ancestral S Protein Sequence Reconstruction.
T279 10131-10533 Sentence denotes Ancestral S protein sequences were reconstructed from an amino acid alignment of 30 SARS-CoV-2 sequences sampled from the Hubei province, a coronavirus sampled from bat (Yunnan RaTG13), and six SARS-CoV-2-like coronaviruses sampled from pangolins using maximum posterior probability and returning a unique residue at each site assuming a Jones-Taylor-Thornton (JTT) model with gamma heterogeneity (62).
T280 10534-10610 Sentence denotes The JTT model was the most appropriate model available in the software (62).
T281 10611-10715 Sentence denotes The bat sequence was retrieved from GenBank, and the pangolin sequences were retrieved from GISAID (63).
T282 10716-10930 Sentence denotes A sliding window of 10 amino acids (and a step of 1 amino acid) was used to compare the cumulative number of mutations in the human−bat and human−bat−pangolin ancestors with respect to the human ancestral sequence.
T283 10931-11148 Sentence denotes Median values for each window were compared to a null window (computed as a normal distribution of 10 values with a mean equal to the mean value across the entire S protein, 0.046 mutations) using a one-tailed t test.
T284 11149-11268 Sentence denotes An alignment including the reconstructed sequences is available at https://www.hivresearch.org/publication-supplements.
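The sliding-window comparison described in this section can be sketched as follows, assuming two aligned amino acid sequences of equal length and counting a mutation wherever they differ. The function name and the toy sequences are hypothetical, and the statistical comparison against the null window is not reproduced here.

```python
def windowed_mutation_counts(ancestor, human_ancestor, window=10, step=1):
    """Number of amino acid differences in each sliding window."""
    if len(ancestor) != len(human_ancestor):
        raise ValueError("sequences must be aligned to the same length")
    diffs = [int(a != b) for a, b in zip(ancestor, human_ancestor)]
    return [
        sum(diffs[start:start + window])
        for start in range(0, len(diffs) - window + 1, step)
    ]

# Toy 12-residue sequences differing only at the last position.
print(windowed_mutation_counts("AAAAAAAAAAAT", "AAAAAAAAAAAA"))
```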
T285 11270-11314 Sentence denotes Prediction of CD4+ and CD8+ T Cell Epitopes.
T286 11315-11400 Sentence denotes CD4+ and CD8+ T cell epitopes were predicted for four SARS-CoV-2 structural proteins:
T287 11401-11516 Sentence denotes S (accession YP_009724390), N (accession YP_009724397), M (accession YP_009724393), and E (accession YP_009724392).
T288 11517-11723 Sentence denotes CD4+ T cell epitopes were predicted with a peptide length of 15 using NetMHCIIpan 4.0 (64), a server that uses artificial neural networks to predict binding of peptides to any MHC molecule of known sequence.
T289 11724-11967 Sentence denotes MHC class II HLA alleles of HLA-DQB1, plus the haplotypes of HLA-DPA1-DPB1 and HLA-DQA1-DPB, were selected for predictions if they had frequencies of >1/1,000 in known allele/haplotype distributions (http://17ihiw.org/17th-ihiw-ngs-hla-data/).
T290 11968-12079 Sentence denotes If multiple peptides had the same core, the peptide with the strongest binding score was selected for analysis.
T291 12080-12168 Sentence denotes CD8+ T cell epitopes were predicted using NetMHCpan 4.1 (64) with a peptide length of 9.
T292 12169-12423 Sentence denotes MHC class I HLA alleles of HLA-A, HLA-B, and HLA-C were selected if they were classified as common (frequency ≥ 1/10,000) in any of the populations in the database CIWD 3.0 (Common, Intermediate and Well-Documented HLA Alleles in World Populations) (65).
T293 12424-12536 Sentence denotes Epitopes predicted as strong binders (with predicted binding affinities below 50 nM) were selected for analyses.
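The two selection rules in this section (keep the strongest binder among peptides sharing a core, then keep only strong binders below 50 nM) can be sketched as a post-processing step. The record layout with `allele`, `core`, `peptide`, and `affinity_nm` keys is hypothetical and does not reflect the actual output format of the prediction servers; grouping cores per allele is also an assumption. The example values are invented.

```python
def select_strong_binders(predictions, max_affinity_nm=50.0):
    """Keep the strongest binder per (allele, core), then filter by affinity.

    Lower predicted affinity (nM) means stronger binding.
    """
    best = {}
    for pred in predictions:
        key = (pred["allele"], pred["core"])
        if key not in best or pred["affinity_nm"] < best[key]["affinity_nm"]:
            best[key] = pred
    return [p for p in best.values() if p["affinity_nm"] < max_affinity_nm]

# Invented predictions for illustration only.
predictions = [
    {"allele": "X1", "core": "AAAAAAAAA", "peptide": "P1", "affinity_nm": 30.0},
    {"allele": "X1", "core": "AAAAAAAAA", "peptide": "P2", "affinity_nm": 12.0},
    {"allele": "X1", "core": "CCCCCCCCC", "peptide": "P3", "affinity_nm": 75.0},
]
print([p["peptide"] for p in select_strong_binders(predictions)])
```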
T294 12538-12566 Sentence denotes T Cell Immunogenicity Index.
T295 12567-12807 Sentence denotes For each site in a predicted epitope, the immunogenicity index was defined as the sum of the frequency of the HLA alleles or haplotypes restricting the corresponding epitope (multiple epitopes can be predicted at a given site in a protein).
T296 12808-13148 Sentence denotes Total frequencies from CIWD 3.0 were used as the frequencies of the corresponding MHC class I HLA alleles (HLA-A, HLA-B, and HLA-C), and the global frequencies from http://17ihiw.org/17th-ihiw-ngs-hla-data/ were used as the frequencies of the corresponding MHC class II HLA alleles or haplotypes (HLA-DQB1, HLA-DPA1-DPB1, and HLA-DQA1-DPB).
T297 13149-13298 Sentence denotes This procedure was repeated using the frequencies of MHC alleles or haplotypes in different subpopulations listed in the above HLA frequency dataset.
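The per-site immunogenicity index defined above can be illustrated with a short sketch. Epitopes are given as (start, end, allele) with 0-based, end-exclusive coordinates, and each distinct allele is counted once per site even if several of its epitopes overlap there; the text does not specify this, so it is an assumption. The allele frequencies below are placeholders, not values from CIWD 3.0 or the 17th IHIW dataset.

```python
def immunogenicity_index(protein_length, epitopes, allele_frequency):
    """Sum of frequencies of the alleles restricting epitopes at each site.

    epitopes: list of (start, end, allele), 0-based start, end exclusive.
    allele_frequency: {allele: population frequency}.
    """
    alleles_at_site = [set() for _ in range(protein_length)]
    for start, end, allele in epitopes:
        for pos in range(start, end):
            alleles_at_site[pos].add(allele)
    return [
        sum(allele_frequency[a] for a in alleles) for alleles in alleles_at_site
    ]

# Placeholder frequencies and epitopes for a 12-residue toy protein.
frequencies = {"HLA-A*02:01": 0.2, "HLA-B*07:02": 0.1}
epitopes = [
    (0, 9, "HLA-A*02:01"),
    (1, 10, "HLA-A*02:01"),
    (3, 12, "HLA-B*07:02"),
]
index = immunogenicity_index(12, epitopes, frequencies)
print(index)
```

Repeating the calculation with subpopulation-specific frequency tables only requires passing a different `allele_frequency` dictionary.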
T298 13300-13321 Sentence denotes Statistical Analyses.
T299 13322-13409 Sentence denotes For comparisons of mean values in normally distributed data, Student’s t test was used.
T300 13410-13470 Sentence denotes When data were not normal, the Mann−Whitney U test was used.
T301 13471-13523 Sentence denotes Shapiro−Wilk tests were used to determine normality.
T302 13524-13607 Sentence denotes Differences in data distributions were estimated using the Kolmogorov−Smirnov test.