Results

Limited Diversity across 18,514 SARS-CoV-2 Genomes.
To characterize SARS-CoV-2 diversification since the beginning of the epidemic, we aligned 27,977 SARS-CoV-2 genome sequences isolated from infected individuals in 84 countries. The alignment was curated to retain independent sequences that covered over 95% of the ORFs. In addition, because sequences from the United Kingdom constituted 47% of the dataset (n = 12,157), we sampled a representative set of 5,000 UK sequences, yielding a final dataset of 18,514 SARS-CoV-2 genomes (SI Appendix, Fig. S1 and Fig. 1A).
Fig. 1.  SARS-CoV-2 diversity across 18,514 genomes. (A) Distribution representing the location and date of sample collection. (B) Location and frequency of sites with polymorphisms across the genome. Proportion of sequences that showed polymorphisms compared to the reference sequence, Wuhan-Hu-1 (GISAID: EPI_ISL_402215, GenBank: NC_045512). ORFs are shown in gray for nonstructural proteins and in color for structural proteins (S, purple; E, blue; M, green; N, red). (C) Global phylogeny of 18,514 independent genome sequences. The tree was rooted at the reference sequence, Wuhan-Hu-1, and tips are colored by collection location. The scale indicates the distance corresponding to one substitution. Lineages are labeled following PANGOLIN (22). There were 7,559 polymorphic sites (that is, sites where at least one sequence has a change relative to the reference sequence) across the genome (total length: 29,409 nucleotides). Most substitutions were found in a single sequence; only 8.41% (n = 2,474) of the polymorphic sites showed substitutions in two or more sequences (Fig. 1B). Only 11 mutations were found in >5% of sequences, and only 7 were found in >10% of sequences (3 of which were adjacent). The mean pairwise diversity across genomes was 0.025%, ranging between 0.01% for E to 0.11% for N. A phylogenetic tree reconstructed based on all genome sequences reflected the global spread of the virus: Samples from the first 6 wk of the outbreak were collected predominantly from China (Fig. 1C). As the epidemic has progressed, samples have been increasingly obtained across Europe and from the United States (Fig. 1 A and C). The tree shows numerous introductions of different variants across the globe, with introductions from distant locations seeding local epidemics, where infections sometimes went unrecognized for several weeks and allowed wider spread (23). Prior to the severe travel restrictions that were seen in March 2020, intense travel patterns between China, Europe, and the United States allowed transmission of a myriad of variants, which is currently reflected by different lineages in the tree. Yet, the tree topology shows minimal structure, even at the genome level, indicating that SARS-CoV-2 viruses have not diverged significantly since the beginning of the pandemic. To compare how genomes differed from one another, we calculated Hamming distances (which correspond to the number of differences between two genomes) across all pairs of sequences. These Hamming distances showed a narrow distribution, with a median of seven substitutions between two independent genomes, while linked sequences sampled in cruise ships had a median of two substitutions (SI Appendix, Fig. S2). Surprisingly, Hamming distances across genomes sampled in the United States did not show a similar quasi-normal distribution but instead a bimodal distribution, observed despite the large number of sequences compared (n = 5,398). We identified that this bimodal distribution was driven by sequences from Washington State, possibly reflecting separate introductions in that state. Nonetheless, such a bimodal distribution could also indicate a bias in the sampling of sequences (SI Appendix, Fig. S2).

One S Mutation (D614G) Has Become Dominant.
Since the beginning of the pandemic, two mutations across the genome have become consensus: P4715L in ORF1ab (nucleotide 14,143, C to T) and D614G in S (nucleotide 23,403, A to G) (Fig. 1B) (a third consensus mutation, at nucleotide 3,037, is not reported as the site was masked during our sequence-filtering procedure). These mutations were found in 69.3% and 69.4% of sequences, respectively, and are in linkage (Fig. 2B). Given the importance of S for virus entry and as a target of the host neutralizing response, the biologic implications of the D614G mutation are under intense scrutiny (24–28). This mutation was first observed in a sequence from China dated January 24, with seven more sequences sampled until February 8. Then, the D614G mutation was not observed in China until March 13. In contrast, the D614G mutation was introduced in Europe at the end of January (first sequence identified in Germany, dated January 28), and it rapidly became dominant on that continent and at every location where the virus subsequently spread (Fig. 2A). The phylogenetic tree (Fig. 2B) and the distribution of sequences (Fig. 2C) are suggestive of a founder effect. The rapid spread of sequences carrying the D614G mutation implies that the growing viral population should become more homogeneous, that is, the frequency of sequences with shared polymorphisms will increase. We found a median of seven substitutions (based on a comparison of 18,514 sequences) between two independent SARS-CoV-2 genomes (SI Appendix, Fig. S2). Yet, genomes with the D614G mutation showed a median of five substitutions, whereas those with D at position 614 differed by eight substitutions (Fig. 2D).
Fig. 2.  The S mutation D614G quickly became dominant. The mutation D614G was found in 69% of sequences sampled globally as of May 18, 2020, the second most frequent mutation in S was only found in ∼2% of sequences. (A) Number of sequences with D (gray) or G (purple) by continent and sampling date shown cumulatively through the outbreak. (B) Phylogenetic tree reconstructed from all of the ORFs showing the linkage between D614G in S and P4715L in ORF1ab. Tips are colored by continent. The phylogeny suggests that these mutations were linked to a bottleneck event when SARS-CoV-2 viruses were introduced in Europe; this mutation was first seen in Europe in a sequence sampled in Germany at the end of January. There is no evidence that the increasing predominance of this mutation was caused by convergent selection events that would have occurred in multiple individuals. (C) Overall number of sequences with D614 or D614G across continents; the predominance of D614G in Europe is suggestive of a founder event. (D) Distribution of Hamming distances between sequences with D614, G614 or discordant pairs. The median is marked with a dashed line. To test whether this site was under selection, we used likelihood-based, phylogenetically informed models that assume branch-specific substitution rates (29) and implemented a sampling strategy to circumvent computational limitations imposed by the large number of sequences. Subsampled alignments (100 times at a 10% sampling fraction) had diversity estimates statistically similar to the complete alignment for each gene (Mann−Whitney U test, P > 0.09; SI Appendix, Fig. S3). In S, only site 614 was estimated to be under diversifying selection in a majority of subsampled alignments (58%); evidence of diversifying selection indicates that genetic diversity increases in the viral population (i.e., there was a higher proportion of mutations causing an amino acid change than not at site 614, or, the nonsynonymous/synonymous substitution rates ratio, dN/dS, was over 1, P < 0.1) (SI Appendix, Fig. S4). Because diversifying selection is often associated with the host adaptive response, we considered whether the D614G mutation coincided with targets of antibody and T cell responses. Site 614 is at the interface between the S1 and S2 subunits and is thus not highly accessible to antibodies (SI Appendix, Fig. S5). Therefore, we predict that antibodies to the native S protein would cross-react with S containing the D614G mutation, in agreement with recent reports (24, 25, 27, 28). Many known neutralizing antibodies target the RBD, yet we found little evidence that mutations could affect binding to the ACE2 receptor, as only five shared mutations were identified at contact sites with the ACE2 receptor, and all were found in 10 or fewer sequences. Of these, one mutation, at position 489, was synonymous and found in three sequences (0.02%). The others were nonsynonymous: G476S (n = 10 sequences, 0.05%), Y453F (n = 5, 0.02%), G446V (n = 3, 0.02%), and A475V (n = 2, 0.01%). To predict the potential immune pressure linked to T cell responses, we developed a T cell immunogenicity index which takes into account the CD8 and CD4 epitope repertoires in the structural proteins of SARS-CoV-2 (S, N, M, E) and the frequency of human leukocyte antigen (HLA) alleles or haplotypes in a given population. We found that sites with mutations, including 614 in S, were not colocalized with T cell epitopes frequently identified in different populations (SI Appendix, Figs. S6 and S7), and there was no significant relationship between the number of mutations and the immunogenicity index (SI Appendix, Fig. S8).

Most Sites in the SARS-CoV-2 Genome Were under Purifying Selection.
Using phylogenetically informed models (as described above), we identified two sites, residue 614 in S and 13 in N, that were under diversifying selection in a majority of subsampled alignments. For each protein, subsampled alignments tended to have more sites under purifying selection (median = 7.34 ± 4.06% [±SD]) than under diversifying selection (3.10 ± 1.92%) (Mann−Whitney U test, P = 0.057; SI Appendix, Fig. S4) (purifying selection is indicative of a decrease in genetic diversity in the population). Likewise, for each codon separately, the proportion of each phylogeny (i.e., the percentage of total branch length) with dN/dS > 1 was small, indicating diversifying selection was episodic and limited (Fig. 3A). Global measures of dN/dS varied across genes, ranging from 0.35 ± 0.02 (M) to 1.43 ± 0.24 (ORF10), and were significantly lower for structural genes compared to nonstructural genes (Mann−Whitney U test, P = 0.042) (Fig. 3B). Per-lineage nonsynonymous substitution rates were comparable (Student’s t test, P = 0.218) in structural (0.0011 ± 0.021) and nonstructural (0.0012 ± 0.028) genes, although some subsampled alignments showed rates that could be a hundred times higher than the median over all alignments (Fig. 3C). Across structural proteins, mutations were disproportionately neutral: >70.3% of branch length evolved under neutral (or negative) selection for all sites, and over half of all branch length evolved under neutral (or negative) selection for >82.8% of sites (Fig. 3D) (29).
Fig. 3.  Evolution across the SARS-CoV-2 genome. (A) Bar plot of the average percentage of branch length under diversifying selection (dN/dS > 1) for each site. (B) Bar plot of dN/dS per gene (dN = dS is shown as dashed line). Error bars indicate SD across subsampled alignments. (C) Box plot of nonsynonymous substitutions per lineage per site across structural and nonstructural genes. Values across subsampled alignments for each gene are plotted. (D) Average percentage (over subsampled alignments) of branch lengths evolving under neutral (or negative) selection per site for each structural gene. Median values are shown by dashed lines.

No Evidence of Differentiation of the Viral Population.
While there was only limited evidence of diversification at selected sites, we also assessed whether subpopulations among the globally circulating viral population had become genetically differentiated over time. To do so, we used two measures of population differentiation, the GST and D statistics, which characterize changes in allele frequency across populations and can show fitness differences between subpopulations (30–32). Genetic distances between two subpopulations can range between 0 and 1, indicating no and complete differentiation, respectively. We initially compared 30 genomes sampled from the initial outbreak in Wuhan, China, with subsampled alignments of the 18,484 genomes sampled subsequently across the globe. Although distances varied across genes, the median genetic distance between these subpopulations was small for both GST (0.0049 ± 0.0047) and D (0.0053 ± 0.0272), indicating little differentiation between the initial outbreak and its global derivatives in the pandemic (Fig. 4A). We then compared subpopulations sampled before and after each consecutive week. Similarly, genetic distances between subpopulations were small for both GST (0.0058 ± 0.0096) and D (0.0098 ± 0.0650) and tended to narrow over time rather than diverge (Fig. 4 B and C). Signatures of host adaptation can also be seen in the branching patterns of viral phylogenies. Bursts in transmissibility are emblematic of increases in relative viral fitness and are reflected in imbalances in the phylogeny, which can be estimated at each internal node (SI Appendix, Figs. S9 and S10) (33–35). We estimated phylogenetic η (36, 37) at each internal node of the SARS-CoV-2 phylogeny reconstructed from subsampled (10%) alignments and compared the distribution of estimates through time to phylogenies simulated under models of neutral and positive time-dependent rates (b(t) = beα(t)). Simulation analyses demonstrated that this metric was robust against sampling fraction (SI Appendix, Fig. S10). The distribution of η in the SARS-CoV-2 phylogenies adhered to expectations of the neutral model and deviated significantly (Student’s t test, P < 0.001) from those of positive time-dependent rates for selection coefficients α ≥ 0.2 (Fig. 4 D and E). Together, the SARS-CoV-2 population and phylogenetic dynamics showed little evidence that the global spread of SARS-CoV-2 was related to viral fitness effects.
Fig. 4.  Limited evidence of adaptation of the viral population. (A–C) Bootstrapped global estimates of Nei’s GST and Jost’s D for population differentiation for each structural gene. (A) Estimates of Nei’s GST (closed circles) and Jost’s D (open circles) comparing sequences sampled from the Hubei province to sequences subsequently sampled globally. Estimates of (B) Nei’s GST and (C) Jost’s D comparing sequences sampled before or after a specific date. Lines connect the median estimates across datasets for each gene. (D) Ln-transformed phylogenetic η, indicative of the number of iterative events in the sampled subtree, for subtrees from each internal node (after the root) of a down-sampled SARS-CoV-2 whole-genome phylogeny (dark gray), of a phylogeny simulated under neutral parameters (gold), and of a phylogeny simulated under positive time-dependent rates (b(t) = 0.01e0.4t, green). (E) Box plot of ln-transformed phylogenetic η estimates across all down-sampled SARS-CoV-2 whole-genome phylogenies, phylogenies simulated under neutral parameters, and phylogenies simulated under different positive time dependencies, α. Asterisks indicate significant differences in mean values (Student’s t test, P < 0.05) between the SARS-CoV-2 and positive time-dependent phylogenies at each α.

Sequence Identity with Potential Vaccine Candidates.
Typical vaccine design strategies rely on either 1) selecting sequences sampled from infected individuals or 2) computationally deriving sequences that cover the diversity seen across circulating sequences and are, in theory, optimal compared to an individual isolate (38). Computationally derived sequences include consensus and ancestral sequences, such as the most recent common ancestor (MRCA) of a set of sequences. We inferred the MRCA corresponding to 1) SARS-CoV-2 S sequences sampled from Wuhan within the first month of the epidemic, 2) all currently circulating SARS-CoV-2 sequences, and 3) all SARS-CoV-2 sequences together with closely related sequences sampled from pangolins (n = 6) and a bat. There were 17 mutations between the human MRCA and the human−bat MRCA and 44 mutations between the human MRCA and human−pangolin MRCA. Overall, three segments in S reflected significant variability across species (AA 439 to 445, 482 to 501, and 676 to 690) (Fig. 5 A and C and SI Appendix, Fig. S11) (39). In contrast, when considering only human sequences, SARS-CoV-2 diversity was limited: Both MRCAs (derived from early sequences from Wuhan or from all circulating sequences) were identical to the initial reference sequence Wuhan-Hu-1. Comparing these sequences to the consensus sequence derived from all of the sequences sampled to date, there was only one mutation: D614G (Fig. 5B). Fig. 5D illustrates that mutations found across circulating S sequences were rare: Besides D614G (found in 69.4% of sequences), the next most frequent substitution is found in 1.96% of sequences (synonymous), with sequences sampled from infected individuals, on average, 0.55 mutations away from the consensus sequence (consisting of 0.12 synonymous and 0.43 nonsynonymous mutations). Across the genome, there were, on average, 4.05 nucleotide mutations per individual genome when compared to the consensus, with only P4715L and D614G found in >50% of sequences.
Fig. 5.  Mutations across SARS-CoV-2 S sequences. (A) Structure of SARS-CoV (5 × 58) (shown instead of SARS-CoV-2 for completeness of the Receptor Binding Motif [RBM]). (B–D) The three protomers in the closed SARS-CoV-2 S glycoprotein (Protein Data Bank ID code 6VXX) are colored in yellow, cyan, and white. Sites with mutations are shown as spheres. (B) Near-identity of potential vaccine candidates. The MRCA and Wuhan-Hu-1 reference sequences were identical, while the consensus derived from all circulating sequences showed a mutation (D614G). Site 614 is located at the interface between two subunits. (C) Sequence segments that differed between human and pangolin or bat hosts. Amino acid segments 439 to 455 and 482 to 501 impact receptor binding, while the 574 to 690 segment corresponds to the S2 cleavage site. (D) Sites with shared mutations across SARS-CoV-2 circulating sequences. The colors of the spheres correspond to the proportion of SARS-CoV-2 sequences that differed from the Wuhan-Hu-1 sequence (GISAID: EPI_ISL_402125, GenBank: NC_045512). Mutations that were found only in one or two sequences were not represented.