Limited Diversity across 18,514 SARS-CoV-2 Genomes. To characterize SARS-CoV-2 diversification since the beginning of the epidemic, we aligned 27,977 SARS-CoV-2 genome sequences isolated from infected individuals in 84 countries. The alignment was curated to retain independent sequences that covered over 95% of the ORFs. In addition, because sequences from the United Kingdom constituted 47% of the dataset (n = 12,157), we sampled a representative set of 5,000 UK sequences, yielding a final dataset of 18,514 SARS-CoV-2 genomes (Fig. 1A and SI Appendix, Fig. S1).
Fig. 1. SARS-CoV-2 diversity across 18,514 genomes. (A) Distribution representing the location and date of sample collection. (B) Location and frequency of polymorphic sites across the genome, shown as the proportion of sequences with a polymorphism relative to the reference sequence, Wuhan-Hu-1 (GISAID: EPI_ISL_402125, GenBank: NC_045512). ORFs are shown in gray for nonstructural proteins and in color for structural proteins (S, purple; E, blue; M, green; N, red). (C) Global phylogeny of 18,514 independent genome sequences. The tree was rooted at the reference sequence, Wuhan-Hu-1, and tips are colored by collection location. The scale indicates the distance corresponding to one substitution. Lineages are labeled following PANGOLIN (22).
There were 7,559 polymorphic sites (that is, sites where at least one sequence has a change relative to the reference sequence) across the genome (total length: 29,409 nucleotides). Most substitutions were found in a single sequence; only 8.41% (n = 2,474) of all sites in the genome showed substitutions in two or more sequences (Fig. 1B). Only 11 mutations were found in >5% of sequences, and only 7 were found in >10% of sequences (3 of which were adjacent). The mean pairwise diversity across genomes was 0.025%, ranging from 0.01% for E to 0.11% for N.
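The definition of a polymorphic site used above (a site where at least one sequence differs from the reference) can be illustrated with a minimal sketch on a toy alignment. The function name `polymorphic_sites`, the toy sequences, and the choice to ignore gaps and ambiguous bases are assumptions made for illustration; they are not taken from the analysis pipeline used here.

```python
from collections import Counter

# Assumption: only unambiguous bases are compared; gaps ("-") and
# ambiguity codes such as "N" are ignored rather than counted as changes.
VALID = set("ACGT")

def polymorphic_sites(reference, sequences):
    """Map each polymorphic site (0-based index) to the number of
    sequences that differ from the reference at that site."""
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for i, (ref_base, base) in enumerate(zip(reference, seq)):
            if ref_base in VALID and base in VALID and base != ref_base:
                counts[i] += 1
    return dict(counts)

# Hypothetical toy alignment (10 sites, 4 sequences).
reference = "ACGTACGTAC"
alignment = [
    "ACGTACGTAC",  # identical to the reference
    "ATGTACGTAC",  # substitution at site 1
    "ATGTACGAAC",  # substitutions at sites 1 and 7
    "ACGTNCGTA-",  # ambiguous base and gap, both ignored
]

site_counts = polymorphic_sites(reference, alignment)
n_polymorphic = len(site_counts)                            # sites changed in >= 1 sequence
n_shared = sum(1 for c in site_counts.values() if c >= 2)   # sites changed in >= 2 sequences
pct_shared_of_genome = 100 * n_shared / len(reference)
```

In this toy case, site 1 is changed in two sequences and site 7 in one, so there are two polymorphic sites, one of which is shared by two or more sequences.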
A phylogenetic tree reconstructed from all genome sequences reflected the global spread of the virus: Samples from the first 6 wk of the outbreak were collected predominantly from China (Fig. 1C). As the epidemic has progressed, samples have been increasingly obtained from Europe and the United States (Fig. 1 A and C). The tree shows numerous introductions of different variants across the globe, with introductions from distant locations seeding local epidemics, where infections sometimes went unrecognized for several weeks and allowed wider spread (23). Prior to the severe travel restrictions imposed in March 2020, intense travel between China, Europe, and the United States allowed transmission of a myriad of variants, which is reflected by the different lineages in the tree. Yet, the tree topology shows minimal structure, even at the genome level, indicating that SARS-CoV-2 viruses have not diverged significantly since the beginning of the pandemic.
To compare how genomes differed from one another, we calculated Hamming distances (the number of differences between two genomes) across all pairs of sequences. These Hamming distances showed a narrow distribution, with a median of seven substitutions between two independent genomes, while linked sequences sampled on cruise ships had a median of two substitutions (SI Appendix, Fig. S2). Surprisingly, Hamming distances across genomes sampled in the United States did not show a similar quasi-normal distribution but instead a bimodal distribution, observed despite the large number of sequences compared (n = 5,398). We found that this bimodal distribution was driven by sequences from Washington State, possibly reflecting separate introductions in that state. Nonetheless, such a bimodal distribution could also indicate a bias in the sampling of sequences (SI Appendix, Fig. S2).
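The pairwise Hamming distance comparison described above can be sketched as follows. This is a minimal illustration on a hypothetical toy alignment, not the code used for the analysis; the function names and the decision to skip gaps and ambiguous bases are assumptions.

```python
from itertools import combinations
from statistics import median

# Assumption: positions with gaps or ambiguous bases in either sequence
# are skipped instead of being counted as differences.
VALID = set("ACGT")

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(1 for x, y in zip(a, b) if x in VALID and y in VALID and x != y)

def pairwise_hamming(sequences):
    """Hamming distance for every unordered pair of sequences."""
    return [hamming(a, b) for a, b in combinations(sequences, 2)]

# Hypothetical toy genomes (10 sites each).
genomes = [
    "ACGTACGTAC",
    "ATGTACGTAC",
    "ATGTACGAAC",
    "ACGTACGTAT",
]

distances = pairwise_hamming(genomes)
median_distance = median(distances)
```

The number of pairs grows quadratically with the number of genomes (n(n - 1)/2), so a dataset of this size involves on the order of 10^8 comparisons; a vectorized or compiled implementation would likely be needed in practice. Plotting a histogram of `distances` for each subset (for example, by country) is one way to see whether the distribution is unimodal or bimodal.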