TEST0

PMC:5985360 / 12088-14146 JSON TXT

Whole-Genome-Sequence-Based Haplotypes Reveal Single Origin of the Sickle Allele during the Holocene Wet Phase Abstract Five classical designations of sickle haplotypes are made on the basis of the presence or absence of restriction sites and are named after the ethno-linguistic groups or geographic regions from which the individuals with sickle cell anemia originated. Each haplotype is thought to represent an independent occurrence of the sickle mutation rs334 (c.20A>T [p.Glu7Val] in HBB). We investigated the origins of the sickle mutation by using whole-genome-sequence data. We identified 156 carriers from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We classified haplotypes by using 27 polymorphisms in linkage disequilibrium with rs334. Network analysis revealed a common haplotype that differed from the ancestral haplotype only by the derived sickle mutation at rs334 and correlated collectively with the Central African Republic (CAR), Cameroon, and Arabian/Indian haplotypes. Other haplotypes were derived from this haplotype and fell into two clusters, one composed of Senegal haplotypes and the other composed of Benin and Senegal haplotypes. The near-exclusive presence of the original sickle haplotype in the CAR, Kenya, Uganda, and South Africa is consistent with this haplotype predating the Bantu expansions. Modeling of balancing selection indicated that the heterozygote advantage was 15.2%, an equilibrium frequency of 12.0% was reached after 87 generations, and the selective environment predated the mutation. The posterior distribution of the ancestral recombination graph yielded a sickle mutation age of 259 generations, corresponding to 7,300 years ago during the Holocene Wet Phase. These results clarify the origin of the sickle allele and improve and simplify the classification of sickle haplotypes. Introduction Several hereditary variants in the hemoglobin genes afford protection against malaria (MIM: 611162). Many such variants are thought to have evolved in the last 10,000 years.1, 2 In the beta globin gene HBB (MIM: 141900), the sickle (MIM: 603903) allele βS is under balancing selection because of heterozygote advantage, which results from protection against malaria caused by Plasmodium falciparum and recessive lethality. The chromosomal background of the βS allele has been classified on the basis of the presence or absence of a set of seven canonical restriction sites, 5′ ε HincII → Gγ1 HindIII → Aγ1 HindIII → ψβ HincII → 3′ ψβ HincII → β AvaII → 3′ β BamHI, yielding five haplotypes, Arabian/Indian, Benin, Cameroon, CAR, and Senegal, named after ethno-linguistic groups or geographic regions, as well as a sixth category for “atypical” haplotypes.3, 4, 5, 6, 7, 8, 9 These designations do not necessarily indicate where the haplotypes originated. Whether the βS allele has a recent or old origin has been debated since the development of restriction fragment length polymorphism (RFLP) data.10, 11 According to the multicentric model, the origin of the βS allele is recent—within the last few thousand years—and each haplotype represents an independent occurrence of the same exact mutation in the corresponding geographic region.4, 12, 13, 14 In contrast, according to the unicentric model, the origin of the βS allele is anywhere from tens to hundreds of thousands of years old, and the mutation occurred once.15, 16, 17, 18 Suggested places of origin include equatorial Africa19 and the Middle East.20, 21 A 1.2 kb recombination hotspot exists 1 kb upstream of HBB.22 Consequently, recombination and gene conversion, rather than de novo mutation, have generated several haplotypes.3, 23, 24 We investigated the origins of the sickle allele by using whole-genome-sequence data from a total of 2,932 individuals from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We identified a total of 156 sickle carriers. We classified haplotypes by using phased sequence data comprising diallelic sites (both single-nucleotide polymorphisms and short insertions or deletions). We then used a combination of forward time simulation, phylogenetic network analysis, and coalescent analysis to infer a single origin of the sickle allele approximately 7,300 years ago, during the Holocene Wet Phase or Green Sahara. Material and Methods Ethics Statement This project was excluded from institutional-review-board review by the Office of Human Subjects Research Protections, National Institutes of Health (OHSRP ID# 17-NHGRI-00282). Sequence Data We retrieved whole-genome-sequence data from 2,504 individuals in the 1000 Genomes Project,25 320 individuals in the African Genome Variation Project,26 and 108 individuals from Qatar.27 As previously detailed by the 1000 Genomes Project, a three-stage approach was used to establish phased haplotypes: (1) given array-based genotype data and family information, SHAPEIT2 was used to estimate a scaffold of long-range phased haplotypes, (2) a combination of Beagle and SHAPEIT2 was used to jointly analyze the long-range phased haplotypes and di-allelic variants, and (3) MVNcall was used to place multi-allelic and structural variants onto the haplotype scaffold.25 The average switch error rate was 0.56%, and the average distance between phasing errors was 1,062 kb.25 For comparison, the length of the β-globin locus, including the locus control region, is 67 kb. All data were accessed with VCFtools version 0.1.14.28 We assessed linkage disequilibrium in the 1000 Genomes Project data by using the --hap-r2 function for phased haplotypes. Measuring Multi-locus Association We calculated Cramér’s V for the association between two nominal variables. In our analyses, the number of levels equaled the number of haplotypes. V2 is the square of the mean canonical correlation and is equivalent to Pearson’s r2 if at least one variable has only two levels. We implemented a bias-corrected version of Cramér’s V given by V˜=φ˜2/m˜, φ˜2=max(0,φ2−((r−1)(c−1)/n−1)), φ2=χ2/n, m˜=min(r˜−1,c˜−1), r˜=r−((r−1)2/n−1), c˜=c−((c−1)2/n−1), where n is the sample size and χ2 is the chi-square statistic without a continuity correction for a contingency table with r rows and c columns.29 African Ancestry We used YFitter to call Y chromosome haplogroups30 and HaploFind to call mitochondrial DNA haplogroups.31 We used projection analysis in ADMIXTURE version 1.332 with a reference panel of 21 global ancestries33 to analyze autosomal ancestry. To determine standard errors for the proportions of ancestral components for each individual, we reran ADMIXTURE with the addition of 200 bootstrap replicates. Accounting for both within and between individual variances, we calculated the proportions for average ancestry by using inverse variance weights. We then calculated 95% confidence intervals for each ancestry and individual, zeroed out any average proportions for which the 95% confidence intervals included 0, and renormalized the remaining averages to sum to 1. Balancing Selection Let the genotype frequencies of the sickle homozygote, heterozygote, and wild-type homozygote be p2, 2pq, and q2, respectively. Let the corresponding relative fitnesses be 0, 1+s, and 1, respectively.2 Then, at equilibrium, s=(p/1−2p). For each of the five continental African samples in the 1000 Genomes Project Phase 3 release version 5a, we estimated the effective population size Ne on the basis of the heterozygosities of all single-nucleotide polymorphisms (i.e., diallelic, triallelic, and quadrallelic) by assuming a mutation rate of 0.97×10−8 mutations per site per generation.34 We then took the harmonic mean of the five Ne estimates. Assuming one initial copy of the mutant allele, we simulated 1,000 generations under a combination of random genetic drift and balancing selection. We repeated this process 1,000 times. Phylogenetic Network Analysis Phylogenetic trees are based on models of evolution that assume a purely bifurcating process. This assumption is violated by several evolutionary processes, including recombination, gene conversion, and recurrent mutation. A single phylogenetic tree is unable to represent incompatible signals across sites. In contrast, a phylogenetic network is a more general form of graph that relaxes the assumption of bifurcation. Split decomposition is an inferential method that computes a set of incompatible splits from a given distance matrix. Trivial splits separate a set of taxa into two sets, one set containing a single taxon and the other set containing all other taxa, thereby defining terminal branches. Non-trivial splits separate a set of taxa into two sets, both containing at least two taxa, thereby defining internal branches. A split network depicts incompatible splits as parallel branches. We used the Neighbor-Net method implemented in SplitsTree version 4.13.1 to perform split decomposition analysis of haplotypes.35 Inferring the Ancestral Recombination Graph We used ARGweaver to infer the ancestral recombination graph.36 ARGweaver is based on the standard coalescent model and is sensitive to balancing selection, such that regions under balancing selection have older times to the most recent common ancestor than comparable neutral regions. To account for uncertainty in both the mutation and recombination rates, we used a grid approach. We tested mutation rates of 0.72×10−8, 0.97×10−8, and 1.44×10−8 mutations per generation per site.34 We tested recombination rates of 5×10−9, 1×10−8, 1.5×10−8 (the default value for human data), 2×10−8, and 2×10−7 recombinations per generation per site.34 We ran the sampler for 3,000 iterations and saved the ancestral recombination graph from every tenth iteration. We used the functions heidel.diag and geweke.diag in the coda library of R, version 3.2.3, to assess convergence diagnostics on the basis of the posterior distribution of the number of recombination events.37 To convert generations into years, we assumed a generation interval of 28 years.38, 39 Results Molecular Mapping of Restriction Sites We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334. Distributions of βS and the Classical Haplotypes In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes. Haplotype Classification Based on Linkage Disequilibrium We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers. To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1). Bioinformatic Annotation Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2). Balancing Selection We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations. Phylogenetic Network Analysis We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site. The Ancestral Recombination Graph Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years). Discussion There are two models of the origins of the sickle allele. The multicentric model posits five independent occurrences of the same mutation within the last few thousand years. The unicentric model posits a single occurrence and an older age. We used whole-genome-sequence data to provide insight into this issue. Using haplotypic classification and phylogenetic network analysis, we found clear and consistent evidence for a single origin of the sickle mutation. After accounting for recombination, we estimated that the sickle mutation is 259 [123,395] generations old. Starting from first principles of population genetics, we defined haplotypes on the basis of phased sequence data. In contrast, the classical haplotypes were based on patterns of restriction sites seen in individuals with sickle cell anemia.3, 4 By molecularly mapping restriction sites to the sequence data, we found that the restriction sites correlated poorly with rs334, such that none of the canonical sites were included in our sequence-based haplotypes. This discrepancy can be explained by the fact that the RFLP-defining markers circulate on both βS- and βA-carrying chromosomes, even though the restriction sites were originally used for defining types of βS-carrying chromosomes. In contrast, our haplotypes correlate strongly with rs334 and simultaneously correlate moderately with the restriction sites conditional on the presence of the βS allele, thereby capturing the classical patterns. There are four lines of evidence regarding the age of the sickle mutation: historical data, the simulations of balancing selection, the patriline data, and the analysis of the ancestral recombination graph. Although historical records are sparse, the earliest recorded cases of fevers that could have been caused by malaria were ∼5,000 years ago in China and were possibly due to Plasmodium vivax.2, 43, 44 The earliest recorded case of illness that could have been malaria specifically caused by Plasmodium falciparum could have been ∼4,000 years ago in Egypt and Sumer; however, Plasmodium falciparum could have been present in Africa several thousand years earlier.2 The first recorded cases of sickle cell anemia or, more broadly, sickle cell disease were in Egypt during the predynastic period (∼3200 BC45), in the Persian Gulf during the Hellenistic period (2,130 years before present46), and in Ghana in 1670 AD.47 The existence of the sickle allele in predynastic Egypt constrains the lower bound of the age of the sickle mutation to be 5,200 years. On the basis of these limited data, it is historically plausible that the selective environment preceded the sickle mutation, consistent with our balancing-selection simulations, which showed that the sickle allele would have been lost almost immediately without a heterozygote advantage (under the assumption of recessive lethality). The patriline data corroborate the age of the sickle mutation. Given a single origin, the sickle mutation most likely arose either in a population containing both E1b1a1a1f-L485 (also known as E1b1a1a1a1c-L485) and E1b1a1a1g-U175 (also known as E1b1a1a1a2a-U175) or in a population containing their common ancestor. Thus, the time of origin of the common ancestor of these two haplogroups is a plausible upper bound on the age of the sickle mutation. The common ancestor E1b1a1a1a-M4732 arose 10,500 (95% confidence interval between 9,200 and 12,000) years ago, consistent with the upper credible interval of 11,100 years ago that we estimated for the age of the sickle mutation. The Bantu expansions started ∼5,000 years ago and took ∼2,000 years to cross from present-day Cameroon to the Great Lakes region.48 The wide credible interval (3,400 to 11,100 years) of the age of the sickle mutation does not rule out the possibility that the mutation coincided with the onset of the Bantu expansions. However, our results support the hypothesis that the origin of the βS allele predated the onset of the Bantu expansions. First, the βS allele could have been at or near its equilibrium frequency when the Bantu expansions occurred ∼5,000 years ago, according to our estimation that the sickle mutation originated ∼7,300 years ago and that balancing selection ensued immediately and reached equilibrium in ∼2,400 years in an environment where malaria preexisted. Second, the likelihood that the βS allele is carried by migrating individuals increases with the frequency of that allele in the source population. Our results provide some suggestions as to where the sickle mutation might have originated. Descendants of the Y chromosome haplogroup E1b1a-V38 migrated across the Sahara from east to west,49 possibly around 19,000 years ago.50 E1b1a1-M2 most likely did not originate in eastern or northeastern Africa, but where it originated in either western or central Africa is unclear.49 Accordingly, the sickle mutation most likely did not occur in eastern or northeastern Africa. Our results indicate that the origin of the sickle mutation was in the middle of the Holocene Wet Phase, or Neolithic Subpluvial, which lasted from ∼7,500–7,000 BC to ∼3,500–3,000 BC. This time was the most recent of the Green Sahara periods, during which the Sahara experienced wet and rainy conditions.51 Our results thus support the Green Sahara as a possible place of origin of the sickle mutation. An alternative hypothesis is that the sickle allele arose in west-central Africa,13, 52 possibly in the northwestern portion of the equatorial rainforest. Our results also provide some suggestions as to where the three clusters might have originated. Two splits occurred early in the original βS-carrying population. The first split defined one cluster containing HAP1 and accounting for the Cameroon and CAR haplotypes. It is plausible that HAP1 was carried from an area in or around present-day Cameroon to the area that is presently the CAR, as well as to areas east and south, as part of the Bantu expansions. However, the Bantu expansions did not extend west and north. The second split subsequently separated the clusters containing HAP16 and HAP20, the modal haplotypes accounting for the Benin and Senegal haplotypes, respectively. HAP16 and HAP20 shared one derived mutation, consistent with an early split. Furthermore, given the subsequent accumulation of derived mutations not shared between HAP16 and HAP20, effectively no gene flow occurred between these two descendant populations, consistent with geographic separation. We therefore hypothesize that the common ancestor of these two clusters existed north of Cameroon among non-Bantu-speaking peoples in or around present-day Nigeria. From this common ancestral population, a group of migrants separated and traveled west and north to the area around present-day Senegal and the Gambia. These migrants could have taken a coastal or an inland route. The finding that the Senegal haplotype was the predominant haplotype in the sample of Mende from Sierra Leone is consistent with a coastal route. We do not have data to investigate an inland route, but we note that the frequency of the sickle allele is higher along the coast than inland.53 Classical haplotypes in eastern Arabia tend to have the Arabian/Indian designation, whereas those in western Arabia tend to have the Benin designation.18, 21 The Arabian/Indian haplotype has been hypothesized to have originated in either east Saudi Arabia or India.5 Although our samples include only one predicted instance of the Arabian/Indian haplotype, the occurrence of this haplotype in the Luhya in Kenya and its clustering with the predominant haplotype found in Kenya and Uganda are consistent with the hypothesis that the Arabian/Indian haplotype originated in Africa and had an overseas migration route from eastern Africa to eastern Arabia and India.13, 19 In contrast, the absence of the Benin haplotype in the Luhya in Kenya and the Baganda in Uganda provides evidence against an overseas migration route from eastern Africa to western Arabia. Instead, the presence of the Benin haplotype in western Arabia is consistent with an African origin and an overland migration route through northeast Africa.33 Together, our results suggest the following evolutionary history of the sickle mutation. The presence of African ancestry, an African patriline, and/or an African matriline in all sickle carriers, combined with the absence of Arabian or Indian ancestries in the five continental African samples in the 1000 Genomes Project,33 supports an African origin of the sickle mutation. The sickle mutation occurred once approximately 7,300 years ago either in the Sahara or in west-central Africa. In an environment where malaria preexisted, balancing selection took approximately 2,400 years to drive the βS allele to an equilibrium frequency of 12.0%. A population split occurred, possibly in the area around present-day Cameroon. Starting approximately 5,000 years ago, the Bantu expansions resulted in the spread of HAP1 to the CAR, the Great Lakes region, and South Africa. After acquiring one derived allele, the population split into one carrying a haplotype that evolved into HAP16 and another carrying a haplotype that evolved into HAP20. In both clusters, hitchhiking of the derived alleles with the βS allele resulted in similar derived allele frequencies. Our study includes sequence data from western, west-central, and eastern Africa but lacks comparable data from northern, central, and southern Africa. A haplotype has been identified among Sudanese individuals that was classified as atypical but might be a fifth African haplotype.54 It is possible that sampling additional populations could provide more evidence regarding when, where, and in whom the sickle mutation arose. Our findings support classification of sickle haplotypes based on three clusters. Notably, we found that the Senegal haplotype is substructured into two clusters, one containing only Senegal haplotypes and one containing Benin and Senegal haplotypes. This substructuring of haplotypes might have confounded previous assessments of clinical phenotype or disease severity. At the same time, the identification of markers linked to the sickle allele provides an opportunity to investigate possible cis-effects, such as differential expression of HBD, HBE1, and HBG2, on disease severity. Web Resources 1000 Genomes Project, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ African Genome Variation Project, https://www.ebi.ac.uk/ega/studies/EGAS00001000959 Ensembl, http://www.ensembl.org/Homo_sapiens/Info/Index E YTree, https://www.yfull.com/tree/E Qatar sequence data, Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/?term=SRP060765 OMIM, http://www.omim.org Supplemental Data Document S1. Figure S1 Spreadsheet S1. Tables S1 and S2 Document S2. Article plus Supplemental Data Acknowledgments We acknowledge the assistance of Neil Hanchard with the restriction site data. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official view of the National Institutes of Health. This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (1ZIAHG200362). Supplemental Data include one figure and two tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.02.003.

Document structure show

article-title	Whole-Genome-Sequence-Based Haplotypes Reveal Single Origin of the Sickle Allele during the Holocene Wet Phase
abstract	Five classical designations of sickle haplotypes are made on the basis of the presence or absence of restriction sites and are named after the ethno-linguistic groups or geographic regions from which the individuals with sickle cell anemia originated. Each haplotype is thought to represent an independent occurrence of the sickle mutation rs334 (c.20A>T [p.Glu7Val] in HBB). We investigated the origins of the sickle mutation by using whole-genome-sequence data. We identified 156 carriers from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We classified haplotypes by using 27 polymorphisms in linkage disequilibrium with rs334. Network analysis revealed a common haplotype that differed from the ancestral haplotype only by the derived sickle mutation at rs334 and correlated collectively with the Central African Republic (CAR), Cameroon, and Arabian/Indian haplotypes. Other haplotypes were derived from this haplotype and fell into two clusters, one composed of Senegal haplotypes and the other composed of Benin and Senegal haplotypes. The near-exclusive presence of the original sickle haplotype in the CAR, Kenya, Uganda, and South Africa is consistent with this haplotype predating the Bantu expansions. Modeling of balancing selection indicated that the heterozygote advantage was 15.2%, an equilibrium frequency of 12.0% was reached after 87 generations, and the selective environment predated the mutation. The posterior distribution of the ancestral recombination graph yielded a sickle mutation age of 259 generations, corresponding to 7,300 years ago during the Holocene Wet Phase. These results clarify the origin of the sickle allele and improve and simplify the classification of sickle haplotypes.
p	Five classical designations of sickle haplotypes are made on the basis of the presence or absence of restriction sites and are named after the ethno-linguistic groups or geographic regions from which the individuals with sickle cell anemia originated. Each haplotype is thought to represent an independent occurrence of the sickle mutation rs334 (c.20A>T [p.Glu7Val] in HBB). We investigated the origins of the sickle mutation by using whole-genome-sequence data. We identified 156 carriers from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We classified haplotypes by using 27 polymorphisms in linkage disequilibrium with rs334. Network analysis revealed a common haplotype that differed from the ancestral haplotype only by the derived sickle mutation at rs334 and correlated collectively with the Central African Republic (CAR), Cameroon, and Arabian/Indian haplotypes. Other haplotypes were derived from this haplotype and fell into two clusters, one composed of Senegal haplotypes and the other composed of Benin and Senegal haplotypes. The near-exclusive presence of the original sickle haplotype in the CAR, Kenya, Uganda, and South Africa is consistent with this haplotype predating the Bantu expansions. Modeling of balancing selection indicated that the heterozygote advantage was 15.2%, an equilibrium frequency of 12.0% was reached after 87 generations, and the selective environment predated the mutation. The posterior distribution of the ancestral recombination graph yielded a sickle mutation age of 259 generations, corresponding to 7,300 years ago during the Holocene Wet Phase. These results clarify the origin of the sickle allele and improve and simplify the classification of sickle haplotypes.
body	Introduction Several hereditary variants in the hemoglobin genes afford protection against malaria (MIM: 611162). Many such variants are thought to have evolved in the last 10,000 years.1, 2 In the beta globin gene HBB (MIM: 141900), the sickle (MIM: 603903) allele βS is under balancing selection because of heterozygote advantage, which results from protection against malaria caused by Plasmodium falciparum and recessive lethality. The chromosomal background of the βS allele has been classified on the basis of the presence or absence of a set of seven canonical restriction sites, 5′ ε HincII → Gγ1 HindIII → Aγ1 HindIII → ψβ HincII → 3′ ψβ HincII → β AvaII → 3′ β BamHI, yielding five haplotypes, Arabian/Indian, Benin, Cameroon, CAR, and Senegal, named after ethno-linguistic groups or geographic regions, as well as a sixth category for “atypical” haplotypes.3, 4, 5, 6, 7, 8, 9 These designations do not necessarily indicate where the haplotypes originated. Whether the βS allele has a recent or old origin has been debated since the development of restriction fragment length polymorphism (RFLP) data.10, 11 According to the multicentric model, the origin of the βS allele is recent—within the last few thousand years—and each haplotype represents an independent occurrence of the same exact mutation in the corresponding geographic region.4, 12, 13, 14 In contrast, according to the unicentric model, the origin of the βS allele is anywhere from tens to hundreds of thousands of years old, and the mutation occurred once.15, 16, 17, 18 Suggested places of origin include equatorial Africa19 and the Middle East.20, 21 A 1.2 kb recombination hotspot exists 1 kb upstream of HBB.22 Consequently, recombination and gene conversion, rather than de novo mutation, have generated several haplotypes.3, 23, 24 We investigated the origins of the sickle allele by using whole-genome-sequence data from a total of 2,932 individuals from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We identified a total of 156 sickle carriers. We classified haplotypes by using phased sequence data comprising diallelic sites (both single-nucleotide polymorphisms and short insertions or deletions). We then used a combination of forward time simulation, phylogenetic network analysis, and coalescent analysis to infer a single origin of the sickle allele approximately 7,300 years ago, during the Holocene Wet Phase or Green Sahara. Material and Methods Ethics Statement This project was excluded from institutional-review-board review by the Office of Human Subjects Research Protections, National Institutes of Health (OHSRP ID# 17-NHGRI-00282). Sequence Data We retrieved whole-genome-sequence data from 2,504 individuals in the 1000 Genomes Project,25 320 individuals in the African Genome Variation Project,26 and 108 individuals from Qatar.27 As previously detailed by the 1000 Genomes Project, a three-stage approach was used to establish phased haplotypes: (1) given array-based genotype data and family information, SHAPEIT2 was used to estimate a scaffold of long-range phased haplotypes, (2) a combination of Beagle and SHAPEIT2 was used to jointly analyze the long-range phased haplotypes and di-allelic variants, and (3) MVNcall was used to place multi-allelic and structural variants onto the haplotype scaffold.25 The average switch error rate was 0.56%, and the average distance between phasing errors was 1,062 kb.25 For comparison, the length of the β-globin locus, including the locus control region, is 67 kb. All data were accessed with VCFtools version 0.1.14.28 We assessed linkage disequilibrium in the 1000 Genomes Project data by using the --hap-r2 function for phased haplotypes. Measuring Multi-locus Association We calculated Cramér’s V for the association between two nominal variables. In our analyses, the number of levels equaled the number of haplotypes. V2 is the square of the mean canonical correlation and is equivalent to Pearson’s r2 if at least one variable has only two levels. We implemented a bias-corrected version of Cramér’s V given by V˜=φ˜2/m˜, φ˜2=max(0,φ2−((r−1)(c−1)/n−1)), φ2=χ2/n, m˜=min(r˜−1,c˜−1), r˜=r−((r−1)2/n−1), c˜=c−((c−1)2/n−1), where n is the sample size and χ2 is the chi-square statistic without a continuity correction for a contingency table with r rows and c columns.29 African Ancestry We used YFitter to call Y chromosome haplogroups30 and HaploFind to call mitochondrial DNA haplogroups.31 We used projection analysis in ADMIXTURE version 1.332 with a reference panel of 21 global ancestries33 to analyze autosomal ancestry. To determine standard errors for the proportions of ancestral components for each individual, we reran ADMIXTURE with the addition of 200 bootstrap replicates. Accounting for both within and between individual variances, we calculated the proportions for average ancestry by using inverse variance weights. We then calculated 95% confidence intervals for each ancestry and individual, zeroed out any average proportions for which the 95% confidence intervals included 0, and renormalized the remaining averages to sum to 1. Balancing Selection Let the genotype frequencies of the sickle homozygote, heterozygote, and wild-type homozygote be p2, 2pq, and q2, respectively. Let the corresponding relative fitnesses be 0, 1+s, and 1, respectively.2 Then, at equilibrium, s=(p/1−2p). For each of the five continental African samples in the 1000 Genomes Project Phase 3 release version 5a, we estimated the effective population size Ne on the basis of the heterozygosities of all single-nucleotide polymorphisms (i.e., diallelic, triallelic, and quadrallelic) by assuming a mutation rate of 0.97×10−8 mutations per site per generation.34 We then took the harmonic mean of the five Ne estimates. Assuming one initial copy of the mutant allele, we simulated 1,000 generations under a combination of random genetic drift and balancing selection. We repeated this process 1,000 times. Phylogenetic Network Analysis Phylogenetic trees are based on models of evolution that assume a purely bifurcating process. This assumption is violated by several evolutionary processes, including recombination, gene conversion, and recurrent mutation. A single phylogenetic tree is unable to represent incompatible signals across sites. In contrast, a phylogenetic network is a more general form of graph that relaxes the assumption of bifurcation. Split decomposition is an inferential method that computes a set of incompatible splits from a given distance matrix. Trivial splits separate a set of taxa into two sets, one set containing a single taxon and the other set containing all other taxa, thereby defining terminal branches. Non-trivial splits separate a set of taxa into two sets, both containing at least two taxa, thereby defining internal branches. A split network depicts incompatible splits as parallel branches. We used the Neighbor-Net method implemented in SplitsTree version 4.13.1 to perform split decomposition analysis of haplotypes.35 Inferring the Ancestral Recombination Graph We used ARGweaver to infer the ancestral recombination graph.36 ARGweaver is based on the standard coalescent model and is sensitive to balancing selection, such that regions under balancing selection have older times to the most recent common ancestor than comparable neutral regions. To account for uncertainty in both the mutation and recombination rates, we used a grid approach. We tested mutation rates of 0.72×10−8, 0.97×10−8, and 1.44×10−8 mutations per generation per site.34 We tested recombination rates of 5×10−9, 1×10−8, 1.5×10−8 (the default value for human data), 2×10−8, and 2×10−7 recombinations per generation per site.34 We ran the sampler for 3,000 iterations and saved the ancestral recombination graph from every tenth iteration. We used the functions heidel.diag and geweke.diag in the coda library of R, version 3.2.3, to assess convergence diagnostics on the basis of the posterior distribution of the number of recombination events.37 To convert generations into years, we assumed a generation interval of 28 years.38, 39 Results Molecular Mapping of Restriction Sites We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334. Distributions of βS and the Classical Haplotypes In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes. Haplotype Classification Based on Linkage Disequilibrium We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers. To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1). Bioinformatic Annotation Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2). Balancing Selection We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations. Phylogenetic Network Analysis We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site. The Ancestral Recombination Graph Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years). Discussion There are two models of the origins of the sickle allele. The multicentric model posits five independent occurrences of the same mutation within the last few thousand years. The unicentric model posits a single occurrence and an older age. We used whole-genome-sequence data to provide insight into this issue. Using haplotypic classification and phylogenetic network analysis, we found clear and consistent evidence for a single origin of the sickle mutation. After accounting for recombination, we estimated that the sickle mutation is 259 [123,395] generations old. Starting from first principles of population genetics, we defined haplotypes on the basis of phased sequence data. In contrast, the classical haplotypes were based on patterns of restriction sites seen in individuals with sickle cell anemia.3, 4 By molecularly mapping restriction sites to the sequence data, we found that the restriction sites correlated poorly with rs334, such that none of the canonical sites were included in our sequence-based haplotypes. This discrepancy can be explained by the fact that the RFLP-defining markers circulate on both βS- and βA-carrying chromosomes, even though the restriction sites were originally used for defining types of βS-carrying chromosomes. In contrast, our haplotypes correlate strongly with rs334 and simultaneously correlate moderately with the restriction sites conditional on the presence of the βS allele, thereby capturing the classical patterns. There are four lines of evidence regarding the age of the sickle mutation: historical data, the simulations of balancing selection, the patriline data, and the analysis of the ancestral recombination graph. Although historical records are sparse, the earliest recorded cases of fevers that could have been caused by malaria were ∼5,000 years ago in China and were possibly due to Plasmodium vivax.2, 43, 44 The earliest recorded case of illness that could have been malaria specifically caused by Plasmodium falciparum could have been ∼4,000 years ago in Egypt and Sumer; however, Plasmodium falciparum could have been present in Africa several thousand years earlier.2 The first recorded cases of sickle cell anemia or, more broadly, sickle cell disease were in Egypt during the predynastic period (∼3200 BC45), in the Persian Gulf during the Hellenistic period (2,130 years before present46), and in Ghana in 1670 AD.47 The existence of the sickle allele in predynastic Egypt constrains the lower bound of the age of the sickle mutation to be 5,200 years. On the basis of these limited data, it is historically plausible that the selective environment preceded the sickle mutation, consistent with our balancing-selection simulations, which showed that the sickle allele would have been lost almost immediately without a heterozygote advantage (under the assumption of recessive lethality). The patriline data corroborate the age of the sickle mutation. Given a single origin, the sickle mutation most likely arose either in a population containing both E1b1a1a1f-L485 (also known as E1b1a1a1a1c-L485) and E1b1a1a1g-U175 (also known as E1b1a1a1a2a-U175) or in a population containing their common ancestor. Thus, the time of origin of the common ancestor of these two haplogroups is a plausible upper bound on the age of the sickle mutation. The common ancestor E1b1a1a1a-M4732 arose 10,500 (95% confidence interval between 9,200 and 12,000) years ago, consistent with the upper credible interval of 11,100 years ago that we estimated for the age of the sickle mutation. The Bantu expansions started ∼5,000 years ago and took ∼2,000 years to cross from present-day Cameroon to the Great Lakes region.48 The wide credible interval (3,400 to 11,100 years) of the age of the sickle mutation does not rule out the possibility that the mutation coincided with the onset of the Bantu expansions. However, our results support the hypothesis that the origin of the βS allele predated the onset of the Bantu expansions. First, the βS allele could have been at or near its equilibrium frequency when the Bantu expansions occurred ∼5,000 years ago, according to our estimation that the sickle mutation originated ∼7,300 years ago and that balancing selection ensued immediately and reached equilibrium in ∼2,400 years in an environment where malaria preexisted. Second, the likelihood that the βS allele is carried by migrating individuals increases with the frequency of that allele in the source population. Our results provide some suggestions as to where the sickle mutation might have originated. Descendants of the Y chromosome haplogroup E1b1a-V38 migrated across the Sahara from east to west,49 possibly around 19,000 years ago.50 E1b1a1-M2 most likely did not originate in eastern or northeastern Africa, but where it originated in either western or central Africa is unclear.49 Accordingly, the sickle mutation most likely did not occur in eastern or northeastern Africa. Our results indicate that the origin of the sickle mutation was in the middle of the Holocene Wet Phase, or Neolithic Subpluvial, which lasted from ∼7,500–7,000 BC to ∼3,500–3,000 BC. This time was the most recent of the Green Sahara periods, during which the Sahara experienced wet and rainy conditions.51 Our results thus support the Green Sahara as a possible place of origin of the sickle mutation. An alternative hypothesis is that the sickle allele arose in west-central Africa,13, 52 possibly in the northwestern portion of the equatorial rainforest. Our results also provide some suggestions as to where the three clusters might have originated. Two splits occurred early in the original βS-carrying population. The first split defined one cluster containing HAP1 and accounting for the Cameroon and CAR haplotypes. It is plausible that HAP1 was carried from an area in or around present-day Cameroon to the area that is presently the CAR, as well as to areas east and south, as part of the Bantu expansions. However, the Bantu expansions did not extend west and north. The second split subsequently separated the clusters containing HAP16 and HAP20, the modal haplotypes accounting for the Benin and Senegal haplotypes, respectively. HAP16 and HAP20 shared one derived mutation, consistent with an early split. Furthermore, given the subsequent accumulation of derived mutations not shared between HAP16 and HAP20, effectively no gene flow occurred between these two descendant populations, consistent with geographic separation. We therefore hypothesize that the common ancestor of these two clusters existed north of Cameroon among non-Bantu-speaking peoples in or around present-day Nigeria. From this common ancestral population, a group of migrants separated and traveled west and north to the area around present-day Senegal and the Gambia. These migrants could have taken a coastal or an inland route. The finding that the Senegal haplotype was the predominant haplotype in the sample of Mende from Sierra Leone is consistent with a coastal route. We do not have data to investigate an inland route, but we note that the frequency of the sickle allele is higher along the coast than inland.53 Classical haplotypes in eastern Arabia tend to have the Arabian/Indian designation, whereas those in western Arabia tend to have the Benin designation.18, 21 The Arabian/Indian haplotype has been hypothesized to have originated in either east Saudi Arabia or India.5 Although our samples include only one predicted instance of the Arabian/Indian haplotype, the occurrence of this haplotype in the Luhya in Kenya and its clustering with the predominant haplotype found in Kenya and Uganda are consistent with the hypothesis that the Arabian/Indian haplotype originated in Africa and had an overseas migration route from eastern Africa to eastern Arabia and India.13, 19 In contrast, the absence of the Benin haplotype in the Luhya in Kenya and the Baganda in Uganda provides evidence against an overseas migration route from eastern Africa to western Arabia. Instead, the presence of the Benin haplotype in western Arabia is consistent with an African origin and an overland migration route through northeast Africa.33 Together, our results suggest the following evolutionary history of the sickle mutation. The presence of African ancestry, an African patriline, and/or an African matriline in all sickle carriers, combined with the absence of Arabian or Indian ancestries in the five continental African samples in the 1000 Genomes Project,33 supports an African origin of the sickle mutation. The sickle mutation occurred once approximately 7,300 years ago either in the Sahara or in west-central Africa. In an environment where malaria preexisted, balancing selection took approximately 2,400 years to drive the βS allele to an equilibrium frequency of 12.0%. A population split occurred, possibly in the area around present-day Cameroon. Starting approximately 5,000 years ago, the Bantu expansions resulted in the spread of HAP1 to the CAR, the Great Lakes region, and South Africa. After acquiring one derived allele, the population split into one carrying a haplotype that evolved into HAP16 and another carrying a haplotype that evolved into HAP20. In both clusters, hitchhiking of the derived alleles with the βS allele resulted in similar derived allele frequencies. Our study includes sequence data from western, west-central, and eastern Africa but lacks comparable data from northern, central, and southern Africa. A haplotype has been identified among Sudanese individuals that was classified as atypical but might be a fifth African haplotype.54 It is possible that sampling additional populations could provide more evidence regarding when, where, and in whom the sickle mutation arose. Our findings support classification of sickle haplotypes based on three clusters. Notably, we found that the Senegal haplotype is substructured into two clusters, one containing only Senegal haplotypes and one containing Benin and Senegal haplotypes. This substructuring of haplotypes might have confounded previous assessments of clinical phenotype or disease severity. At the same time, the identification of markers linked to the sickle allele provides an opportunity to investigate possible cis-effects, such as differential expression of HBD, HBE1, and HBG2, on disease severity.
sec	Introduction Several hereditary variants in the hemoglobin genes afford protection against malaria (MIM: 611162). Many such variants are thought to have evolved in the last 10,000 years.1, 2 In the beta globin gene HBB (MIM: 141900), the sickle (MIM: 603903) allele βS is under balancing selection because of heterozygote advantage, which results from protection against malaria caused by Plasmodium falciparum and recessive lethality. The chromosomal background of the βS allele has been classified on the basis of the presence or absence of a set of seven canonical restriction sites, 5′ ε HincII → Gγ1 HindIII → Aγ1 HindIII → ψβ HincII → 3′ ψβ HincII → β AvaII → 3′ β BamHI, yielding five haplotypes, Arabian/Indian, Benin, Cameroon, CAR, and Senegal, named after ethno-linguistic groups or geographic regions, as well as a sixth category for “atypical” haplotypes.3, 4, 5, 6, 7, 8, 9 These designations do not necessarily indicate where the haplotypes originated. Whether the βS allele has a recent or old origin has been debated since the development of restriction fragment length polymorphism (RFLP) data.10, 11 According to the multicentric model, the origin of the βS allele is recent—within the last few thousand years—and each haplotype represents an independent occurrence of the same exact mutation in the corresponding geographic region.4, 12, 13, 14 In contrast, according to the unicentric model, the origin of the βS allele is anywhere from tens to hundreds of thousands of years old, and the mutation occurred once.15, 16, 17, 18 Suggested places of origin include equatorial Africa19 and the Middle East.20, 21 A 1.2 kb recombination hotspot exists 1 kb upstream of HBB.22 Consequently, recombination and gene conversion, rather than de novo mutation, have generated several haplotypes.3, 23, 24 We investigated the origins of the sickle allele by using whole-genome-sequence data from a total of 2,932 individuals from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We identified a total of 156 sickle carriers. We classified haplotypes by using phased sequence data comprising diallelic sites (both single-nucleotide polymorphisms and short insertions or deletions). We then used a combination of forward time simulation, phylogenetic network analysis, and coalescent analysis to infer a single origin of the sickle allele approximately 7,300 years ago, during the Holocene Wet Phase or Green Sahara.
title	Introduction
p	Several hereditary variants in the hemoglobin genes afford protection against malaria (MIM: 611162). Many such variants are thought to have evolved in the last 10,000 years.1, 2 In the beta globin gene HBB (MIM: 141900), the sickle (MIM: 603903) allele βS is under balancing selection because of heterozygote advantage, which results from protection against malaria caused by Plasmodium falciparum and recessive lethality. The chromosomal background of the βS allele has been classified on the basis of the presence or absence of a set of seven canonical restriction sites, 5′ ε HincII → Gγ1 HindIII → Aγ1 HindIII → ψβ HincII → 3′ ψβ HincII → β AvaII → 3′ β BamHI, yielding five haplotypes, Arabian/Indian, Benin, Cameroon, CAR, and Senegal, named after ethno-linguistic groups or geographic regions, as well as a sixth category for “atypical” haplotypes.3, 4, 5, 6, 7, 8, 9 These designations do not necessarily indicate where the haplotypes originated.
p	Whether the βS allele has a recent or old origin has been debated since the development of restriction fragment length polymorphism (RFLP) data.10, 11 According to the multicentric model, the origin of the βS allele is recent—within the last few thousand years—and each haplotype represents an independent occurrence of the same exact mutation in the corresponding geographic region.4, 12, 13, 14 In contrast, according to the unicentric model, the origin of the βS allele is anywhere from tens to hundreds of thousands of years old, and the mutation occurred once.15, 16, 17, 18 Suggested places of origin include equatorial Africa19 and the Middle East.20, 21 A 1.2 kb recombination hotspot exists 1 kb upstream of HBB.22 Consequently, recombination and gene conversion, rather than de novo mutation, have generated several haplotypes.3, 23, 24
p	We investigated the origins of the sickle allele by using whole-genome-sequence data from a total of 2,932 individuals from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We identified a total of 156 sickle carriers. We classified haplotypes by using phased sequence data comprising diallelic sites (both single-nucleotide polymorphisms and short insertions or deletions). We then used a combination of forward time simulation, phylogenetic network analysis, and coalescent analysis to infer a single origin of the sickle allele approximately 7,300 years ago, during the Holocene Wet Phase or Green Sahara.
sec	Material and Methods Ethics Statement This project was excluded from institutional-review-board review by the Office of Human Subjects Research Protections, National Institutes of Health (OHSRP ID# 17-NHGRI-00282). Sequence Data We retrieved whole-genome-sequence data from 2,504 individuals in the 1000 Genomes Project,25 320 individuals in the African Genome Variation Project,26 and 108 individuals from Qatar.27 As previously detailed by the 1000 Genomes Project, a three-stage approach was used to establish phased haplotypes: (1) given array-based genotype data and family information, SHAPEIT2 was used to estimate a scaffold of long-range phased haplotypes, (2) a combination of Beagle and SHAPEIT2 was used to jointly analyze the long-range phased haplotypes and di-allelic variants, and (3) MVNcall was used to place multi-allelic and structural variants onto the haplotype scaffold.25 The average switch error rate was 0.56%, and the average distance between phasing errors was 1,062 kb.25 For comparison, the length of the β-globin locus, including the locus control region, is 67 kb. All data were accessed with VCFtools version 0.1.14.28 We assessed linkage disequilibrium in the 1000 Genomes Project data by using the --hap-r2 function for phased haplotypes. Measuring Multi-locus Association We calculated Cramér’s V for the association between two nominal variables. In our analyses, the number of levels equaled the number of haplotypes. V2 is the square of the mean canonical correlation and is equivalent to Pearson’s r2 if at least one variable has only two levels. We implemented a bias-corrected version of Cramér’s V given by V˜=φ˜2/m˜, φ˜2=max(0,φ2−((r−1)(c−1)/n−1)), φ2=χ2/n, m˜=min(r˜−1,c˜−1), r˜=r−((r−1)2/n−1), c˜=c−((c−1)2/n−1), where n is the sample size and χ2 is the chi-square statistic without a continuity correction for a contingency table with r rows and c columns.29 African Ancestry We used YFitter to call Y chromosome haplogroups30 and HaploFind to call mitochondrial DNA haplogroups.31 We used projection analysis in ADMIXTURE version 1.332 with a reference panel of 21 global ancestries33 to analyze autosomal ancestry. To determine standard errors for the proportions of ancestral components for each individual, we reran ADMIXTURE with the addition of 200 bootstrap replicates. Accounting for both within and between individual variances, we calculated the proportions for average ancestry by using inverse variance weights. We then calculated 95% confidence intervals for each ancestry and individual, zeroed out any average proportions for which the 95% confidence intervals included 0, and renormalized the remaining averages to sum to 1. Balancing Selection Let the genotype frequencies of the sickle homozygote, heterozygote, and wild-type homozygote be p2, 2pq, and q2, respectively. Let the corresponding relative fitnesses be 0, 1+s, and 1, respectively.2 Then, at equilibrium, s=(p/1−2p). For each of the five continental African samples in the 1000 Genomes Project Phase 3 release version 5a, we estimated the effective population size Ne on the basis of the heterozygosities of all single-nucleotide polymorphisms (i.e., diallelic, triallelic, and quadrallelic) by assuming a mutation rate of 0.97×10−8 mutations per site per generation.34 We then took the harmonic mean of the five Ne estimates. Assuming one initial copy of the mutant allele, we simulated 1,000 generations under a combination of random genetic drift and balancing selection. We repeated this process 1,000 times. Phylogenetic Network Analysis Phylogenetic trees are based on models of evolution that assume a purely bifurcating process. This assumption is violated by several evolutionary processes, including recombination, gene conversion, and recurrent mutation. A single phylogenetic tree is unable to represent incompatible signals across sites. In contrast, a phylogenetic network is a more general form of graph that relaxes the assumption of bifurcation. Split decomposition is an inferential method that computes a set of incompatible splits from a given distance matrix. Trivial splits separate a set of taxa into two sets, one set containing a single taxon and the other set containing all other taxa, thereby defining terminal branches. Non-trivial splits separate a set of taxa into two sets, both containing at least two taxa, thereby defining internal branches. A split network depicts incompatible splits as parallel branches. We used the Neighbor-Net method implemented in SplitsTree version 4.13.1 to perform split decomposition analysis of haplotypes.35 Inferring the Ancestral Recombination Graph We used ARGweaver to infer the ancestral recombination graph.36 ARGweaver is based on the standard coalescent model and is sensitive to balancing selection, such that regions under balancing selection have older times to the most recent common ancestor than comparable neutral regions. To account for uncertainty in both the mutation and recombination rates, we used a grid approach. We tested mutation rates of 0.72×10−8, 0.97×10−8, and 1.44×10−8 mutations per generation per site.34 We tested recombination rates of 5×10−9, 1×10−8, 1.5×10−8 (the default value for human data), 2×10−8, and 2×10−7 recombinations per generation per site.34 We ran the sampler for 3,000 iterations and saved the ancestral recombination graph from every tenth iteration. We used the functions heidel.diag and geweke.diag in the coda library of R, version 3.2.3, to assess convergence diagnostics on the basis of the posterior distribution of the number of recombination events.37 To convert generations into years, we assumed a generation interval of 28 years.38, 39
title	Material and Methods
sec	Ethics Statement This project was excluded from institutional-review-board review by the Office of Human Subjects Research Protections, National Institutes of Health (OHSRP ID# 17-NHGRI-00282).
title	Ethics Statement
p	This project was excluded from institutional-review-board review by the Office of Human Subjects Research Protections, National Institutes of Health (OHSRP ID# 17-NHGRI-00282).
sec	Sequence Data We retrieved whole-genome-sequence data from 2,504 individuals in the 1000 Genomes Project,25 320 individuals in the African Genome Variation Project,26 and 108 individuals from Qatar.27 As previously detailed by the 1000 Genomes Project, a three-stage approach was used to establish phased haplotypes: (1) given array-based genotype data and family information, SHAPEIT2 was used to estimate a scaffold of long-range phased haplotypes, (2) a combination of Beagle and SHAPEIT2 was used to jointly analyze the long-range phased haplotypes and di-allelic variants, and (3) MVNcall was used to place multi-allelic and structural variants onto the haplotype scaffold.25 The average switch error rate was 0.56%, and the average distance between phasing errors was 1,062 kb.25 For comparison, the length of the β-globin locus, including the locus control region, is 67 kb. All data were accessed with VCFtools version 0.1.14.28 We assessed linkage disequilibrium in the 1000 Genomes Project data by using the --hap-r2 function for phased haplotypes.
title	Sequence Data
p	We retrieved whole-genome-sequence data from 2,504 individuals in the 1000 Genomes Project,25 320 individuals in the African Genome Variation Project,26 and 108 individuals from Qatar.27 As previously detailed by the 1000 Genomes Project, a three-stage approach was used to establish phased haplotypes: (1) given array-based genotype data and family information, SHAPEIT2 was used to estimate a scaffold of long-range phased haplotypes, (2) a combination of Beagle and SHAPEIT2 was used to jointly analyze the long-range phased haplotypes and di-allelic variants, and (3) MVNcall was used to place multi-allelic and structural variants onto the haplotype scaffold.25 The average switch error rate was 0.56%, and the average distance between phasing errors was 1,062 kb.25 For comparison, the length of the β-globin locus, including the locus control region, is 67 kb. All data were accessed with VCFtools version 0.1.14.28 We assessed linkage disequilibrium in the 1000 Genomes Project data by using the --hap-r2 function for phased haplotypes.
sec	Measuring Multi-locus Association We calculated Cramér’s V for the association between two nominal variables. In our analyses, the number of levels equaled the number of haplotypes. V2 is the square of the mean canonical correlation and is equivalent to Pearson’s r2 if at least one variable has only two levels. We implemented a bias-corrected version of Cramér’s V given by V˜=φ˜2/m˜, φ˜2=max(0,φ2−((r−1)(c−1)/n−1)), φ2=χ2/n, m˜=min(r˜−1,c˜−1), r˜=r−((r−1)2/n−1), c˜=c−((c−1)2/n−1), where n is the sample size and χ2 is the chi-square statistic without a continuity correction for a contingency table with r rows and c columns.29
title	Measuring Multi-locus Association
p	We calculated Cramér’s V for the association between two nominal variables. In our analyses, the number of levels equaled the number of haplotypes. V2 is the square of the mean canonical correlation and is equivalent to Pearson’s r2 if at least one variable has only two levels. We implemented a bias-corrected version of Cramér’s V given by V˜=φ˜2/m˜, φ˜2=max(0,φ2−((r−1)(c−1)/n−1)), φ2=χ2/n, m˜=min(r˜−1,c˜−1), r˜=r−((r−1)2/n−1), c˜=c−((c−1)2/n−1), where n is the sample size and χ2 is the chi-square statistic without a continuity correction for a contingency table with r rows and c columns.29
sec	African Ancestry We used YFitter to call Y chromosome haplogroups30 and HaploFind to call mitochondrial DNA haplogroups.31 We used projection analysis in ADMIXTURE version 1.332 with a reference panel of 21 global ancestries33 to analyze autosomal ancestry. To determine standard errors for the proportions of ancestral components for each individual, we reran ADMIXTURE with the addition of 200 bootstrap replicates. Accounting for both within and between individual variances, we calculated the proportions for average ancestry by using inverse variance weights. We then calculated 95% confidence intervals for each ancestry and individual, zeroed out any average proportions for which the 95% confidence intervals included 0, and renormalized the remaining averages to sum to 1.
title	African Ancestry
p	We used YFitter to call Y chromosome haplogroups30 and HaploFind to call mitochondrial DNA haplogroups.31 We used projection analysis in ADMIXTURE version 1.332 with a reference panel of 21 global ancestries33 to analyze autosomal ancestry. To determine standard errors for the proportions of ancestral components for each individual, we reran ADMIXTURE with the addition of 200 bootstrap replicates. Accounting for both within and between individual variances, we calculated the proportions for average ancestry by using inverse variance weights. We then calculated 95% confidence intervals for each ancestry and individual, zeroed out any average proportions for which the 95% confidence intervals included 0, and renormalized the remaining averages to sum to 1.
sec	Balancing Selection Let the genotype frequencies of the sickle homozygote, heterozygote, and wild-type homozygote be p2, 2pq, and q2, respectively. Let the corresponding relative fitnesses be 0, 1+s, and 1, respectively.2 Then, at equilibrium, s=(p/1−2p). For each of the five continental African samples in the 1000 Genomes Project Phase 3 release version 5a, we estimated the effective population size Ne on the basis of the heterozygosities of all single-nucleotide polymorphisms (i.e., diallelic, triallelic, and quadrallelic) by assuming a mutation rate of 0.97×10−8 mutations per site per generation.34 We then took the harmonic mean of the five Ne estimates. Assuming one initial copy of the mutant allele, we simulated 1,000 generations under a combination of random genetic drift and balancing selection. We repeated this process 1,000 times.
title	Balancing Selection
p	Let the genotype frequencies of the sickle homozygote, heterozygote, and wild-type homozygote be p2, 2pq, and q2, respectively. Let the corresponding relative fitnesses be 0, 1+s, and 1, respectively.2 Then, at equilibrium, s=(p/1−2p). For each of the five continental African samples in the 1000 Genomes Project Phase 3 release version 5a, we estimated the effective population size Ne on the basis of the heterozygosities of all single-nucleotide polymorphisms (i.e., diallelic, triallelic, and quadrallelic) by assuming a mutation rate of 0.97×10−8 mutations per site per generation.34 We then took the harmonic mean of the five Ne estimates. Assuming one initial copy of the mutant allele, we simulated 1,000 generations under a combination of random genetic drift and balancing selection. We repeated this process 1,000 times.
sec	Phylogenetic Network Analysis Phylogenetic trees are based on models of evolution that assume a purely bifurcating process. This assumption is violated by several evolutionary processes, including recombination, gene conversion, and recurrent mutation. A single phylogenetic tree is unable to represent incompatible signals across sites. In contrast, a phylogenetic network is a more general form of graph that relaxes the assumption of bifurcation. Split decomposition is an inferential method that computes a set of incompatible splits from a given distance matrix. Trivial splits separate a set of taxa into two sets, one set containing a single taxon and the other set containing all other taxa, thereby defining terminal branches. Non-trivial splits separate a set of taxa into two sets, both containing at least two taxa, thereby defining internal branches. A split network depicts incompatible splits as parallel branches. We used the Neighbor-Net method implemented in SplitsTree version 4.13.1 to perform split decomposition analysis of haplotypes.35
title	Phylogenetic Network Analysis
p	Phylogenetic trees are based on models of evolution that assume a purely bifurcating process. This assumption is violated by several evolutionary processes, including recombination, gene conversion, and recurrent mutation. A single phylogenetic tree is unable to represent incompatible signals across sites. In contrast, a phylogenetic network is a more general form of graph that relaxes the assumption of bifurcation. Split decomposition is an inferential method that computes a set of incompatible splits from a given distance matrix. Trivial splits separate a set of taxa into two sets, one set containing a single taxon and the other set containing all other taxa, thereby defining terminal branches. Non-trivial splits separate a set of taxa into two sets, both containing at least two taxa, thereby defining internal branches. A split network depicts incompatible splits as parallel branches. We used the Neighbor-Net method implemented in SplitsTree version 4.13.1 to perform split decomposition analysis of haplotypes.35
sec	Inferring the Ancestral Recombination Graph We used ARGweaver to infer the ancestral recombination graph.36 ARGweaver is based on the standard coalescent model and is sensitive to balancing selection, such that regions under balancing selection have older times to the most recent common ancestor than comparable neutral regions. To account for uncertainty in both the mutation and recombination rates, we used a grid approach. We tested mutation rates of 0.72×10−8, 0.97×10−8, and 1.44×10−8 mutations per generation per site.34 We tested recombination rates of 5×10−9, 1×10−8, 1.5×10−8 (the default value for human data), 2×10−8, and 2×10−7 recombinations per generation per site.34 We ran the sampler for 3,000 iterations and saved the ancestral recombination graph from every tenth iteration. We used the functions heidel.diag and geweke.diag in the coda library of R, version 3.2.3, to assess convergence diagnostics on the basis of the posterior distribution of the number of recombination events.37 To convert generations into years, we assumed a generation interval of 28 years.38, 39
title	Inferring the Ancestral Recombination Graph
p	We used ARGweaver to infer the ancestral recombination graph.36 ARGweaver is based on the standard coalescent model and is sensitive to balancing selection, such that regions under balancing selection have older times to the most recent common ancestor than comparable neutral regions. To account for uncertainty in both the mutation and recombination rates, we used a grid approach. We tested mutation rates of 0.72×10−8, 0.97×10−8, and 1.44×10−8 mutations per generation per site.34 We tested recombination rates of 5×10−9, 1×10−8, 1.5×10−8 (the default value for human data), 2×10−8, and 2×10−7 recombinations per generation per site.34 We ran the sampler for 3,000 iterations and saved the ancestral recombination graph from every tenth iteration. We used the functions heidel.diag and geweke.diag in the coda library of R, version 3.2.3, to assess convergence diagnostics on the basis of the posterior distribution of the number of recombination events.37 To convert generations into years, we assumed a generation interval of 28 years.38, 39
sec	Results Molecular Mapping of Restriction Sites We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334. Distributions of βS and the Classical Haplotypes In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes. Haplotype Classification Based on Linkage Disequilibrium We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers. To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1). Bioinformatic Annotation Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2). Balancing Selection We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations. Phylogenetic Network Analysis We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site. The Ancestral Recombination Graph Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years).
title	Results
sec	Molecular Mapping of Restriction Sites We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334.
title	Molecular Mapping of Restriction Sites
p	We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334.
figure	Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334.
label	Figure 1
caption	Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334.
p	Pairwise Association Plots
p	For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele.
p	(A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334.
p	(B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334.
p	(C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334.
table-wrap	Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334.
label	Table 1
caption	Molecular Characterization of the Classical Sickle Haplotypes
p	Molecular Characterization of the Classical Sickle Haplotypes
table	Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + +
tr	Site
th	Site
tr	3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII
th	3′ ψβ HincII
th	Aγ1 HindIII
th	Gγ1 HindIII
th	5′ ε HincII
tr	Sequencea GTTGAC AAGCTT AAGCTT GTTGAC
td	Sequencea
td	GTTGAC
td	AAGCTT
td	AAGCTT
td	GTTGAC
tr	Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567
td	Range
td	5,260,457–5,260,462
td	5,269,799–5,269,804
td	5,274,717–5,274,722
td	5,291,563–5,291,567
tr	rsID rs968857 rs28440105 rs2070972 rs3834466
td	rsID
td	rs968857
td	rs28440105
td	rs2070972
td	rs3834466
tr	Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564
td	Position (hg19)
td	5,260,458
td	5,269,799
td	5,274,717
td	5,291,563–5,291,564
tr	Senegal + − + −
td	Senegal
td	+
td	−
td	+
td	−
tr	Benin + − − −
td	Benin
td	+
td	−
td	−
td	−
tr	CAR − − + −
td	CAR
td	−
td	−
td	+
td	−
tr	Cameroon + + + −
td	Cameroon
td	+
td	+
td	+
td	−
tr	Arabian/Indian + − + +
td	Arabian/Indian
td	+
td	−
td	+
td	+
tr	r2b 0.000 0.016 0.003 0.020
td	r2b
td	0.000
td	0.016
td	0.003
td	0.020
tr	D′b −0.104 0.930 0.094 −0.853
td	D′b
td	−0.104
td	0.930
td	0.094
td	−0.853
tr	Ancestral T C C G
td	Ancestral
td	T
td	C
td	C
td	G
tr	Status + − − −
td	Status
td	+
td	−
td	−
td	−
tr	Derived C A A GT
td	Derived
td	C
td	A
td	A
td	GT
tr	Status − + + +
td	Status
td	−
td	+
td	+
td	+
table-wrap-foot	a Underlining indicates the polymorphic position.
footnote	a Underlining indicates the polymorphic position.
label	a
p	Underlining indicates the polymorphic position.
table-wrap-foot	b Pairwise linkage disequilibrium values are shown with respect to rs334.
footnote	b Pairwise linkage disequilibrium values are shown with respect to rs334.
label	b
p	Pairwise linkage disequilibrium values are shown with respect to rs334.
sec	Distributions of βS and the Classical Haplotypes In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes.
title	Distributions of βS and the Classical Haplotypes
p	In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria.
table-wrap	Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria.
label	Table 2
caption	Distribution of Classical Sickle Haplotypes
p	Distribution of Classical Sickle Haplotypes
table	Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0
tr	Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical
th	Samplea
th	Arabian/Indian
th	Benin
th	Cameroon
th	CAR
th	Senegal
th	Atypical
tr	ACB 0 4 0 2 3 0
td	ACB
td	0
td	4
td	0
td	2
td	3
td	0
tr	ASW 0 1 1 0 0 0
td	ASW
td	0
td	1
td	1
td	0
td	0
td	0
tr	CLM 0 0 0 1 0 1
td	CLM
td	0
td	0
td	0
td	1
td	0
td	1
tr	ESN 0 18 1 0 5 0
td	ESN
td	0
td	18
td	1
td	0
td	5
td	0
tr	GWD 0 2 0 0 24 0
td	GWD
td	0
td	2
td	0
td	0
td	24
td	0
tr	LWK 1 0 0 19 0 0
td	LWK
td	1
td	0
td	0
td	19
td	0
td	0
tr	MSL 0 3 0 0 17 1
td	MSL
td	0
td	3
td	0
td	0
td	17
td	1
tr	PUR 0 0 0 1 2 0
td	PUR
td	0
td	0
td	0
td	1
td	2
td	0
tr	YRI 0 19 0 0 10 1
td	YRI
td	0
td	19
td	0
td	0
td	10
td	1
tr	Baganda 0 0 0 14 0 0
td	Baganda
td	0
td	0
td	0
td	14
td	0
td	0
tr	Zulu 0 0 0 1 0 0
td	Zulu
td	0
td	0
td	0
td	1
td	0
td	0
table-wrap-foot	a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria.
footnote	a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria.
label	a
p	The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria.
p	In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes.
sec	Haplotype Classification Based on Linkage Disequilibrium We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers. To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1).
title	Haplotype Classification Based on Linkage Disequilibrium
p	We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers.
figure	Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers.
label	Figure 2
caption	Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers.
p	Linkage Disequilibrium with rs334
p	We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers.
p	To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
table-wrap	Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
label	Table 3
caption	Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers
p	Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers
table	Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1
tr	Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical
th	Name
th	Haplotypea
th	Arabian/Indian
th	Benin
th	Cameroon
th	CAR
th	Senegal
th	Atypical
tr	Ancestor 0000 NA NA NA NA NA NA
td	Ancestor
td	0000
td	NA
td	NA
td	NA
td	NA
td	NA
td	NA
tr	HAPA 0010 1 4 2 38 1 1
td	HAPA
td	0010
td	1
td	4
td	2
td	38
td	1
td	1
tr	HAPB 0011 0 2 0 0 1 0
td	HAPB
td	0011
td	0
td	2
td	0
td	0
td	1
td	0
tr	HAPC 0110 0 1 0 0 1 1
td	HAPC
td	0110
td	0
td	1
td	0
td	0
td	1
td	1
tr	HAPD 0111 0 40 0 0 15 0
td	HAPD
td	0111
td	0
td	40
td	0
td	0
td	15
td	0
tr	HAPE 1010 0 0 0 0 43 1
td	HAPE
td	1010
td	0
td	0
td	0
td	0
td	43
td	1
table-wrap-foot	a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
footnote	a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
label	a
p	0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
p	To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
table-wrap	Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
label	Table 4
caption	Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers
p	Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers
table	Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0
tr	Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical
th	Name
th	Haplotypea
th	Arabian/Indian
th	Benin
th	Cameroon
th	CAR
th	Senegal
th	Atypical
tr	Ancestor 0000000000000000000000000010 NA NA NA NA NA NA
td	Ancestor
td	0000000000000000000000000010
td	NA
td	NA
td	NA
td	NA
td	NA
td	NA
tr	HAP1 0000000000000000000010000010 1 0 2 37 0 1
td	HAP1
td	0000000000000000000010000010
td	1
td	0
td	2
td	37
td	0
td	1
tr	HAP2 0000000000000000000010110010 0 4 0 0 0 0
td	HAP2
td	0000000000000000000010110010
td	0
td	4
td	0
td	0
td	0
td	0
tr	HAP3 0000000000000000000010110101 0 0 0 0 1 0
td	HAP3
td	0000000000000000000010110101
td	0
td	0
td	0
td	0
td	1
td	0
tr	HAP4 0000000000000000001110110101 0 2 0 0 0 0
td	HAP4
td	0000000000000000001110110101
td	0
td	2
td	0
td	0
td	0
td	0
tr	HAP5 0000000000000000111110110101 0 5 0 0 0 0
td	HAP5
td	0000000000000000111110110101
td	0
td	5
td	0
td	0
td	0
td	0
tr	HAP6 0000000000000001100011001010 0 0 0 0 3 0
td	HAP6
td	0000000000000001100011001010
td	0
td	0
td	0
td	0
td	3
td	0
tr	HAP7 0000000100000100111110110101 0 1 0 0 0 0
td	HAP7
td	0000000100000100111110110101
td	0
td	1
td	0
td	0
td	0
td	0
tr	HAP8 0000000100000100111110110110 0 0 0 0 1 0
td	HAP8
td	0000000100000100111110110110
td	0
td	0
td	0
td	0
td	1
td	0
tr	HAP9 0000111011111011100011001010 0 0 0 0 1 0
td	HAP9
td	0000111011111011100011001010
td	0
td	0
td	0
td	0
td	1
td	0
tr	HAP10 0001000000000000000010000010 0 0 0 1 0 0
td	HAP10
td	0001000000000000000010000010
td	0
td	0
td	0
td	1
td	0
td	0
tr	HAP11 0001000100000100111110110101 0 2 0 0 1 0
td	HAP11
td	0001000100000100111110110101
td	0
td	2
td	0
td	0
td	1
td	0
tr	HAP12 0011000100000100101110000010 0 0 0 0 1 0
td	HAP12
td	0011000100000100101110000010
td	0
td	0
td	0
td	0
td	1
td	0
tr	HAP13 0011000100000100111110000010 0 1 0 0 0 1
td	HAP13
td	0011000100000100111110000010
td	0
td	1
td	0
td	0
td	0
td	1
tr	HAP14 0011000100000100111110110010 0 0 0 0 1 0
td	HAP14
td	0011000100000100111110110010
td	0
td	0
td	0
td	0
td	1
td	0
tr	HAP15 0011000100000100111110110100 0 2 0 0 1 0
td	HAP15
td	0011000100000100111110110100
td	0
td	2
td	0
td	0
td	1
td	0
tr	HAP16 0011000100000100111110110101 0 23 0 0 7 0
td	HAP16
td	0011000100000100111110110101
td	0
td	23
td	0
td	0
td	7
td	0
tr	HAP17 0011000100000100111110110110 0 6 0 0 5 0
td	HAP17
td	0011000100000100111110110110
td	0
td	6
td	0
td	0
td	5
td	0
tr	HAP18 0011000100000100111110110111 0 1 0 0 0 0
td	HAP18
td	0011000100000100111110110111
td	0
td	1
td	0
td	0
td	0
td	0
tr	HAP19 1100111011111011100010000010 0 0 0 0 3 1
td	HAP19
td	1100111011111011100010000010
td	0
td	0
td	0
td	0
td	3
td	1
tr	HAP20 1100111011111011100011001010 0 0 0 0 36 0
td	HAP20
td	1100111011111011100011001010
td	0
td	0
td	0
td	0
td	36
td	0
table-wrap-foot	a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
footnote	a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
label	a
p	0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined.
p	Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1).
sec	Bioinformatic Annotation Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2).
title	Bioinformatic Annotation
p	Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2).
sec	Balancing Selection We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations.
title	Balancing Selection
p	We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations.
sec	Phylogenetic Network Analysis We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site.
title	Phylogenetic Network Analysis
p	We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site.
figure	Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site.
label	Figure 3
caption	Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site.
p	Split Decomposition Networks of Sickle-Carrying Haplotypes
p	(A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site.
p	(B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site.
sec	The Ancestral Recombination Graph Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years).
title	The Ancestral Recombination Graph
p	Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years).
sec	Discussion There are two models of the origins of the sickle allele. The multicentric model posits five independent occurrences of the same mutation within the last few thousand years. The unicentric model posits a single occurrence and an older age. We used whole-genome-sequence data to provide insight into this issue. Using haplotypic classification and phylogenetic network analysis, we found clear and consistent evidence for a single origin of the sickle mutation. After accounting for recombination, we estimated that the sickle mutation is 259 [123,395] generations old. Starting from first principles of population genetics, we defined haplotypes on the basis of phased sequence data. In contrast, the classical haplotypes were based on patterns of restriction sites seen in individuals with sickle cell anemia.3, 4 By molecularly mapping restriction sites to the sequence data, we found that the restriction sites correlated poorly with rs334, such that none of the canonical sites were included in our sequence-based haplotypes. This discrepancy can be explained by the fact that the RFLP-defining markers circulate on both βS- and βA-carrying chromosomes, even though the restriction sites were originally used for defining types of βS-carrying chromosomes. In contrast, our haplotypes correlate strongly with rs334 and simultaneously correlate moderately with the restriction sites conditional on the presence of the βS allele, thereby capturing the classical patterns. There are four lines of evidence regarding the age of the sickle mutation: historical data, the simulations of balancing selection, the patriline data, and the analysis of the ancestral recombination graph. Although historical records are sparse, the earliest recorded cases of fevers that could have been caused by malaria were ∼5,000 years ago in China and were possibly due to Plasmodium vivax.2, 43, 44 The earliest recorded case of illness that could have been malaria specifically caused by Plasmodium falciparum could have been ∼4,000 years ago in Egypt and Sumer; however, Plasmodium falciparum could have been present in Africa several thousand years earlier.2 The first recorded cases of sickle cell anemia or, more broadly, sickle cell disease were in Egypt during the predynastic period (∼3200 BC45), in the Persian Gulf during the Hellenistic period (2,130 years before present46), and in Ghana in 1670 AD.47 The existence of the sickle allele in predynastic Egypt constrains the lower bound of the age of the sickle mutation to be 5,200 years. On the basis of these limited data, it is historically plausible that the selective environment preceded the sickle mutation, consistent with our balancing-selection simulations, which showed that the sickle allele would have been lost almost immediately without a heterozygote advantage (under the assumption of recessive lethality). The patriline data corroborate the age of the sickle mutation. Given a single origin, the sickle mutation most likely arose either in a population containing both E1b1a1a1f-L485 (also known as E1b1a1a1a1c-L485) and E1b1a1a1g-U175 (also known as E1b1a1a1a2a-U175) or in a population containing their common ancestor. Thus, the time of origin of the common ancestor of these two haplogroups is a plausible upper bound on the age of the sickle mutation. The common ancestor E1b1a1a1a-M4732 arose 10,500 (95% confidence interval between 9,200 and 12,000) years ago, consistent with the upper credible interval of 11,100 years ago that we estimated for the age of the sickle mutation. The Bantu expansions started ∼5,000 years ago and took ∼2,000 years to cross from present-day Cameroon to the Great Lakes region.48 The wide credible interval (3,400 to 11,100 years) of the age of the sickle mutation does not rule out the possibility that the mutation coincided with the onset of the Bantu expansions. However, our results support the hypothesis that the origin of the βS allele predated the onset of the Bantu expansions. First, the βS allele could have been at or near its equilibrium frequency when the Bantu expansions occurred ∼5,000 years ago, according to our estimation that the sickle mutation originated ∼7,300 years ago and that balancing selection ensued immediately and reached equilibrium in ∼2,400 years in an environment where malaria preexisted. Second, the likelihood that the βS allele is carried by migrating individuals increases with the frequency of that allele in the source population. Our results provide some suggestions as to where the sickle mutation might have originated. Descendants of the Y chromosome haplogroup E1b1a-V38 migrated across the Sahara from east to west,49 possibly around 19,000 years ago.50 E1b1a1-M2 most likely did not originate in eastern or northeastern Africa, but where it originated in either western or central Africa is unclear.49 Accordingly, the sickle mutation most likely did not occur in eastern or northeastern Africa. Our results indicate that the origin of the sickle mutation was in the middle of the Holocene Wet Phase, or Neolithic Subpluvial, which lasted from ∼7,500–7,000 BC to ∼3,500–3,000 BC. This time was the most recent of the Green Sahara periods, during which the Sahara experienced wet and rainy conditions.51 Our results thus support the Green Sahara as a possible place of origin of the sickle mutation. An alternative hypothesis is that the sickle allele arose in west-central Africa,13, 52 possibly in the northwestern portion of the equatorial rainforest. Our results also provide some suggestions as to where the three clusters might have originated. Two splits occurred early in the original βS-carrying population. The first split defined one cluster containing HAP1 and accounting for the Cameroon and CAR haplotypes. It is plausible that HAP1 was carried from an area in or around present-day Cameroon to the area that is presently the CAR, as well as to areas east and south, as part of the Bantu expansions. However, the Bantu expansions did not extend west and north. The second split subsequently separated the clusters containing HAP16 and HAP20, the modal haplotypes accounting for the Benin and Senegal haplotypes, respectively. HAP16 and HAP20 shared one derived mutation, consistent with an early split. Furthermore, given the subsequent accumulation of derived mutations not shared between HAP16 and HAP20, effectively no gene flow occurred between these two descendant populations, consistent with geographic separation. We therefore hypothesize that the common ancestor of these two clusters existed north of Cameroon among non-Bantu-speaking peoples in or around present-day Nigeria. From this common ancestral population, a group of migrants separated and traveled west and north to the area around present-day Senegal and the Gambia. These migrants could have taken a coastal or an inland route. The finding that the Senegal haplotype was the predominant haplotype in the sample of Mende from Sierra Leone is consistent with a coastal route. We do not have data to investigate an inland route, but we note that the frequency of the sickle allele is higher along the coast than inland.53 Classical haplotypes in eastern Arabia tend to have the Arabian/Indian designation, whereas those in western Arabia tend to have the Benin designation.18, 21 The Arabian/Indian haplotype has been hypothesized to have originated in either east Saudi Arabia or India.5 Although our samples include only one predicted instance of the Arabian/Indian haplotype, the occurrence of this haplotype in the Luhya in Kenya and its clustering with the predominant haplotype found in Kenya and Uganda are consistent with the hypothesis that the Arabian/Indian haplotype originated in Africa and had an overseas migration route from eastern Africa to eastern Arabia and India.13, 19 In contrast, the absence of the Benin haplotype in the Luhya in Kenya and the Baganda in Uganda provides evidence against an overseas migration route from eastern Africa to western Arabia. Instead, the presence of the Benin haplotype in western Arabia is consistent with an African origin and an overland migration route through northeast Africa.33 Together, our results suggest the following evolutionary history of the sickle mutation. The presence of African ancestry, an African patriline, and/or an African matriline in all sickle carriers, combined with the absence of Arabian or Indian ancestries in the five continental African samples in the 1000 Genomes Project,33 supports an African origin of the sickle mutation. The sickle mutation occurred once approximately 7,300 years ago either in the Sahara or in west-central Africa. In an environment where malaria preexisted, balancing selection took approximately 2,400 years to drive the βS allele to an equilibrium frequency of 12.0%. A population split occurred, possibly in the area around present-day Cameroon. Starting approximately 5,000 years ago, the Bantu expansions resulted in the spread of HAP1 to the CAR, the Great Lakes region, and South Africa. After acquiring one derived allele, the population split into one carrying a haplotype that evolved into HAP16 and another carrying a haplotype that evolved into HAP20. In both clusters, hitchhiking of the derived alleles with the βS allele resulted in similar derived allele frequencies. Our study includes sequence data from western, west-central, and eastern Africa but lacks comparable data from northern, central, and southern Africa. A haplotype has been identified among Sudanese individuals that was classified as atypical but might be a fifth African haplotype.54 It is possible that sampling additional populations could provide more evidence regarding when, where, and in whom the sickle mutation arose. Our findings support classification of sickle haplotypes based on three clusters. Notably, we found that the Senegal haplotype is substructured into two clusters, one containing only Senegal haplotypes and one containing Benin and Senegal haplotypes. This substructuring of haplotypes might have confounded previous assessments of clinical phenotype or disease severity. At the same time, the identification of markers linked to the sickle allele provides an opportunity to investigate possible cis-effects, such as differential expression of HBD, HBE1, and HBG2, on disease severity.
title	Discussion
p	There are two models of the origins of the sickle allele. The multicentric model posits five independent occurrences of the same mutation within the last few thousand years. The unicentric model posits a single occurrence and an older age. We used whole-genome-sequence data to provide insight into this issue. Using haplotypic classification and phylogenetic network analysis, we found clear and consistent evidence for a single origin of the sickle mutation. After accounting for recombination, we estimated that the sickle mutation is 259 [123,395] generations old.
p	Starting from first principles of population genetics, we defined haplotypes on the basis of phased sequence data. In contrast, the classical haplotypes were based on patterns of restriction sites seen in individuals with sickle cell anemia.3, 4 By molecularly mapping restriction sites to the sequence data, we found that the restriction sites correlated poorly with rs334, such that none of the canonical sites were included in our sequence-based haplotypes. This discrepancy can be explained by the fact that the RFLP-defining markers circulate on both βS- and βA-carrying chromosomes, even though the restriction sites were originally used for defining types of βS-carrying chromosomes. In contrast, our haplotypes correlate strongly with rs334 and simultaneously correlate moderately with the restriction sites conditional on the presence of the βS allele, thereby capturing the classical patterns.
p	There are four lines of evidence regarding the age of the sickle mutation: historical data, the simulations of balancing selection, the patriline data, and the analysis of the ancestral recombination graph. Although historical records are sparse, the earliest recorded cases of fevers that could have been caused by malaria were ∼5,000 years ago in China and were possibly due to Plasmodium vivax.2, 43, 44 The earliest recorded case of illness that could have been malaria specifically caused by Plasmodium falciparum could have been ∼4,000 years ago in Egypt and Sumer; however, Plasmodium falciparum could have been present in Africa several thousand years earlier.2 The first recorded cases of sickle cell anemia or, more broadly, sickle cell disease were in Egypt during the predynastic period (∼3200 BC45), in the Persian Gulf during the Hellenistic period (2,130 years before present46), and in Ghana in 1670 AD.47 The existence of the sickle allele in predynastic Egypt constrains the lower bound of the age of the sickle mutation to be 5,200 years. On the basis of these limited data, it is historically plausible that the selective environment preceded the sickle mutation, consistent with our balancing-selection simulations, which showed that the sickle allele would have been lost almost immediately without a heterozygote advantage (under the assumption of recessive lethality).
p	The patriline data corroborate the age of the sickle mutation. Given a single origin, the sickle mutation most likely arose either in a population containing both E1b1a1a1f-L485 (also known as E1b1a1a1a1c-L485) and E1b1a1a1g-U175 (also known as E1b1a1a1a2a-U175) or in a population containing their common ancestor. Thus, the time of origin of the common ancestor of these two haplogroups is a plausible upper bound on the age of the sickle mutation. The common ancestor E1b1a1a1a-M4732 arose 10,500 (95% confidence interval between 9,200 and 12,000) years ago, consistent with the upper credible interval of 11,100 years ago that we estimated for the age of the sickle mutation.
p	The Bantu expansions started ∼5,000 years ago and took ∼2,000 years to cross from present-day Cameroon to the Great Lakes region.48 The wide credible interval (3,400 to 11,100 years) of the age of the sickle mutation does not rule out the possibility that the mutation coincided with the onset of the Bantu expansions. However, our results support the hypothesis that the origin of the βS allele predated the onset of the Bantu expansions. First, the βS allele could have been at or near its equilibrium frequency when the Bantu expansions occurred ∼5,000 years ago, according to our estimation that the sickle mutation originated ∼7,300 years ago and that balancing selection ensued immediately and reached equilibrium in ∼2,400 years in an environment where malaria preexisted. Second, the likelihood that the βS allele is carried by migrating individuals increases with the frequency of that allele in the source population.
p	Our results provide some suggestions as to where the sickle mutation might have originated. Descendants of the Y chromosome haplogroup E1b1a-V38 migrated across the Sahara from east to west,49 possibly around 19,000 years ago.50 E1b1a1-M2 most likely did not originate in eastern or northeastern Africa, but where it originated in either western or central Africa is unclear.49 Accordingly, the sickle mutation most likely did not occur in eastern or northeastern Africa. Our results indicate that the origin of the sickle mutation was in the middle of the Holocene Wet Phase, or Neolithic Subpluvial, which lasted from ∼7,500–7,000 BC to ∼3,500–3,000 BC. This time was the most recent of the Green Sahara periods, during which the Sahara experienced wet and rainy conditions.51 Our results thus support the Green Sahara as a possible place of origin of the sickle mutation. An alternative hypothesis is that the sickle allele arose in west-central Africa,13, 52 possibly in the northwestern portion of the equatorial rainforest.
p	Our results also provide some suggestions as to where the three clusters might have originated. Two splits occurred early in the original βS-carrying population. The first split defined one cluster containing HAP1 and accounting for the Cameroon and CAR haplotypes. It is plausible that HAP1 was carried from an area in or around present-day Cameroon to the area that is presently the CAR, as well as to areas east and south, as part of the Bantu expansions. However, the Bantu expansions did not extend west and north. The second split subsequently separated the clusters containing HAP16 and HAP20, the modal haplotypes accounting for the Benin and Senegal haplotypes, respectively. HAP16 and HAP20 shared one derived mutation, consistent with an early split. Furthermore, given the subsequent accumulation of derived mutations not shared between HAP16 and HAP20, effectively no gene flow occurred between these two descendant populations, consistent with geographic separation. We therefore hypothesize that the common ancestor of these two clusters existed north of Cameroon among non-Bantu-speaking peoples in or around present-day Nigeria. From this common ancestral population, a group of migrants separated and traveled west and north to the area around present-day Senegal and the Gambia. These migrants could have taken a coastal or an inland route. The finding that the Senegal haplotype was the predominant haplotype in the sample of Mende from Sierra Leone is consistent with a coastal route. We do not have data to investigate an inland route, but we note that the frequency of the sickle allele is higher along the coast than inland.53
p	Classical haplotypes in eastern Arabia tend to have the Arabian/Indian designation, whereas those in western Arabia tend to have the Benin designation.18, 21 The Arabian/Indian haplotype has been hypothesized to have originated in either east Saudi Arabia or India.5 Although our samples include only one predicted instance of the Arabian/Indian haplotype, the occurrence of this haplotype in the Luhya in Kenya and its clustering with the predominant haplotype found in Kenya and Uganda are consistent with the hypothesis that the Arabian/Indian haplotype originated in Africa and had an overseas migration route from eastern Africa to eastern Arabia and India.13, 19 In contrast, the absence of the Benin haplotype in the Luhya in Kenya and the Baganda in Uganda provides evidence against an overseas migration route from eastern Africa to western Arabia. Instead, the presence of the Benin haplotype in western Arabia is consistent with an African origin and an overland migration route through northeast Africa.33
p	Together, our results suggest the following evolutionary history of the sickle mutation. The presence of African ancestry, an African patriline, and/or an African matriline in all sickle carriers, combined with the absence of Arabian or Indian ancestries in the five continental African samples in the 1000 Genomes Project,33 supports an African origin of the sickle mutation. The sickle mutation occurred once approximately 7,300 years ago either in the Sahara or in west-central Africa. In an environment where malaria preexisted, balancing selection took approximately 2,400 years to drive the βS allele to an equilibrium frequency of 12.0%. A population split occurred, possibly in the area around present-day Cameroon. Starting approximately 5,000 years ago, the Bantu expansions resulted in the spread of HAP1 to the CAR, the Great Lakes region, and South Africa. After acquiring one derived allele, the population split into one carrying a haplotype that evolved into HAP16 and another carrying a haplotype that evolved into HAP20. In both clusters, hitchhiking of the derived alleles with the βS allele resulted in similar derived allele frequencies.
p	Our study includes sequence data from western, west-central, and eastern Africa but lacks comparable data from northern, central, and southern Africa. A haplotype has been identified among Sudanese individuals that was classified as atypical but might be a fifth African haplotype.54 It is possible that sampling additional populations could provide more evidence regarding when, where, and in whom the sickle mutation arose.
p	Our findings support classification of sickle haplotypes based on three clusters. Notably, we found that the Senegal haplotype is substructured into two clusters, one containing only Senegal haplotypes and one containing Benin and Senegal haplotypes. This substructuring of haplotypes might have confounded previous assessments of clinical phenotype or disease severity. At the same time, the identification of markers linked to the sickle allele provides an opportunity to investigate possible cis-effects, such as differential expression of HBD, HBE1, and HBG2, on disease severity.
back	Web Resources 1000 Genomes Project, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ African Genome Variation Project, https://www.ebi.ac.uk/ega/studies/EGAS00001000959 Ensembl, http://www.ensembl.org/Homo_sapiens/Info/Index E YTree, https://www.yfull.com/tree/E Qatar sequence data, Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/?term=SRP060765 OMIM, http://www.omim.org Supplemental Data Document S1. Figure S1 Spreadsheet S1. Tables S1 and S2 Document S2. Article plus Supplemental Data Acknowledgments We acknowledge the assistance of Neil Hanchard with the restriction site data. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official view of the National Institutes of Health. This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (1ZIAHG200362). Supplemental Data include one figure and two tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.02.003.
sec	Web Resources 1000 Genomes Project, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ African Genome Variation Project, https://www.ebi.ac.uk/ega/studies/EGAS00001000959 Ensembl, http://www.ensembl.org/Homo_sapiens/Info/Index E YTree, https://www.yfull.com/tree/E Qatar sequence data, Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/?term=SRP060765 OMIM, http://www.omim.org
title	Web Resources
p	1000 Genomes Project, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ African Genome Variation Project, https://www.ebi.ac.uk/ega/studies/EGAS00001000959 Ensembl, http://www.ensembl.org/Homo_sapiens/Info/Index E YTree, https://www.yfull.com/tree/E Qatar sequence data, Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/?term=SRP060765 OMIM, http://www.omim.org
p	1000 Genomes Project, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
p	African Genome Variation Project, https://www.ebi.ac.uk/ega/studies/EGAS00001000959
p	Ensembl, http://www.ensembl.org/Homo_sapiens/Info/Index
p	E YTree, https://www.yfull.com/tree/E
p	Qatar sequence data, Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/?term=SRP060765
p	OMIM, http://www.omim.org
sec	Supplemental Data Document S1. Figure S1 Spreadsheet S1. Tables S1 and S2 Document S2. Article plus Supplemental Data
title	Supplemental Data
p	Document S1. Figure S1 Spreadsheet S1. Tables S1 and S2 Document S2. Article plus Supplemental Data
caption	Document S1. Figure S1
title	Document S1. Figure S1
caption	Spreadsheet S1. Tables S1 and S2
title	Spreadsheet S1. Tables S1 and S2
caption	Document S2. Article plus Supplemental Data
title	Document S2. Article plus Supplemental Data
ack	Acknowledgments We acknowledge the assistance of Neil Hanchard with the restriction site data. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official view of the National Institutes of Health. This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (1ZIAHG200362).
title	Acknowledgments
p	We acknowledge the assistance of Neil Hanchard with the restriction site data. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official view of the National Institutes of Health. This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (1ZIAHG200362).
footnote	Supplemental Data include one figure and two tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.02.003.
p	Supplemental Data include one figure and two tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.02.003.

Annnotations TAB TSV DIC JSON TextAE

Denotations: 0
Blocks: 0
Relations: 0

PMC:5985360 / 12088-14146 JSONTXT

Document structure show

Annnotations TAB TSV DIC JSON TextAE

PMC:5985360 / 12088-14146 JSON TXT