Results Molecular Mapping of Restriction Sites We mapped 15 restriction sites, including the seven canonical sites, to the reference human genome sequence. We identified 12 known markers that predict the presence or absence of ten of these sites. Of the canonical sites, we predicted 5′ ε HincII with rs3834466, Gγ1 HindIII with rs2070972, Aγ1 HindIII with rs28440105, and 3′ ψβ HincII with rs968857 (Table 1); similar results were described in a previously reported analysis of rs3834466, rs28440105, rs10128556, and rs968857.40 Pairwise correlation (measured via r2) between these variants and rs334 was weak to nonexistent (Table 1). On the basis of these four RFLP-predicting markers, we identified ten unique haplotypes. V˜2 between the set of ten haplotypes and rs334 was 0.079 (Figure 1). Figure 1 Pairwise Association Plots For each triangle, vertices indicate markers, and edges are labeled with V˜2 association values. The associations on the bottom edges are conditional on the presence of the βS or βA allele. (A) Associations between rs334, the RFLP-predicting markers, and the one marker with pairwise r2≥0.4 with rs334. (B) Associations between rs334, the RFLP-predicting markers, and the set of three markers with pairwise r2≥0.3 with rs334. (C) Associations between rs334, the RFLP-predicting markers, and the set of 27 markers with pairwise r2≥0.2 with rs334. Table 1 Molecular Characterization of the Classical Sickle Haplotypes Site 3′ ψβ HincII Aγ1 HindIII Gγ1 HindIII 5′ ε HincII Sequencea GTTGAC AAGCTT AAGCTT GTTGAC Range 5,260,457–5,260,462 5,269,799–5,269,804 5,274,717–5,274,722 5,291,563–5,291,567 rsID rs968857 rs28440105 rs2070972 rs3834466 Position (hg19) 5,260,458 5,269,799 5,274,717 5,291,563–5,291,564 Senegal + − + − Benin + − − − CAR − − + − Cameroon + + + − Arabian/Indian + − + + r2b 0.000 0.016 0.003 0.020 D′b −0.104 0.930 0.094 −0.853 Ancestral T C C G Status + − − − Derived C A A GT Status − + + + a Underlining indicates the polymorphic position. b Pairwise linkage disequilibrium values are shown with respect to rs334. Distributions of βS and the Classical Haplotypes In the 1000 Genomes Project data, we identified 137 sickle carriers and 0 sickle homozygotes; we predicted the classical haplotypes for all 137 carriers (Table 2). The Benin haplotype was the predominant haplotype in the samples of Esan and Yoruba from Nigeria, the CAR haplotype was the predominant haplotype in the sample of Luhya from Kenya, and the Senegal haplotype was the predominant haplotype in the samples of Mende from Sierra Leone and Mandinka from the Gambia. There was limited representation of the Arabian/Indian (one) and Cameroon (two) haplotypes. The average sickle allele frequency was 12.0% and did not statistically differ among the five continental African samples (χ42=1.48, p=0.830; Table S1). The distribution of matrilines comprised 1 A2, 2 B2, 1 J2, 8 L0, 24 L1, 43 L2, 49 L3, 3 L4, 2 L5, 1 T2, and 3 U6 haplogroups. The distribution of patrilines comprised 2 A1a, 5 E1a, 54 E1b1a, 1 E1b1b, 2 E2b, 1 G2a, 1 I2a, and 2 R1b haplogroups. Of the 54 E1b1a, 29 were E1b1a1a1f and 23 were E1b1a1a1g. Table 2 Distribution of Classical Sickle Haplotypes Samplea Arabian/Indian Benin Cameroon CAR Senegal Atypical ACB 0 4 0 2 3 0 ASW 0 1 1 0 0 0 CLM 0 0 0 1 0 1 ESN 0 18 1 0 5 0 GWD 0 2 0 0 24 0 LWK 1 0 0 19 0 0 MSL 0 3 0 0 17 1 PUR 0 0 0 1 2 0 YRI 0 19 0 0 10 1 Baganda 0 0 0 14 0 0 Zulu 0 0 0 1 0 0 a The population codes are as follows: ACB, African Caribbean in Barbados; ASW, People with African Ancestry in Southwest USA; CLM, Colombians in Medellín, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division, Mandinka; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; PUR, Puerto Ricans in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. In the African Genome Variation Project data, we identified 14 sickle carriers in the Baganda and one sickle carrier in the Zulu. We predicted that all 15 of these individuals carried the CAR haplotype (Table 2). In the Qatar sample, we identified four sickle carriers, all with insufficient information to predict the haplotypes. Haplotype Classification Based on Linkage Disequilibrium We defined haplotypes centered on rs334 in the 504 continental Africans from the 1000 Genomes Project. We recorded pairwise linkage disequilibrium (LD) between rs334 and all phased, diallelic markers on chromosome 11 (Figure 2). The largest value of r2 with rs334 was 0.407. There was only one marker, rs149481026, with r2≥0.4. This one marker was more strongly associated with rs334 (V˜2=0.406) than the set of four RFLP-predicting markers was (Figure 1A). Figure 2 Linkage Disequilibrium with rs334 We calculated pairwise r2 with rs334 across chromosome 11 among 504 continental Africans from the 1000 Genomes Project. We plotted r2 for all 4,024,958 phased diallelic markers. To strengthen the association with rs334, we investigated lower levels of r2. There were three markers with r2≥0.3: rs183055323, rs149481026, and rs73404549. These three markers ranged across 132.6 kb, including the entire β-globin cluster. This interval was substantially smaller than the average distance between phasing errors. On the basis of these three markers, we identified five unique haplotypes (Table 3). One haplotype contained all occurrences of the Arabian/Indian, Cameroon, and CAR haplotypes. This haplotype contained the ancestral allele at all three markers. The Senegal haplotype was distributed across all five haplotypes, and the Benin haplotype was distributed across four of the five haplotypes. V˜2 between these three markers and rs334 was 0.753 (Figure 1B). Table 3 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using Three Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000 NA NA NA NA NA NA HAPA 0010 1 4 2 38 1 1 HAPB 0011 0 2 0 0 1 0 HAPC 0110 0 1 0 0 1 1 HAPD 0111 0 40 0 0 15 0 HAPE 1010 0 0 0 0 43 1 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. To improve cross-classification with the Senegal and Benin haplotypes, we identified 27 markers with r2≥0.2. These markers extended across 725.3 kb, which is still less than the average distance between phasing errors. On the basis of these markers, we identified 59 unique haplotypes, of which 18 carried the sickle allele at rs334. A 19th sickle haplotype was observed once in the ACB sample, and a 20th sickle haplotype was observed once in the Baganda sample (Table 4). V˜2 between these 27 markers and rs334 was 0.728 (Figure 1C). The most common haplotype carried the ancestral allele at all 28 sites and accounted for 68.5% of all haplotypes in the continental Africans. Globally, the ancestral haplotype had a frequency of 91.9%, was the most frequent haplotype in all 26 samples in the 1000 Genomes Project, and was the only haplotype in 15 of those samples, including all ten samples from East Asia and Europe. 13 of the sickle haplotypes in the Baganda, the one in the Zulu, and all four in the Qatari were identical to HAP1, the haplotype most commonly designated CAR (Table 4). Additionally, the four Qatari carriers had (1) >7.8% autosomal African ancestry, (2) an African mitochondrial haplogroup, or (3) an African Y chromosome haplogroup. The three most common haplotypes (HAP1, HAP16, and HAP20) correlated primarily with the CAR, Benin, and Senegal haplotypes, respectively (Table 4). Table 4 Distribution of Sickle Haplotypes under a Sequence-Based Classification Scheme Using 27 Markers Name Haplotypea Arabian/Indian Benin Cameroon CAR Senegal Atypical Ancestor 0000000000000000000000000010 NA NA NA NA NA NA HAP1 0000000000000000000010000010 1 0 2 37 0 1 HAP2 0000000000000000000010110010 0 4 0 0 0 0 HAP3 0000000000000000000010110101 0 0 0 0 1 0 HAP4 0000000000000000001110110101 0 2 0 0 0 0 HAP5 0000000000000000111110110101 0 5 0 0 0 0 HAP6 0000000000000001100011001010 0 0 0 0 3 0 HAP7 0000000100000100111110110101 0 1 0 0 0 0 HAP8 0000000100000100111110110110 0 0 0 0 1 0 HAP9 0000111011111011100011001010 0 0 0 0 1 0 HAP10 0001000000000000000010000010 0 0 0 1 0 0 HAP11 0001000100000100111110110101 0 2 0 0 1 0 HAP12 0011000100000100101110000010 0 0 0 0 1 0 HAP13 0011000100000100111110000010 0 1 0 0 0 1 HAP14 0011000100000100111110110010 0 0 0 0 1 0 HAP15 0011000100000100111110110100 0 2 0 0 1 0 HAP16 0011000100000100111110110101 0 23 0 0 7 0 HAP17 0011000100000100111110110110 0 6 0 0 5 0 HAP18 0011000100000100111110110111 0 1 0 0 0 0 HAP19 1100111011111011100010000010 0 0 0 0 3 1 HAP20 1100111011111011100011001010 0 0 0 0 36 0 a 0 indicates the reference allele, and 1 indicates the alternate allele according to the coding scheme in the 1000 Genomes Project VCF files. The sickle site rs334 is underlined. Given that our sets of 1, 3, and 27 markers more strongly associated with rs334 than did the set of RFLP-predicting markers, we assessed association between our sets of markers and the set of RFLP-predicting markers. Our sets of markers were moderately associated with the set of RFLP-predicting markers when conditioned on the presence of βS and weakly associated when conditioned on the presence of βA (Figure 1). Bioinformatic Annotation Different haplotypes might be associated with different clinical phenotypes or disease severity.41 Using Ensembl and HaploReg version 4.1,42 we annotated each of the 27 markers in linkage disequilibrium at r2≥0.2 with rs334 (Table S1). Nine sites were marked on histones as promoters or enhancers, possibly implicating differential expression of HBB, HBD (MIM: 142000), HBE1 (MIM: 142100), or HBG2 (MIM: 142250). Three of these nine sites, rs334, rs73402608, and rs112336602, were also annotated as bound by proteins, possibly implicating the same four genes. In addition, rs1039215 is correlated with gene expression, most strongly with HBG2 (Table S2). Balancing Selection We modeled balancing selection assuming that the relative fitness of the βA/βA homozygote was 1, the relative fitness of the βS/βS homozygote was 0, and the relative fitness of the βA/βS heterozygote was 1+s. On the basis of the 504 continental Africans from the 1000 Genomes Project, we estimated that s=0.158, which was in agreement with previous estimates.14 Next, by assuming a single initial copy and an effective population size Ne=25,542, we modeled random genetic drift plus balancing selection to estimate how many generations it would take for an equilibrium frequency of 12.0% to be reached. We found that the mutant allele was lost 74.6% of the time and, conditional on reaching equilibrium, reached a frequency of 12.0% after an average of 87 (95% confidence interval [68,124]) generations, or approximately 2,400 (95% confidence interval from 1,900 to 3,500) years. We stress that this value is not the age of the sickle mutation nor the age since the onset of balancing selection but rather the time to reach a frequency of 12.0%. To determine the fate of the mutant allele in the absence of heterozygote advantage, we repeated the simulation while assuming s=0. We found that the mutant allele was lost after an average of 12 generations (95% confidence interval [1,92]), with a median of two generations. Phylogenetic Network Analysis We used split decomposition analysis to infer the phylogeny of the 20 sickle haplotypes, whereby the phylogeny was rooted by the ancestral haplotype. (Figure 3). The network revealed that the sickle mutation occurred once in the background of the ancestral haplotype and gave rise to HAP1, which is associated predominantly with the CAR haplotype. One cluster containing HAP1 and HAP10 accounted for all occurrences of the Cameroon, CAR, and Arabian/Indian haplotypes (Table 4). Two additional clusters were derived from this haplotype and not from the ancestral haplotype. One cluster contained HAP6, HAP9, HAP19, and HAP20; all associated with the Senegal haplotype (Table 4). In this cluster, the modal haplotype, HAP20, differed from HAP1 by 15 derived mutations (Table 4 and Table S1). This cluster accounted for 70.5% of occurrences of the Senegal haplotype. The other cluster contained the remaining 14 haplotypes, accounting for all occurrences of the Benin haplotype and 29.5% of the occurrences of the Senegal haplotype. In this cluster, the modal haplotype, HAP16, differed from HAP1 by 13 derived mutations (Table 4; Table S1). Other than the sickle mutation, only one derived mutation, at rs10655224, was shared between HAP16 and HAP20. The remaining 14 derived mutations in HAP20 had an average frequency of 11.3% in the Gambian in Western Divisions in the Gambia (GWD) and Mende in Sierra Leone (MSL) samples and 0.1% in the Esan in Nigeria (ESN) and Yoruba in Ibadan, Nigeria (YRI) samples (Table S1). Conversely, the remaining 12 derived mutations in HAP16 had an average frequency of 13.4% in the ESN and YRI samples and 2.2% in the GWD and MSL samples (Table S1). These 26 derived alleles had an average frequency of 0.4% in the Luhya in Webuye, Kenya (LWK) sample (Table S1). Figure 3 Split Decomposition Networks of Sickle-Carrying Haplotypes (A) Network of 20 distinct sickle-carrying haplotypes, rooted by the ancestral haplotype. The haplotypes are defined in Table 4. The single branch leading from the ancestral root is the only branch to which the sickle mutation contributes, indicating a single origin of the sickle mutation. The scale bar represents 0.01 mutations/site. (B) The subnetwork showing the ancestral root and the three modal haplotypes. This subnetwork emphasizes that HAP16 and HAP20 share a common ancestor and that this common ancestor is derived from HAP1. The scale bar represents 0.1 mutations/site. The Ancestral Recombination Graph Using coalescent theory, we sampled the posterior distribution of the ancestral recombination graph with the 1,008 haplotypes, including 121 sickle haplotypes, from the 504 continental Africans from the 1000 Genomes Project. Over a grid of 15 pairs of mutation and recombination rates, a mutation rate of 0.97×10−8 mutations per generation per site and a recombination rate of 1.5×10−8 recombinations per generation per site yielded the best fit to the data. Given these two rates, the trace of the posterior number of recombination events visually indicated convergence to a stationary distribution (Figure S1). More formally, Geweke’s diagnostic indicated that the sampled values came from a stationary distribution (p=0.063). Heidelberger and Welch’s diagnostic also indicated that the sampled values came from a stationary distribution (p=0.163) and further indicated that there was no need to discard initial iterations. We estimated the age of the sickle mutation as 259 (95% credible interval [123,395]) generations, or approximately 7,300 years (95% credible interval from 3,400 to 11,100 years).