PMC:1271393
The Heritage of Pathogen Pressures and Ancient Demography in the Human Innate-Immunity CD209/CD209L Region
Abstract
The innate immunity system constitutes the first line of host defense against pathogens. Two closely related innate immunity genes, CD209 and CD209L, are particularly interesting because they directly recognize a plethora of pathogens, including bacteria, viruses, and parasites. Both genes, which result from an ancient duplication, possess a neck region, made up of seven repeats of 23 amino acids each, known to play a major role in the pathogen-binding properties of these proteins. To explore the extent to which pathogens have exerted selective pressures on these innate immunity genes, we resequenced them in a group of samples from sub-Saharan Africa, Europe, and East Asia. Moreover, variation in the number of repeats of the neck region was defined in the entire Human Genome Diversity Panel for both genes. Our results, which are based on diversity levels, neutrality tests, population genetic distances, and neck-region length variation, provide genetic evidence that CD209 has been under a strong selective constraint that prevents accumulation of any amino acid changes, whereas CD209L variability has most likely been shaped by the action of balancing selection in non-African populations. In addition, our data point to the neck region as the functional target of such selective pressures: CD209 presents a constant size in the neck region populationwide, whereas CD209L presents an excess of length variation, particularly in non-African populations. An additional interesting observation came from the coalescent-based CD209 gene tree, whose binary topology and time depth (∼2.8 million years ago) are compatible with an ancestral population structure in Africa. Altogether, our study has revealed that even a short segment of the human genome can uncover an extraordinarily complex evolutionary history, including different pathogen pressures on host genes as well as traces of admixture among archaic hominid populations.
Introduction
Infectious diseases have been paramount among the threats to health and survival for most of human evolutionary history (Haldane 1949; Lederberg 1999; Harpending and Rogers 2000; Cooke and Hill 2001). The interaction of the human host with a wide variety of pathogens has been accompanied by genetic adaptations to spatially and temporally fluctuating selective pressures imposed by the infectious agents. Numerous studies have sought the genetic imprint of natural selection imposed by pathogen pressures in human genes involved in immune response or, more generally, in host-pathogen interactions (Vallender and Lahn 2004). For example, natural selection has acted on such genes as MHC, β-globin, G6PD, IL-2, IL-4, TNFSF5, the Duffy blood group genes, and CCR5 (Ohta 1991; Hughes et al. 1994; Flint et al. 1998; Hamblin and Di Rienzo 2000; Tishkoff et al. 2001; Bamshad et al. 2002; Sabeti et al. 2002; Verrelli et al. 2002). However, little is known about genetic variation of genes involved in direct recognition of pathogens, or pathogens' products, and virtually no studies have investigated the extent to which pathogens have exerted selective pressures on the innate immune system.
The phylogenetically ancient innate immune system governs the initial detection of pathogens and stimulates the first line of host defense (Medzhitov and Janeway 1998a, 2000, 2002; Janeway and Medzhitov 2002). Recognition of pathogens is mediated by phagocytic cells through germline-encoded receptors, known as “pattern recognition receptors,” which detect pathogen-associated molecular patterns that are characteristic products of microbial physiology (Kimbrell and Beutler 2001; Janeway and Medzhitov 2002). This initial interaction is then translated into a set of endogenous signals that ultimately lead to the induction of the adaptive immune response (Medzhitov and Janeway 1998b).
In recent years, the C-type lectin receptors have received much attention in the area of innate immunology, the results of which were novel functional insights into the primary interface between host and pathogens (Medzhitov 2001; Cook et al. 2003; Fujita et al. 2004; Geijtenbeek et al. 2004; McGreal et al. 2004). In this context, two prototypic members of the C-type lectin–receptor family are particularly interesting, since they can act as both cell-adhesion receptors and pathogen-recognition receptors. These lectins include CD209 (DCSIGN: dendritic cell–specific ICAM-3 grabbing nonintegrin [MIM 604672]) and its close relative CD209L (L-SIGN: liver/lymph node–specific ICAM-3 grabbing nonintegrin [MIM 605872]) (Curtis et al. 1992; Geijtenbeek et al. 2000b, 2004; Soilleux et al. 2000; Pohlmann et al. 2001). These lectin-coding genes are located on chromosome 19p13.2-3, within an ∼26-kb segment, and result from a duplication of an ancestral gene (Bashirova et al. 2003; Soilleux 2003). An additional characteristic of both CD209 and CD209L is the presence of a neck region, primarily made up of seven highly conserved 23-aa repeats, that separates the carbohydrate-recognition domain involved in pathogen binding from the transmembrane region. This neck region presents high nucleotide identity between repeats, both within each molecule and between CD209 and CD209L. It has been shown that this region plays a crucial role in the oligomerization and support of the carbohydrate-recognition domain; therefore, it influences the pathogen-binding properties of these two receptors (Soilleux et al. 2000, 2003; Feinberg et al. 2005). In regard to expression profiles, CD209 is expressed primarily on phagocytic cells, such as dendritic cells and macrophages, whereas CD209L expression is restricted to endothelial cells in liver and lymph nodes (Bashirova et al. 2001; Soilleux et al. 2001, 2002). 
As pathogen-recognition receptors, the two lectins have been shown to recognize a vast range of microbes, some of which are of major public health importance (Geijtenbeek et al. 2004). Indeed, CD209 captures bacteria such as Mycobacterium tuberculosis, Helicobacter pylori, and certain Klebsiella pneumoniae strains; viruses such as HIV-1, Ebola virus, cytomegalovirus, hepatitis C virus, Dengue virus, and SARS-coronavirus; and parasites such as Leishmania pifanoi and Schistosoma mansoni (Geijtenbeek et al. 2000a, 2003; Alvarez et al. 2002; Colmenares et al. 2002; Halary et al. 2002; Appelmelk et al. 2003; Lozach et al. 2003; Tailleux et al. 2003; Tassaneetrithep et al. 2003; Bergman et al. 2004; Marzi et al. 2004). With regard to CD209L, studies to date have shown an interaction with a variety of viruses, including HIV, hepatitis C, Ebola, and coronavirus, as well as with the parasite Schistosoma mansoni (Bashirova et al. 2001; Alvarez et al. 2002; Gardner et al. 2003; Jeffers et al. 2004; Van Liempt et al. 2004). In this context, the efficiency of the two lectins in pathogen recognition and subsequent processing may have important consequences for the quality of host immune responses and consequent pathogen control and/or clearance.
An important step forward in the understanding of human adaptation to pathogens and control of infectious diseases is the description of the quality and quantity of genetic variation in genes involved in host recognition of infectious agents. Given the direct interaction of CD209 and CD209L with a large variety of pathogens, the CD209/CD209L genomic region provides an excellent model system to illustrate the extent to which pathogens have exerted selective pressures on host immunity genes. An additional feature that makes these genes highly interesting in evolutionary studies is that they are likely to have been influenced by similar genomic forces (recombination, mutation rates, etc.) because of their close physical proximity (∼15 kb), high nucleotide (73%) and amino acid (77%) identity, and identical exon-intron organization (Soilleux 2003) (fig. 1). In addition, it has been proposed that gene duplication of immunity genes is a molecular strategy developed by the host to enlarge its defense potential (Ohno 1970; Trowsdale and Parham 2004). A number of immune-system gene families have evolved, by gene duplication followed by natural selection, to provide responses to a wider range of pathogens, with well-documented examples in immunoglobulin and MHC genes (Hughes et al. 1994; Ota et al. 2000). In this context, duplicated genes in cis, like CD209 and CD209L, may have undergone differential selective pressures to enlarge the defense role of these lectins. To address these complex issues, we performed a sequence-based survey of the entire CD209/CD209L region in a panel of individuals of different ethnic origins. Here, we report evidence showing that these two closely related innate immunity genes have gone through completely different evolutionary processes that are reflected in their current patterns of diversity.
In addition, our study provides novel insights into how pathogens have shaped the patterns of variability of immunity genes resulting from gene duplication.
Figure 1 Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown.
Material and Methods
Population Samples
Sequence variation of the CD209/CD209L region was determined in 41 sub-Saharan Africans, 43 Europeans, and 43 East Asians, for a total of 254 chromosomes from the Human Genome Diversity Panel (HGDP)–CEPH panel (Cann et al. 2002). More-detailed information about the composition of the three major ethnic groups can be found in table 1. The variation in the repeat number of the neck region of CD209 and CD209L was defined in the entire HGDP-CEPH panel, comprising 1,064 DNA samples from 52 worldwide populations. In addition, the orthologous regions for both genes were sequenced in four chimpanzees (Pan troglodytes).
Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity
Location and Population | Geographic Origin | No. of Chromosomes
Africa:
Biaka Pygmies | Central African Republic | 12
Mbuti Pygmies | Democratic Republic of Congo | 10
Bantu Fang [a] | Gabon | 12
Bantu, northeastern | Kenya | 10
Bantu, southeastern Tswana | South Africa (Bantu, southeastern) | 4
Bantu, southwestern Herero | South Africa (Bantu, southwestern) | 4
Mandenka | Senegal | 12
San | Namibia | 8
Yoruba | Nigeria | 10
Europe:
Adygei | Russian Caucasus | 4
French | France | 16
French Basque | France | 12
Sardinian | Italy | 12
Russian | Russia | 12
Orcadian | Orkney Islands | 14
North Italian | Italy (Bergamo) | 16
East Asia:
Japanese | Japan | 42
Han | China | 4
Tujia | China | 4
Yizu | China | 4
Miaozu | China | 4
Orogen | China | 4
Daur | China | 4
Mongola | China | 4
Hezhen | China | 4
Xibo | China | 4
Uygur | China | 4
Dai | China | 4
a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection.
Molecular Analyses
The sequenced fragments of the CD209/CD209L genomic region are shown in figure 1. The entire CD209 region—including exons, introns, and ∼1 kb of the 5′ UTR corresponding to the promoter region—was sequenced, for a total of 5.5 kb per individual. For CD209L, we sequenced a total of ∼5.4 kb per individual, following the same approach used for CD209, with the exception of the neck region. That region was genotyped for its number of repeats, since it turned out to be highly polymorphic, which prevented the sequencing process. Genotyping was performed by a single PCR amplification followed by migration in 2% agarose gels. Human primers were used to both amplify and sequence the orthologous regions in chimpanzees. However, because of polymorphisms specific to the chimpanzee lineage, we could not obtain the entirety of the sequence. Thus, 4.9 kb (90% of the total) of the chimpanzee CD209 sequence were obtained, and 5.3 kb (98% of the total) of CD209L. Detailed information on primer sequences and PCR amplification conditions is available on request. All nucleotide sequences were obtained using the Big Dye terminator kit and the 3100 automated sequencer from Applied Biosystems. Sequence files and chromatograms were inspected using the GENALYS software (Takahashi et al. 2003; Centre National de Genotypage). As a measure of quality control, when new mutations were identified in primer binding regions, new primers were designed and sequence reactions were repeated, to avoid allele-specific amplification. All singletons observed in our data set were systematically reamplified and resequenced.
Statistical Analyses
On the basis of the levels of diversity observed in the CD209/CD209L genomic region, we calculated the average number of pairwise differences (π) and Watterson's estimator (θw) (Watterson 1975). Under the standard neutral model of a randomly mating population of constant size, these are unbiased estimators of the population mutation rate θ = 4Neμ, where Ne is the diploid effective population size and μ is the mutation rate per generation per site. To test whether the frequency spectrum of mutations conformed to the expectations of this standard neutral model, we calculated Tajima's D (Tajima 1989) and Fay and Wu's H (Fay and Wu 2000). P values for the different tests were estimated from 10^4 coalescent simulations under an infinite-site model, with use of a fixed number of segregating sites and the assumption of no recombination, which has been shown to be a conservative assumption (Gilad et al. 2002). In parallel, we estimated P values for all these tests, using the empirical distribution obtained from sequencing data of 132 genes in a panel of 24 African Americans and 23 European Americans (Akey et al. 2004). All these analyses, together with the interspecies McDonald-Kreitman (McDonald and Kreitman 1991) and KA/KS (Kimura 1968) tests, were performed using the DnaSP package (Rozas et al. 2003). Genetic distances between populations (FST) and heterozygosity values were estimated using the Arlequin package (Schneider et al. 2000). FST statistical significance was assessed using 10,000 bootstrap replications. To test for a deficit or an excess of heterozygosity in the neck region of CD209 and CD209L, we used BOTTLENECK (Cornuet and Luikart 1996) to compute, for each geographic region, the distribution of the heterozygosity expected from the observed number of alleles, given the sample size (n), under the assumption of mutation-drift equilibrium.
This distribution was obtained through simulation of the coalescent process of n genes under two mutational models, the infinite-site model and the stepwise mutation model. In addition, to obtain information on the fraction of genetic variance in the neck region that is due to intra- and interpopulation differences, we performed an analysis of molecular variance (AMOVA), using the Arlequin package (Schneider et al. 2000). The AMOVA results were compared with those of 377 microsatellites analyzed in the same population panel (Rosenberg et al. 2002).
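The summary statistics named above can be illustrated with a minimal sketch. It is not the DnaSP implementation; it assumes phased, equal-length haplotypes coded as strings of biallelic states, and the function name `diversity_stats` is hypothetical. The constants follow the published formulas of Watterson (1975) and Tajima (1989).

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return (S, pi, theta_w, tajima_d) for equal-length haplotype strings."""
    n = len(haplotypes)
    length = len(haplotypes[0])
    # Segregating sites: positions at which more than one allele is observed.
    seg_sites = [i for i in range(length) if len({h[i] for h in haplotypes}) > 1]
    s = len(seg_sites)
    # pi: mean number of differences over all n(n-1)/2 pairs of haplotypes.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a[i] != b[i] for i in seg_sites) for a, b in pairs) / len(pairs)
    # Watterson's estimator: S divided by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    theta_w = s / a1
    if s == 0:
        return s, pi, theta_w, 0.0
    # Variance constants from Tajima (1989).
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajima_d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, theta_w, tajima_d


# Toy data (not from the study): two deeply divergent haplotype classes.
print(diversity_stats(["00", "00", "11", "11"]))
```

The function returns per-locus values; dividing π and θw by the number of sequenced sites gives per-base-pair figures. Two divergent classes at intermediate frequency, as in the toy data, give a positive D, whereas an excess of singletons gives a negative D.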
Haplotype reconstruction was performed by use of the Bayesian statistical method implemented in Phase (v.2.1.1) (Stephens and Donnelly 2003). We applied the algorithm five times, using different randomly generated seeds, and consistent results were obtained across runs. After haplotype reconstruction, linkage disequilibrium (LD) between pairs of SNPs was computed using Lewontin's D′ index (Lewontin 1964). For this analysis, only markers with a minor allele frequency (MAF) of at least 10% were considered, since rare alleles have been shown to present a higher probability of being in significant LD than do common ones (Reich et al. 2001). The graphic display of the LD plots was constructed using GOLD (Abecasis and Cookson 2000; Center for Statistical Genetics). To test for the existence of a recombination hotspot in the region under study, we used the hotspot-recombination model implemented in Phase (v.2.1.1). Under this model, we assumed that there was, at most, one hotspot of unknown position. We then estimated the background population-recombination rate (ρ) and the relative intensity of any recombination hotspot. To obtain better estimates, we increased the number of iterations of the final run of the algorithm tenfold. All our estimates were obtained by averaging the results of five independent runs with different seed numbers. Since the model used is Bayesian, we could also estimate, for each population, the posterior probability of a hotspot of intensity >1 (λ>1) and >10 (λ>10).
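As an illustration of the pairwise LD measure described above, the following sketch computes the absolute value of Lewontin's D′ from phased haplotypes and applies a 10% frequency filter. It is a hypothetical helper rather than the GOLD implementation, and it assumes biallelic sites coded as "0" and "1".

```python
def minor_allele_frequency(haplotypes, i):
    """Frequency of the rarer allele at site i."""
    p = sum(h[i] == "1" for h in haplotypes) / len(haplotypes)
    return min(p, 1 - p)


def lewontin_d_prime(haplotypes, i, j):
    """|D'| between biallelic sites i and j of phased haplotype strings."""
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # D is scaled by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("both sites must be polymorphic")
    return abs(d) / d_max


# Toy data (not from the study): keep sites with MAF of at least 10%.
haps = ["00", "01", "11", "11"]
common = [i for i in range(2) if minor_allele_frequency(haps, i) >= 0.10]
print(common, lewontin_d_prime(haps, 0, 1))
```

In the toy data only three of the four possible two-site haplotypes are present, which is the situation in which D′ reaches 1.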
We obtained the gene tree and estimated the time of the most recent common ancestor (TMRCA) for CD209, using the maximum-likelihood coalescent method implemented in GENETREE (Griffiths and Tavare 1994). The mutation rate μ for each gene was estimated on the basis of the net divergence between humans and chimpanzees, under the assumptions that the species separation occurred 5 million years ago (MYA) and that the generation time is 20 years. Using this μ and the maximum-likelihood estimate of θ (θML), we estimated the effective population size parameter (Ne). With the assumption of a generation time of 20 years and the estimated Ne, the coalescence time, scaled in 2Ne units, was converted into years. The coalescence process implemented in SIMCOAL2 (Laval and Excoffier 2004) allowed us to estimate the probability of the TMRCA for CD209, through 2×10^4 simulations, with use of both the number of observed segregating sites and the estimated Ne.
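The chain of conversions in this paragraph can be written out as a short sketch. The function names are hypothetical, and the factor of 2 in the first function is an assumption: it reflects the usual convention that divergence accumulates along both the human and the chimpanzee lineages, which the text does not spell out. The example values of θML and μ are those reported for CD209 in the Results.

```python
def mutation_rate_per_generation(net_divergence, split_years=5e6, generation_years=20):
    """Per-gene mutation rate from net divergence (substitutions per gene)."""
    generations_since_split = split_years / generation_years
    # Assumed convention: substitutions accumulate on both lineages, hence 2 * t.
    return net_divergence / (2 * generations_since_split)


def effective_size(theta, mu):
    """Diploid effective population size from theta = 4 * Ne * mu."""
    return theta / (4 * mu)


def coalescent_units_to_years(t_units, ne, generation_years=20):
    """Convert a time scaled in units of 2Ne generations into years."""
    return t_units * 2 * ne * generation_years


# Values reported for CD209: theta_ML = 8.4, mu = 1.54e-4 per gene per generation.
ne = effective_size(8.4, 1.54e-4)
print(int(ne))  # 13636, the effective size given in the text
print(coalescent_units_to_years(1.0, ne))  # years spanned by one 2Ne unit
```

With these inputs, one coalescent time unit corresponds to roughly 545,000 years, so a tree depth expressed in 2Ne units is multiplied by that figure to obtain a date.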
Results
We determined sequence diversity in the CD209 and CD209L genes (fig. 1) as well as length variation of the neck region in 254 chromosomes originating from three major ethnic groups: sub-Saharan Africans, Europeans, and East Asians. In addition, the orthologous sequences were obtained in four chimpanzees, to infer the ancestral state at each site, to estimate the divergence between humans and chimpanzees, and to perform a number of interspecies neutrality tests.
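One of the interspecies tests referred to here, the McDonald-Kreitman test, compares fixed and polymorphic counts of synonymous and nonsynonymous changes in a 2×2 table. The sketch below evaluates such a table with a two-sided Fisher's exact test. This is an assumption about the test statistic, since the text does not state which one was used, and the function name is hypothetical. The counts are the exonic values listed in table 3.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact P value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(total - row1, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - (total - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Rows: fixed, polymorphic. Columns: synonymous, nonsynonymous (exonic counts, table 3).
cd209 = [[4, 5], [6, 0]]
cd209l = [[5, 6], [0, 4]]
print(round(fisher_exact_two_sided(cd209), 3))   # 0.044
print(round(fisher_exact_two_sided(cd209l), 3))  # 0.231
```

Under this assumption, the CD209 table departs from the neutral expectation because no nonsynonymous polymorphism is observed alongside five nonsynonymous fixed differences, whereas the CD209L table does not.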
Patterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region
For CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) achieved a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East-Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing.
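The Hardy-Weinberg check with Bonferroni correction mentioned above can be sketched as follows. The text does not say which test was applied, so the chi-square test with one degree of freedom used here is an assumption (an exact test is often preferred for rare alleles). The genotype counts are invented for illustration and both function names are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) P value for Hardy-Weinberg proportions at one biallelic site."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag P values that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Invented genotype counts (AA, AB, BB) for three sites.
sites = [(25, 50, 25), (30, 40, 30), (50, 0, 50)]
p_values = [hwe_chi2_pvalue(*g) for g in sites]
print(bonferroni_significant(p_values))  # [False, False, True]
```

Only the third site, which has no heterozygotes at all, is flagged after correction for the three tests.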
The allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups are illustrated in figure 2, along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2). Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or were even absent (H3) in Africa. However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions.
Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion.
Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations
Gene and Population | No. of Chromosomes | No. of Segregating Sites | No. of Haplotypes | HD [a] ± SD | π [b] ± SD | θw [c] | Tajima's D | Fay and Wu's H
CD209:
African | 82 | 70 | 26 | 91.8 ± 1.6 | 26 ± 3.8 | 25.3 | −.05 | −19.45 [d]
European | 86 | 18 | 14 | 79.6 ± 3.0 | 6.4 ± .6 | 6.5 | −.04 | −.26
East Asian | 86 | 12 | 11 | 56.7 ± 5.5 | 3.3 ± .5 | 4.3 | −.65 | −3.82 [d]
Total | 254 | 79 | 42 | 84.5 ± 1.6 | 13 ± 1.7 | 23.3 | |
CD209L:
African | 82 | 51 | 40 | 94.9 ± 1.2 | 16.1 ± .9 | 18.7 | −.49 | −1.52
European | 86 | 29 | 23 | 88.8 ± 1.9 | 17.7 ± 1.0 | 10.5 | 2.01 [e] | −.61
East Asian | 86 | 27 | 19 | 86.4 ± 1.8 | 16.0 ± .5 | 9.8 | 1.85 [d] | −.43
Total | 254 | 63 | 74 | 93.6 ± .7 | 17.7 ± .5 | 18.8 | |
Note: The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively.
a HD = haplotype diversity (%).
b Nucleotide diversity per base pair (×10^−4).
c Watterson's estimator per base pair (×10^−4).
d .02
10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. The strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in Phase (v.2.1.1) (fig. 4 ). This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to absence of recombination-rate variation, whereas λ values >1 indicate the presence of a hotspot. The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ>1), and of at least 10 times the background recombination rate, Pr(λ>10). Pr(λ>1) was 100% for all population groups, and Pr(λ>10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. 
However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remain unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot.
Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key.
Neutrality Tests
The identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2). To evaluate the selective pressures at the protein level, we performed two interspecies tests: KA/KS, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the KA/KS test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively.
For the McDonald-Kreitman test, the hypothesis of neutrality was rejected for only CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3).
Table 3 McDonald-Kreitman Test Results
No. of Substitutions and P Value for Exonic Region Only (columns 2 to 4) and for Entire Sequence [a] (columns 5 to 7)
Gene and Type of Site | Synonymous | Nonsynonymous | P | Synonymous | Nonsynonymous | P
CD209: | | | .04 | | | .009
Fixed | 4 | 5 | | 51 | 5 |
Polymorphic | 6 | 0 | | 86 | 0 |
CD209L: | | | .23 | | | 1
Fixed | 5 | 6 | | 78 | 6 |
Polymorphic | 0 | 4 | | 65 | 4 |
Note: The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics.
a Mutations in introns are considered synonymous.
Neck-Region Length Variation in Worldwide Populations
The identical genomic organization of CD209 and CD209L is extended to the neck region, which, in both genes, encodes a tract of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers.
Of the seven alleles observed (allele size classes of 4 to 10 repeats), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5). Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations.
Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site.
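To make the notion of heterozygosity at a multiallelic repeat locus concrete, the sketch below computes the unbiased expected heterozygosity (gene diversity) from allele frequencies. The text does not state how its heterozygosity values were obtained, so this estimator is an assumption and its output should not be expected to reproduce the tabulated values in every population. The function name is hypothetical, and the input frequencies are the African percentages listed in table 4.

```python
def gene_diversity(freqs, n_chromosomes):
    """Unbiased expected heterozygosity: n/(n-1) * (1 - sum of squared frequencies)."""
    total = sum(freqs)
    # Renormalise, because published percentages are rounded and may not sum to 100.
    props = [f / total for f in freqs]
    homozygosity = sum(p * p for p in props)
    return n_chromosomes / (n_chromosomes - 1) * (1 - homozygosity)


# African sample, 254 chromosomes (allele frequencies in %, table 4).
cd209_africa = [0.39, 99.21, 0.39]
cd209l_africa = [0.39, 62.20, 33.86, 3.54]
print(round(gene_diversity(cd209_africa, 254), 3))   # about 0.016
print(round(gene_diversity(cd209l_africa, 254), 3))  # about 0.499
```

The contrast between the two genes is visible directly: a locus dominated by one allele has a diversity near zero, whereas two common alleles push it toward one half.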
Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations
Columns: Location and Population; Geographic Origin; No. of Chromosomes; CD209 Relative Frequency (%) by No. of Repeats (10, 8, 7, 6, 5, 4, 3, 2) and HZ [a]; CD209L Relative Frequency (%) by No. of Repeats (10, 9, 8, 7, 6, 5, 4) and HZ [b]
Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50
Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47
Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47
Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83
San Namibia 14 100 35.71 64.29 .71
Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32
Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54
Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50
Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62
French France 58 100 48.28 12.07 36.21 3.45 .55
French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50
Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61
North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79
Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69
Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84
Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35
Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61
Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67
Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65
Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51
Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60
Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52
Pakistani [b] Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52
Uygur China 20 100 10.00 65.00 5.00 20.00 .50
East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47
Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36
Chinese [c] China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45
Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58
Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48
Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72
Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65
NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77
Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45
Karitiana Brazil 48 100 4.17 56.25 39.58 .54
Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33
Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46
Pima Mexico 50 100 8.00 64.00 28.00 .36
Mayan Mexico 50 100 16.00 44.00 40.00 .56
Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54
a Heterozygosity values.
b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash.
c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi.
Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L
Findings for Neck Regions of CD209 (columns 2 to 5) and CD209L (columns 6 to 9)
Population | Observed Heterozygosity | Expected Heterozygosity [a] | P, ISM [b] | P, SMM [c] | Observed Heterozygosity | Expected Heterozygosity [a] | P, ISM [b] | P, SMM [c]
African | 1.6 | 27.9 | .030 | .000 | 50 | 37 | .328 | .229
European | .6 | 15.3 | .158 | .094 | 62 | 44 | .179 | .304
Middle Eastern | 5.6 | 43.1 | .018 | .000 | 61 | 49 | .299 | .095
Central/South Asian | 1.4 | 35.1 | .003 | .000 | 52 | 43 | .387 | .098
East Asian | 1.2 | 34.5 | .003 | .000 | 47 | 42 | .472 | .054
Oceanian | .0 | … | … | … | 72 | 53 | .071 | .337
American | 2.8 | 16.3 | .323 | .205 | 45 | 29 | .273 | .440
Total sample | 2.0 | 49.7 | .002 | .000 | 54 | 47 | .405 | .013
Note: We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics.
a Under the infinite-site model.
b Probability of the observed heterozygosity under the infinite-site model.
c Probability of the observed heterozygosity under the stepwise mutational model. Time of the Most Recent Common Ancestor for CD209 The low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavare 1994) for estimation of the time scale of the origin and evolution of this gene. Since this method assumes an infinite-site model without recombination, the same analysis for CD209L was not conducted because of the substantial amount of recombinant haplotypes observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6 . The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed in both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples are restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ (θML) for CD209 was 8.4. On the basis of this θML value and the estimated mutation rate (1.54×10−4 per gene per generation), the effective population size (N e) was 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The T MRCA of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest T MRCA values estimated so far in the human genome (Excoffier 2002). Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. 
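The effective population size quoted for CD209 can be recovered from the two reported quantities. The sketch below assumes the standard relation for an autosomal diploid locus, θ = 4Neμ, which the text does not state explicitly but which reproduces the reported figure; the function and variable names are illustrative only.

```python
def effective_population_size(theta: float, mu: float) -> float:
    """Solve theta = 4 * Ne * mu for Ne (assumed autosomal diploid relation)."""
    return theta / (4.0 * mu)


# Values reported in the text for CD209
theta_ml = 8.4          # maximum-likelihood estimate of theta
mu_per_gene = 1.54e-4   # mutation rate per gene per generation

ne = effective_population_size(theta_ml, mu_per_gene)
print(round(ne))
```

With the reported inputs, the rounded result matches the Ne of 13,636 given in the text.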
Discussion
The CD209/CD209L region possesses a number of characteristics that make it a powerful tool for evolutionary inference. These two genes are not in LD, despite their close physical proximity (∼15 kb), and each of them behaves as an independent genetic entity. Moreover, our results suggest that the CD209/CD209L region is a uniform landscape of genomic forces, since the two lectin-coding genes present similar mutation rates, as well as high nucleotide identity and conserved exon-intron organization (fig. 1).

Contrasting Patterns of Diversity in the CD209/CD209L Region
Our diversity study revealed completely different patterns for the two genes. First, levels of nucleotide diversity (π) were found to be much lower for CD209 than for CD209L (table 2). On the basis of 1.42 million SNPs, the International SNP Map Working Group defined 7.5×10⁻⁴ as the average value of nucleotide diversity for the human genome and showed that 95% of all bins presented π values varying from 2.0×10⁻⁴ to 15.8×10⁻⁴ (Sachidanandam et al. 2001). In addition, an independent study analyzed nucleotide and haplotype diversity for 313 genes and defined the average π value as 5.4×10⁻⁴ (Stephens et al. 2001). In this context, the values observed for CD209 (3–7×10⁻⁴) are in agreement with these genome estimations, with the exception of the African sample, which showed extreme levels of diversity (26.0×10⁻⁴) because of the presence of cluster A. By contrast, the π values observed for CD209L (16–18×10⁻⁴) are at least twofold higher than average genome estimates and lie at or just above the upper limit (15.8×10⁻⁴) of the 95% range defined by the SNP Consortium (Sachidanandam et al. 2001). This contrast in nucleotide diversity between the two genes can be explained either by a disparity in local mutation rates or by actual differences in selective pressures. However, no major differences in mutation rates (1.57×10⁻⁹ vs. 1.70×10⁻⁹) were observed between the two homologues, nor was there substantial variation in GC content, which has been positively correlated with mutation rates and levels of polymorphism (Sachidanandam et al. 2001; Smith et al. 2002; Waterston et al. 2002; Hellmann et al. 2003). Indeed, the GC content for CD209 (53.7%) was slightly higher than that observed for CD209L (50.9%), which reinforces the idea that different selective pressures may indeed have been the driving force behind the distinct patterns of diversity observed.

Second, the patterns of repeat variation in the neck region also turned out to be strikingly different between the two genes. CD209 showed levels of heterozygosity of only 2%, whereas CD209L presented an extraordinarily high level of worldwide diversity, with an overall heterozygosity of 54% (table 5 and fig. 5). Although the neck regions of the two genes share 92% nucleotide identity, nonuniform mutation rates could, again, explain the patterns observed. However, this does not seem to be the case, since mutation-rate variation should influence the number of alleles observed rather than their frequencies, which are subject either to genetic drift or to natural selection. Indeed, we observed an even higher number of repeat alleles for CD209 (eight alleles) than for CD209L (seven alleles) (table 4 and fig. 5). Overall, differences in genomic forces seem to be insufficient to explain the contrasting patterns observed at both the sequence and neck-region length variation levels; therefore, differential selective pressures acting on these genes become the most plausible scenario.

CD209: The Signature of a Functional Constraint
For CD209, not only nucleotide diversity but also intercontinental FST values (0.15) were in conformity with previous worldwide estimations (Harpending and Rogers 2000; Akey et al. 2002; Cavalli-Sforza and Feldman 2003).
For frequency-spectrum–based tests, only Fay and Wu's H test detected an excess of high-frequency derived alleles for the African and East Asian samples, a picture that may be interpreted as the result of a selective sweep. However, the significantly negative value observed in Africa is, again, exclusively due to the presence of cluster A, since 22 of the 35 fixed SNPs distinguishing it from cluster B corresponded to the derived allelic status in the latter cluster. Because cluster B accounts for 85% of the African variability, a clear excess of high-frequency derived alleles was observed. The extent to which the presence of this cluster is due to either natural selection or population structure will be discussed in detail below. For East Asia, the significance of the H test is also questionable when accounting for the confounding effects of demography. Indeed, when we plotted our H value against the empirical distribution of 132 H values from non-African populations (Akey et al. 2004), the East Asian P value became nonsignificant (P=.36). This observation reinforces the idea that the H test is particularly sensitive to past bottlenecks and/or population subdivision (Przeworski 2002). Thus, regarding the global levels of sequence diversity, the CD209 locus seems to evolve neutrally. Nevertheless, when we focused our analyses at the protein level, signs of natural selection were uncovered. Indeed, the McDonald-Kreitman test rejected neutrality for this gene because of a clear excess of polymorphic synonymous sites (i.e., a lack of nonsynonymous variants). In addition, when the number of synonymous sites (146) versus nonsynonymous sites (499) was compared with the observed number of synonymous (5) versus nonsynonymous (0) mutations, we detected a significant lack of nonsynonymous mutations (two-tailed Fisher exact test, P=6.3×10⁻⁴).
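The comparison of sites with observed mutations can be laid out as a 2×2 table and tested directly. The sketch below is one plausible construction, assuming one row of observed mutations (5 synonymous, 0 nonsynonymous) and one row of sites (146 synonymous, 499 nonsynonymous). The exact table layout behind the reported P value is not given in the text, so the result should be expected to agree only in order of magnitude. The function name is hypothetical, and the two-tailed P value is taken as the sum over all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b          # first-row total (observed mutations)
    col1 = a + c          # first-column total (synonymous)
    n = a + b + c + d     # grand total

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Sum every table at least as extreme (no more probable) as the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Assumed layout: mutations (syn, nonsyn) = (5, 0); sites (syn, nonsyn) = (146, 499)
p_value = fisher_exact_two_tailed(5, 0, 146, 499)
print(f"{p_value:.1e}")
```

Under this assumed layout the P value is on the order of 6×10⁻⁴, in line with the significant lack of nonsynonymous mutations reported in the text.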
These observations point to a strong selective constraint acting on CD209 that prevents the accumulation of amino acid replacements over time.

Further support for a functional constraint in CD209 comes from the patterns of diversity observed in the neck region. In contrast to CD209L, virtually no variation was observed at CD209 (fig. 5A), with the 7-repeat allele accounting for 99% of the total variability. Moreover, the low levels of heterozygosity observed resulted in a consistent rejection of mutation-drift equilibrium in almost all geographical regions (table 5). The probability of finding such a low heterozygosity value, given the overall number of alleles observed, was estimated to be <0.2%, independent of the mutational model considered (table 5). Thus, the fact that no alleles other than the 7-repeat allele have increased in frequency, together with recent studies addressing the functional consequences of repeat-number variation in this region (Bernhard et al. 2004; Feinberg et al. 2005), strongly suggests a clearly reduced fitness of any allele other than the 7-repeat allele. Interestingly, it has recently been shown that a protein with two fewer repeats (a 5-repeat allele) results in a partial dissociation of the final tetramer, whereas a protein with <5 repeats exhibits a dramatic reduction in overall stability (Feinberg et al. 2005), with all these differences having a direct impact on the quality of ligand-binding functions (Bernhard et al. 2004). Taken together, the patterns of diversity observed at CD209 clearly point to a strong functional constraint acting on this gene and further support the proposed crucial role of this lectin in pathogen recognition and in the early steps of immune response (Geijtenbeek et al. 2000b, 2004).

CD209L: Relaxation of the Functional Constraint or Balancing Selection?
In clear contrast to its homologue, CD209L presented extremely elevated nucleotide-diversity levels.
High levels of diversity can result either from a relaxation of the functional constraint, which allows the stochastic accumulation of new mutations, or from the action of balancing selection, which maintains over time two or more functionally different alleles (and all linked variation) at intermediate frequencies. Several lines of evidence lend support to the selective hypothesis.

First, if CD209L nucleotide diversity has been driven by the action of balancing selection, population-genetics relationships would have been accordingly altered. In this context, diversity studies in neutral, or assumedly neutral, regions of the genome, such as the Y chromosome (Underhill et al. 2000; Hammer et al. 2001; Jobling and Tyler-Smith 2003), mtDNA (Wallace et al. 1999; Ingman et al. 2000; Mishmar et al. 2003), Alu insertions (Watkins et al. 2001), and some autosomal genes (Stephens et al. 2001; Akey et al. 2004), showed that African populations are genetically more diverse than are non-Africans, an observation generally interpreted as support for the “Out of Africa” model for the origin of modern humans (Lewin 1987). For CD209L, even if we observed 1.5 times more segregating sites in African than in non-African populations, as indicated by the higher θw value found in Africa, similar values of nucleotide diversity were detected in the three groups, with Europeans presenting even higher π values than do Africans. This unusual scenario, which is at odds with neutral expectations, has already been described for other regions of the genome, such as the β-globin gene and the 5′ cis-regulatory region of CCR5, for which the action of balancing selection has been convincingly proposed (Harding et al. 1997; Bamshad et al. 2002).

Second, balancing selection tends to increase within-population diversity while decreasing FST, compared with neutrally evolving loci (Cavalli-Sforza 1966; Harpending and Rogers 2000; Akey et al. 2002; Bamshad and Wooding 2003; Cavalli-Sforza and Feldman 2003). Indeed, our data are compatible with these predictions, since the FST value of 5% observed for CD209L is threefold lower than that estimated for CD209 (15%) and is similar to that found, for example, for the bitter-taste receptor gene (5.6%), for which there is compelling evidence of balancing-selection action (Wooding et al. 2004).

Third, results of our Tajima's D analysis were significantly positive for European and East Asian populations, because of the skew of the CD209L frequency spectrum toward an excess of intermediate-frequency alleles (table 2), a pattern that further supports the action of balancing selection. However, since the null model used to assess significance makes unrealistic assumptions about past population demography (i.e., constant population sizes), the rejection of the standard neutral model cannot be interpreted as unambiguous evidence of selection. Indeed, the observation that only non-African populations showed a significant departure from neutrality raises the question of whether these patterns could have resulted instead from the bottleneck that occurred during the Out of Africa exodus. A way to circumvent this conundrum is to exploit the fact that demography affects the whole genome equally, whereas selection directs its effects toward specific loci. Thus, to correct for the confounding effects of demography, we plotted our results against the empirical distributions of Akey et al. (2004) for Tajima's D statistics. Our values remained significant for CD209L, which therefore reinforces the idea that the pattern observed is unlikely to be the sole result of demography.

Last, if the patterns of variation in CD209L represent the molecular signature of balancing selection, at least in non-Africans, then a functional target of such a selective regime is needed.
In this context, the neck region constitutes an excellent candidate, since it plays a major mediating role in the orientation and flexibility of the carbohydrate-recognition domain. Since this domain is directly involved in pathogen recognition, neck-region length variation has important consequences for the pathogen-binding properties of these lectins (Mitchell et al. 2001; Bernhard et al. 2004; Feinberg et al. 2005). In perfect agreement with the results of our sequence-based data set, higher diversity in repeat variation was observed in the neck region among non-African populations (Native Americans excepted). Out of Africa, at least three alleles account for most population diversity, whereas, in Africa, the 6- and 7-repeat alleles alone account for 96% of the global variability (fig. 5B). Again, the higher diversity observed out of Africa could be due to a higher level of relaxation of the functional constraint of the neck region in non-African compared with African populations, which would lead to a random accumulation of proteins with varying neck-region lengths among non-Africans. Conversely, these patterns could also be explained by the action of balancing selection in non-Africans and could therefore point to the neck region as the functional target of such a selective regime.

To evaluate the plausibility of these two conflicting scenarios, we compared the variation in the CD209L neck region with that inferred from 377 neutral autosomal microsatellites typed elsewhere for the same population panel (Rosenberg et al. 2002). We reasoned that if CD209L diversity has been shaped only by demography (i.e., bottleneck out of Africa), the distribution of genetic variance at different hierarchical levels should be comparable to that inferred through the neutral markers. On the other hand, if selection has driven the CD209L neck-region diversity, population-genetics distances would be influenced accordingly and would therefore differ from neutral expectations.
Indeed, the AMOVA values inferred for CD209L fell systematically outside the 95% CI defined for the microsatellite data set (table 6). We observed that populations within Europe, Asia, the Middle East, and Oceania exhibited lower-than-expected diversity among populations within the same region. A reduction of genetic distances between populations is expected under balancing selection; therefore, the results from the CD209L neck region favor, once again, the action of this selective regime in most non-African populations, to the detriment of the neutral hypothesis. One may argue that the differences in the proportions of genetic variance between our data and those of Rosenberg et al. (2002) could be due to differences in the pace of mutation between microsatellite loci and our neck repeat region, which could be considered a “coding minisatellite.” However, under neutrality, differences in mutation rate should have a similar and proportional effect in all population comparisons and should shift all values in the same direction (i.e., toward higher or lower values). This is not the case: populations within Europe, the Middle East, Central/South Asia, East Asia, and Oceania turned out to be genetically closer than expected, whereas populations within Africa and the Americas exhibited the opposite pattern (table 6), which makes it highly unlikely that mutation-rate differences influenced our conclusions.

Table 6 AMOVA for the Neck Region of CD209L
AMOVA Value (95% CI) Inferred for CD209L(b)

Sample(a) | No. of Regions | No. of Populations | Within Populations | Among Populations within Regions | Among Regions
World | 7 | 52 | 90.4 (93.8–94.3) | 2.1 (2.3–2.5) | 7.57 (3.3–3.9)
Africa | 1 | 6 | 93.9 (96.7–97.1) | 6.1 (2.9–3.3) |
Eurasia | 3 | 21 | 97.0 (98.2–98.4) | .2 (1.1–1.3) | 2.8 (.4–.6)
Europe | 1 | 8 | 99.5 (99.1–99.4) | .5 (0.6–0.9) |
Middle-East | 1 | 4 | 100 (98.6–98.8) | 0 (1.2–1.4) |
Central/South Asia | 1 | 9 | 99.5 (98.5–98.8) | .5 (1.2–1.5) |
East Asia | 1 | 18 | 99.3 (98.6–98.9) | .7 (1.1–1.4) |
Oceania | 1 | 2 | 96.0 (92.8–94.3) | 4.0 (5.7–7.2) |
America | 1 | 5 | 86.7 (87.7–89) | 13.3 (11.0–12.3) |

Note: No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus.
(a) Populations are grouped as described by Rosenberg et al. (2002).
(b) AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002).

Taken together, the integration of the results from levels of nucleotide and amino acid diversity, neutrality tests, population-genetics distances, and neck-region length variation in CD209 and CD209L clearly points to a situation in which CD209 has been under a strong selective constraint that prevents accumulation of any amino acid changes over time, whereas CD209L variability has most likely been driven by the action of balancing selection, at least in non-African populations.

The Footprints of Ancestral Population Diversity
In apparent contradiction with the strong selective constraint described for CD209, we observed an unusual excess of diversity, with 35 fixed differences separating the two basal branches of the gene tree (fig. 6). In addition, we estimated a TMRCA of 2.8±0.22 MYA, a time that places the most recent common ancestor of CD209 back in the Pliocene epoch, before the estimated time for the origins of the genus Homo ∼1.9 MYA (Wood 1996; Wood and Collard 1999). A number of studies have already reported loci that present unusually deep coalescent times (Harris and Hey 1999; Zhao et al. 2000; Webster et al. 
2003; Garrigan et al. 2005a, 2005b), but our estimation for CD209 remains one of the deepest TMRCA values yet reported (Excoffier 2002). The probability of finding such a deep coalescence time under a scenario of a random-mating population was estimated, through a coalescent process (Laval and Excoffier 2004), to be very low (P=.018) (see fig. 7). In addition to the unexpected antiquity of the CD209 locus, we observed a peculiar tree topology made of two highly divergent and frequency-unbalanced lineages, cluster A embracing only 2 internal haplotypes and cluster B comprising the remaining 23 (fig. 6).

Figure 7 Coalescent-based simulations (2×10⁴) of the expected TMRCA distribution of CD209.

Different hypotheses can account for such elongated and divergent haplotype patterns. Indeed, the high levels of nucleotide identity between CD209 and CD209L could have led to gene conversion between the two genes, an event that would explain the outlier position of cluster A in the context of CD209 phylogeny. We reasoned that if gene conversion had occurred, the derived alleles distinguishing clusters A and B in CD209 would correspond to the allelic state observed at their homologous positions in CD209L. Of all positions, only four fit this criterion. In addition, these positions were not physically clustered, which therefore excludes a major gene-conversion event as the explanation of the divergent CD209 phylogeny.

Two other circumstances may be responsible for the topology and the time depth of the CD209 gene tree: long-standing balancing selection or ancient population structure, with Africa, in both cases, being the arena of such events (i.e., cluster A is restricted to Africa). Several lines of evidence argue against the balancing-selection hypothesis. First, under this selective regime, one would expect that Tajima's D test would also point in this direction by yielding significantly positive values, which is not the case (table 2).
Second, such a long-standing balancing selection in Africa would have entailed a number of recombinant haplotypes between clusters A and B, which, again, is not the case, as illustrated by the high LD levels at CD209 (fig. 3). Third, a claim of balancing selection at this locus must imply a functional difference between the two balanced alleles. Indeed, three nonsynonymous mutations, situated in the neck region, separate clusters A and B, and they could correspond to the alleles under selection. But, if the neck region is the target of selection, it is more likely that the balanced alleles would correspond to different numbers of repeats rather than point nucleotide variation within each track, as observed for CD209L and suggested by functional studies (Bernhard et al. 2004; Feinberg et al. 2005). Since no variation in the number of repeats was detected between the two clusters, we predict that there are no major functional differences between the two lineages. Taken together, maintenance of ancient lineages by balancing selection does not seem to be responsible for the observed haplotype divergence.

In this view, the patterns observed are best explained by an ancestral population structure on the African continent. Indeed, several studies have already proposed that African populations must have been more strongly subdivided and isolated than non-African ones (Harris and Hey 1999; Labuda et al. 2000; Excoffier 2002; Goldstein and Chikhi 2002; Harding and McVean 2004; Satta and Takahata 2004; Garrigan et al. 2005a). In particular, a recent study of the Xp21.1 locus presented convincing statistical evidence that supports the hypothesis that our species does not descend from a single, historically panmictic population (Garrigan et al. 2005a). The divergent haplotype pattern observed at the Xp21.1 locus prompted those authors to explain their data under the isolation-and-admixture (IAA) model and/or a metapopulation model (Harding and McVean 2004; Wakeley 2004).
Indeed, as observed for CD209, under an IAA model, the two basal branches are expected to be longer than those under a Wright-Fisher model, depending on the length of time subpopulations spent in isolation. The extent to which the IAA model fits the data depends on the number of mutations, referred to as “congruent sites,” occurring in the two basal branches of the genealogy. For Xp21.1, 10 congruent sites out of 24 polymorphisms were observed (i.e., ∼42% of the total number of sites). We applied the same approach to CD209 and obtained a very similar percentage of ∼45%, in good accordance with the IAA model. Our observations, together with a number of autosomal diversity studies, show that modern human diversity appears to have kept genetic traces of admixture among archaic hominid populations. However, a number of questions remain unanswered, such as the time when these admixture events occurred (i.e., before or after the appearance of anatomically modern humans), the precise quantitative contribution of ancient genetic material to our modern gene pool, and the geographic provenance of these genetic vestiges.

Conclusions
The need for continuous evolution in both the human host and the pathogens is predicted by the Red Queen hypothesis (Van Valen 1973; Bell 1982), in reference to the remark of the Red Queen to Alice in Through the Looking Glass (Carroll 1872): “Now, here, you see, it takes all the running you can do, to keep in the same place.” This metaphor provides a conceptual framework for understanding how interactions between the two species lead to constant natural selection for adaptation and counteradaptation. In this context, one feature exploited by the host immunity genes to increase their defense potential is gene duplication: one duplicate retains the currently useful function of the encoded protein, while its twin is liberated to mutate and possibly acquire novel functions (Ohno 1970; Trowsdale and Parham 2004).
The lectins CD209 and CD209L represent a prototypic model of the duplicated progeny of an ancestral gene that interact with a vast spectrum of pathogens. Our results clearly indicate that these duplicated genes have evolved, and might still evolve, under completely different evolutionary pressures. Whereas one, CD209, shows signals of strong conservation, its paralogue, CD209L, exhibits an excess of sequence diversity compatible with the action of balancing selection. In addition, the strong contrast observed in length variation of the neck region between the two genes may have important consequences in medical genetics. In this context, association studies are now needed that correlate length variation of the neck region with susceptibility to infectious diseases whose etiological agents are known to interact with one (or both) of these lectins. More generally, our study has revealed that even a short segment of the human genome can help uncover an extraordinarily complex evolutionary history, including different pathogen pressures on host immunity genes, as well as traces of ancient population structure on the African continent. The coming years will certainly bring unprecedented large data sets of sequence diversity, genomewide and populationwide, with each genomic region possibly revealing a different aspect of human history. The integration of all these apparently independent pieces of the same reality will provide us with a much broader and more realistic view of the demographic history of the human species, as well as of human adaptation to the different environmental conditions imposed not only by pathogens but also by other major factors such as climate and nutritional resources.

Acknowledgments
We warmly acknowledge Guillaume Laval for useful suggestions on the use of SIMCOAL software, Laurent Excoffier and Francesca Luca for stimulating discussions, and two reviewers for constructive comments on the first version of the manuscript. L.B.B. was supported by Fundação para a Ciência e a Tecnologia fellowship SFRH/BD/18580/2004.

The URLs for data presented herein are as follows:
Arlequin, http://lgb.unige.ch/arlequin/
BOTTLENECK, http://www.montpellier.inra.fr/CBGP/softwares/bottleneck/bottleneck.html
Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/GOLD/ (for GOLD software)
Centre National de Genotypage, http://software.cng.fr/ (for GENALYS software)
DnaSP, http://www.ub.es/dnasp/
GENETREE Software, http://www.stats.ox.ac.uk/∼griff/software.html
HGDP-CEPH Human Genome Diversity Cell Line Panel, http://www.cephb.fr/HGDP-CEPH-Panel/
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for dendritic cell–specific ICAM-3 grabbing nonintegrin and liver/lymph node–specific ICAM-3 grabbing nonintegrin)
Phase, http://www.stat.washington.edu/stephens/phase.html
SIMCOAL2, http://cmpg.unibe.ch/software/simcoal2/
body | Introduction Infectious diseases have been paramount among the threats to health and survival for most of human evolutionary history (Haldane 1949; Lederberg 1999; Harpending and Rogers 2000; Cooke and Hill 2001). The interaction of the human host with a wide variety of pathogens has been accompanied by genetic adaptations to spatially and temporally fluctuating selective pressures imposed by the infectious agents. Numerous studies have sought the genetic imprint of natural selection imposed by pathogen pressures in human genes involved in immune response or, more generally, in host-pathogen interactions (Vallender and Lahn 2004). For example, natural selection has acted on such genes as MHC, β-globin, G6PD, IL-2, IL-4, TNFSF5, the Duffy blood group genes, and CCR5 (Ohta 1991; Hughes et al. 1994; Flint et al. 1998; Hamblin and Di Rienzo 2000; Tishkoff et al. 2001; Bamshad et al. 2002; Sabeti et al. 2002; Verrelli et al. 2002). However, little is known about genetic variation of genes involved in direct recognition of pathogens, or pathogens' products, and virtually no studies have investigated the extent to which pathogens have exerted selective pressures on the innate immune system. The phylogenetically ancient innate immune system governs the initial detection of pathogens and stimulates the first line of host defense (Medzhitov and Janeway 1998a, 2000, 2002; Janeway and Medzhitov 2002). Recognition of pathogens is mediated by phagocytic cells through germline-encoded receptors, known as “pattern recognition receptors,” which detect pathogen-associated molecular patterns that are characteristic products of microbial physiology (Kimbrell and Beutler 2001; Janeway and Medzhitov 2002). This initial interaction is then translated into a set of endogenous signals that ultimately lead to the induction of the adaptive immune response (Medzhitov and Janeway 1998b). 
In recent years, the C-type lectin receptors have received much attention in the area of innate immunology, yielding novel functional insights into the primary interface between host and pathogens (Medzhitov 2001; Cook et al. 2003; Fujita et al. 2004; Geijtenbeek et al. 2004; McGreal et al. 2004). In this context, two prototypic members of the C-type lectin–receptor family are particularly interesting, since they can act as both cell-adhesion receptors and pathogen-recognition receptors. These lectins are CD209 (DC-SIGN: dendritic cell–specific ICAM-3 grabbing nonintegrin [MIM 604672]) and its close relative CD209L (L-SIGN: liver/lymph node–specific ICAM-3 grabbing nonintegrin [MIM 605872]) (Curtis et al. 1992; Geijtenbeek et al. 2000b, 2004; Soilleux et al. 2000; Pohlmann et al. 2001). These lectin-coding genes are located on chromosome 19p13.2-3, within an ∼26-kb segment, and result from a duplication of an ancestral gene (Bashirova et al. 2003; Soilleux 2003). An additional characteristic of both CD209 and CD209L is the presence of a neck region, primarily made up of seven highly conserved 23-aa repeats, that separates the carbohydrate-recognition domain involved in pathogen binding from the transmembrane region. This neck region presents high nucleotide identity between repeats, both within each molecule and between CD209 and CD209L. It has been shown that this region plays a crucial role in the oligomerization and support of the carbohydrate-recognition domain; therefore, it influences the pathogen-binding properties of these two receptors (Soilleux et al. 2000, 2003; Feinberg et al. 2005). In regard to expression profiles, CD209 is expressed primarily on phagocytic cells, such as dendritic cells and macrophages, whereas CD209L expression is restricted to endothelial cells in liver and lymph nodes (Bashirova et al. 2001; Soilleux et al. 2001, 2002). 
As pathogen-recognition receptors, the two lectins have been shown to recognize a vast range of microbes, some of which are of major public health importance (Geijtenbeek et al. 2004). Indeed, CD209 captures bacteria such as Mycobacterium tuberculosis, Helicobacter pylori, and certain Klebsiella pneumoniae strains; viruses such as HIV-1, Ebola virus, cytomegalovirus, hepatitis C virus, Dengue virus, and SARS-coronavirus; and parasites like Leishmania pifanoi and Schistosoma mansoni (Geijtenbeek et al. 2000a, 2003; Alvarez et al. 2002; Colmenares et al. 2002; Halary et al. 2002; Appelmelk et al. 2003; Lozach et al. 2003; Tailleux et al. 2003; Tassaneetrithep et al. 2003; Bergman et al. 2004; Marzi et al. 2004). With regard to CD209L, studies to date have shown an interaction with a variety of viruses, including HIV, hepatitis C, Ebola, and coronavirus, as well as with the parasite Schistosoma mansoni (Bashirova et al. 2001; Alvarez et al. 2002; Gardner et al. 2003; Jeffers et al. 2004; Van Liempt et al. 2004). In this context, the efficiency of the two lectins in pathogen recognition and subsequent processing may have important consequences for the quality of host immune responses and consequent pathogen control and/or clearance. An important step forward in the understanding of human adaptation to pathogens and control of infectious diseases is the description of the quality and quantity of genetic variation in genes involved in host recognition of infectious agents. Given the direct interaction of CD209 and CD209L with a large variety of pathogens, the CD209/CD209L genomic region provides an excellent model system to illustrate the extent to which pathogens have exerted selective pressures on host immunity genes. An additional feature that makes these genes highly interesting in evolutionary studies is that they are likely to have been influenced by similar genomic forces (recombination, mutation rates, etc.) 
because of their close physical proximity (∼15 kb), high nucleotide (73%) and amino acid (77%) identity, and identical exon-intron organization (Soilleux 2003) (fig. 1). In addition, it has been proposed that gene duplication of immunity genes is a molecular strategy developed by the host to enlarge its defense potential (Ohno 1970; Trowsdale and Parham 2004). A number of immune-system gene families have evolved, by gene duplication followed by natural selection, to provide responses to a wider range of pathogens, with well-documented examples in immunoglobulin and MHC genes (Hughes et al. 1994; Ota et al. 2000). In this context, duplicated genes in cis, like CD209 and CD209L, may have undergone differential selective pressures to enlarge the defense role of these lectins. To address these complex issues, we performed a sequence-based survey of the entire CD209/CD209L region in a panel of individuals of different ethnic origins. Here, we report evidence showing that these two closely related innate immunity genes have gone through completely different evolutionary processes that are reflected in their current patterns of diversity. In addition, our study provides novel insights into how pathogens have shaped the patterns of variability of immunity genes resulting from gene duplication. Figure 1 Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. Material and Methods Population Samples Sequence variation of the CD209/CD209L region was determined in 41 sub-Saharan Africans, 43 Europeans, and 43 East Asians, in a total of 254 chromosomes from the Human Genome Diversity Panel (HGDP)–CEPH panel (Cann et al. 2002). More-detailed information about the composition of the three major ethnic groups can be found in table 1. 
The variation in the repeat number of the neck region of CD209 and CD209L was defined in the entire HGDP-CEPH panel, comprising 1,064 DNA samples from 52 worldwide populations. In addition, the orthologous regions for both genes were sequenced in four chimpanzees (Pan troglodytes).
Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity
Location and Population        Geographic Origin                     No. of Chromosomes
Africa:
  Biaka Pygmies                Central African Republic              12
  Mbuti Pygmies                Democratic Republic of Congo          10
  Bantu Fang [a]               Gabon                                 12
  Bantu, northeastern          Kenya                                 10
  Bantu, southeastern Tswana   South Africa (Bantu, southeastern)    4
  Bantu, southwestern Herero   South Africa (Bantu, southwestern)    4
  Mandenka                     Senegal                               12
  San                          Namibia                               8
  Yoruba                       Nigeria                               10
Europe:
  Adygei                       Russian Caucasus                      4
  French                       France                                16
  French Basque                France                                12
  Sardinian                    Italy                                 12
  Russian                      Russia                                12
  Orcadian                     Orkney Islands                        14
  North Italian                Italy (Bergamo)                       16
East Asia:
  Japanese                     Japan                                 42
  Han                          China                                 4
  Tujia                        China                                 4
  Yizu                         China                                 4
  Miaozu                       China                                 4
  Oroqen                       China                                 4
  Daur                         China                                 4
  Mongola                      China                                 4
  Hezhen                       China                                 4
  Xibo                         China                                 4
  Uygur                        China                                 4
  Dai                          China                                 4
[a] These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection.
Molecular Analyses The sequenced fragments of the CD209/CD209L genomic region are shown in figure 1. The entire CD209 region, including exons, introns, and ∼1 kb of the 5′ UTR corresponding to the promoter region, was sequenced, for a total of 5.5 kb per individual. For CD209L, we sequenced a total of ∼5.4 kb per individual, following the same approach used for CD209, with the exception of the neck region. That region was genotyped for its number of repeats, since it turned out to be highly polymorphic, which prevented direct sequencing. Genotyping was performed by a single PCR amplification followed by migration in 2% agarose gels. Human primers were used to both amplify and sequence the orthologous regions in chimpanzees. 
However, because of polymorphisms specific to the chimpanzee lineage, we could not obtain the entirety of the sequence. Thus, 4.9 kb (90% of the total) of the chimpanzee CD209 sequence were obtained, and 5.3 kb (98% of the total) of CD209L. Detailed information on primer sequences and PCR amplification conditions is available on request. All nucleotide sequences were obtained using the Big Dye terminator kit and the 3100 automated sequencer from Applied Biosystems. Sequence files and chromatograms were inspected using the GENALYS software (Takahashi et al. 2003; Centre National de Genotypage). As a measure of quality control, when new mutations were identified in primer binding regions, new primers were designed and sequence reactions were repeated, to avoid allele-specific amplification. All singletons observed in our data set were systematically reamplified and resequenced. Statistical Analyses On the basis of the levels of diversity observed in the CD209/CD209L genomic region, we calculated the average number of pairwise differences (π) and Watterson's estimator (θw) (Watterson 1975). Under the standard neutral model of a randomly mating population of constant size, these are unbiased estimators of the population mutation rate θ = 4N_eμ, where N_e is the diploid effective population size and μ is the mutation rate per generation per site. To test whether the frequency spectrum of mutations conformed to the expectations of this standard neutral model, we calculated Tajima's D (Tajima 1989) and Fay and Wu's H tests (Fay and Wu 2000). P values for the different tests were estimated from 10⁴ coalescent simulations under an infinite-site model, with use of a fixed number of segregating sites and the assumption of no recombination, which has been shown to be a conservative assumption (Gilad et al. 2002). 
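The summary statistics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DnaSP implementation: the function names and the encoding of haplotypes as equal-length strings of "0" (ancestral) and "1" (derived) are hypothetical choices made for the example, and the constants in Tajima's D follow the standard formulas of Tajima (1989). Significance testing by coalescent simulation is not shown.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(haps):
    """Count alignment columns that carry more than one allele."""
    return sum(1 for column in zip(*haps) if len(set(column)) > 1)


def pairwise_pi(haps):
    """Average number of pairwise differences (pi) over all haplotype pairs."""
    pairs = list(combinations(haps, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def watterson_theta(num_segregating, n):
    """Watterson's estimator: S divided by the harmonic number a1 = sum_{i<n} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_segregating / a1


def tajimas_d(haps):
    """Tajima's D: (pi - theta_w) scaled by its estimated standard deviation."""
    n = len(haps)
    s = segregating_sites(haps)
    if s == 0:
        return 0.0  # no variation, so the statistic is undefined; return 0 by convention
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pairwise_pi(haps) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Toy sample (hypothetical data): five chromosomes, four singleton mutations.
    toy = ["0000", "1000", "0100", "0010", "0001"]
    print(pairwise_pi(toy), watterson_theta(segregating_sites(toy), len(toy)), tajimas_d(toy))
```

Both π and θw are returned per locus here; dividing by the number of sequenced sites gives per-site values comparable to those the text describes. An excess of rare variants, as in the toy sample, pulls π below θw and makes D negative.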
In parallel, we estimated P values for all these tests, using the empirical distribution obtained from sequencing data of 132 genes in a panel of 24 African Americans and 23 European Americans (Akey et al. 2004). All these analyses, together with the interspecies McDonald-Kreitman (McDonald and Kreitman 1991) and K_A/K_S (Kimura 1968) tests, were performed using the DnaSP package (Rozas et al. 2003). Genetic distances between populations (F_ST) and heterozygosity values were estimated using the Arlequin package (Schneider et al. 2000). F_ST statistical significance was assessed using 10,000 bootstrap replications. To test for a deficit or an excess of heterozygosity in the neck region of CD209 and CD209L, we used BOTTLENECK (Cornuet and Luikart 1996) to compute, for each geographic region, the distribution of the heterozygosity expected from the observed number of alleles, given the sample size (n) under the assumption of mutation-drift equilibrium. This distribution was obtained through simulation of the coalescent process of n genes under two mutational models, the infinite-site model and the stepwise mutation model. In addition, to obtain information on the fraction of genetic variance in the neck region that is due to intra- and interpopulation differences, we performed an analysis of molecular variance (AMOVA), using the Arlequin package (Schneider et al. 2000). The AMOVA results were compared with those of 377 microsatellites analyzed in the same population panel (Rosenberg et al. 2002). Haplotype reconstruction was performed by use of the Bayesian statistical method implemented in Phase (v.2.1.1) (Stephens and Donnelly 2003). We applied the algorithm five times, using different randomly generated seeds, and consistent results were obtained across runs. After haplotype reconstruction, linkage disequilibrium (LD) between pairs of SNPs was computed using Lewontin's D′ index (Lewontin 1964). 
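As a rough illustration of Lewontin's D′ between two biallelic SNPs, the following Python sketch computes it from phased haplotypes. The function name and the "0"/"1" string encoding are assumptions made for this example; it does not reproduce any particular software's output or its significance test.

```python
def d_prime(haps, i, j):
    """Lewontin's D' between sites i and j of phased 0/1 haplotype strings."""
    n = len(haps)
    p_a = sum(h[i] == "1" for h in haps) / n          # derived-allele frequency at site i
    p_b = sum(h[j] == "1" for h in haps) / n          # derived-allele frequency at site j
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    d = p_ab - p_a * p_b                              # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one site is monomorphic, so D' is undefined
    return d / d_max


if __name__ == "__main__":
    # Hypothetical haplotypes: only three of the four possible gametes occur.
    print(d_prime(["11", "11", "10", "00"], 0, 1))
```

D′ reaches ±1 whenever at most three of the four possible two-site haplotypes are present, which rare alleles achieve easily. That is one reason to restrict the analysis to common markers.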
For this analysis, only markers presenting a minor allele frequency (MAF) of at least 10% were considered, since rare alleles have been shown to present a higher probability of being in significant LD than do common ones (Reich et al. 2001). The graphic display of the LD plots was constructed using GOLD (Abecasis and Cookson 2000; Center for Statistical Genetics). To support the existence of a recombination hotspot in the region under study, we used the hotspot-recombination model implemented in Phase (v.2.1.1). Under this model, we assumed that there was, at most, one hotspot of unknown position. We then estimated the background population-recombination rate (ρ) and the relative intensity of any recombination hotspot. To obtain better estimates, we increased the number of iterations of the final run of the algorithm tenfold. All our estimations were obtained by averaging results of five independent runs with use of different seed numbers. Since the model used is Bayesian, we could also estimate, for each population, the posterior probability of a hotspot of intensity >1 (λ>1) and >10 (λ>10). We obtained the gene tree and estimated the time of the most recent common ancestor (T_MRCA) for CD209, using the maximum-likelihood coalescent method implemented in GENETREE (Griffiths and Tavare 1994). The mutation rate μ for each gene was estimated on the basis of the net divergence between humans and chimpanzees, under the assumptions that the species separated 5 million years ago (MYA) and that the generation time is 20 years. Using this μ and the maximum-likelihood estimate of θ (θML), we estimated the effective population size parameter (N_e). With the assumption of a generation time of 20 years and the estimated N_e, the coalescence time, scaled in 2N_e units, was converted into years. 
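The chain of conversions described above (divergence to μ, θ to N_e, scaled coalescence time to years) can be written out as a short Python sketch. The function names are hypothetical. The sketch assumes that θ and μ are both expressed per site and that θ = 4N_eμ holds with μ per generation; it is not the GENETREE procedure itself. The divergence of 0.0157 substitutions per site is the CD209 value reported in the Results, whereas the θ and scaled-time inputs in the example are placeholders, not the study's estimates.

```python
def mutation_rate_per_year(net_divergence_per_site, split_time_years):
    """Per-site, per-year rate: divergence accumulates along both lineages, hence 2T."""
    return net_divergence_per_site / (2.0 * split_time_years)


def effective_size(theta_per_site, mu_per_site_per_generation):
    """Solve theta = 4 * N_e * mu for the diploid effective population size N_e."""
    return theta_per_site / (4.0 * mu_per_site_per_generation)


def scaled_time_to_years(t_in_2ne_units, n_e, generation_time_years):
    """Convert a coalescence time expressed in units of 2*N_e generations into years."""
    return t_in_2ne_units * 2.0 * n_e * generation_time_years


if __name__ == "__main__":
    generation_time = 20.0                              # years, as assumed in the text
    mu_year = mutation_rate_per_year(0.0157, 5e6)       # CD209 divergence, 5-MYA split
    mu_gen = mu_year * generation_time
    theta_placeholder = 1e-3                            # hypothetical per-site theta
    n_e = effective_size(theta_placeholder, mu_gen)
    print(mu_year, n_e, scaled_time_to_years(1.0, n_e, generation_time))
```

With the divergence and split time stated in the paper, the first function returns 1.57×10⁻⁹ per site per year, which matches the CD209 rate given in the Results.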
The coalescence process implemented in SIMCOAL2 (Laval and Excoffier 2004) allowed us to estimate the probability of the T_MRCA for CD209, through 2×10⁴ simulations, with use of both the number of observed segregating sites and the estimated N_e. Results We determined sequence diversity in the CD209 and CD209L genes (fig. 1) as well as length variation of the neck region in 254 chromosomes originating from three major ethnic groups: sub-Saharan Africans, Europeans, and East Asians. In addition, the orthologous sequences were obtained in four chimpanzees, to infer the ancestral state at each site, to estimate the divergence between humans and chimpanzees, and to perform a number of interspecies neutrality tests. Patterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region For CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) achieved a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing. 
The allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups is illustrated in figure 2 , along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2 ). Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or even absent (H3) in Africa. However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 
2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions. Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion.
Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations
Gene and Population   No. of Chromosomes   No. of Segregating Sites   No. of Haplotypes   HD [a] ± SD   π [b] ± SD    θw [c]   Tajima's D   Fay and Wu's H
CD209:
  African             82                   70                         26                  91.8 ± 1.6    26 ± 3.8      25.3     −.05         −19.45 [d]
  European            86                   18                         14                  79.6 ± 3.0    6.4 ± .6      6.5      −.04         −.26
  East Asian          86                   12                         11                  56.7 ± 5.5    3.3 ± .5      4.3      −.65         −3.82 [d]
  Total               254                  79                         42                  84.5 ± 1.6    13 ± 1.7      23.3
CD209L:
  African             82                   51                         40                  94.9 ± 1.2    16.1 ± .9     18.7     −.49         −1.52
  European            86                   29                         23                  88.8 ± 1.9    17.7 ± 1.0    10.5     2.01 [e]     −.61
  East Asian          86                   27                         19                  86.4 ± 1.8    16.0 ± .5     9.8      1.85 [d]     −.43
  Total               254                  63                         74                  93.6 ± .7     17.7 ± .5     18.8
Note: The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively.
[a] HD = haplotype diversity (%).
[b] Nucleotide diversity per base pair (×10⁻⁴).
[c] Watterson's estimator per base pair (×10⁻⁴).
[d] .02<P≤.05.
[e] P≤.02.
To assess the degree of population differentiation, if any, we computed Wright's F_ST (Wright 1931), using haplotype frequencies. F_ST estimates were significant (P<.0001) for all population comparisons, indicating continental differentiation for both CD209 and CD209L. However, substantial differences were observed between the two genes: the overall F_ST for CD209 among Africans, Europeans, and East Asians was 0.15, whereas CD209L presented a threefold lower F_ST value of 0.05. For both genes, the largest F_ST values were observed between African and East Asian populations, with F_ST values of 0.22 for CD209 and 0.07 for CD209L. Levels of Polymorphism and Divergence between Humans and Chimpanzees The average nucleotide diversity (π) was strikingly different, both between the two genes and among populations (table 2). Globally, π values were three- to fivefold lower for CD209 (3–7×10⁻⁴) than for CD209L (∼16×10⁻⁴), except for African populations, for whom the CD209 π value was unusually high (26×10⁻⁴) because of the presence of the highly divergent cluster A. Indeed, when cluster A was excluded from the analysis, the African π value dropped to 8×10⁻⁴. To estimate the substitution rate of each region and reveal possible mutational differences that could explain the strong contrast observed in nucleotide-diversity patterns, we determined the human-chimpanzee divergence for both genes. The average net number of differences between the two species was 77.3 substitutions (or 0.0157 substitutions per nucleotide) for CD209 and 90.6 substitutions (or 0.0171 substitutions per nucleotide) for CD209L. Assuming that the human-chimpanzee speciation occurred 5 MYA, we obtained similar nucleotide-substitution rates per site per year (CD209, 1.57×10⁻⁹; CD209L, 1.70×10⁻⁹). LD To assess the patterns of LD in the CD209/CD209L region, haplotypes for the entire genomic region were reconstructed using markers with an MAF of at least 10%. 
D′ measures among these markers were estimated for African and non-African populations independently; the graphical representation of LD levels is illustrated in figure 3 . Two distinct regions, which correspond to either CD209 or CD209L, showed strong LD and are separated by a boundary that corresponds to the intergenic region. For CD209, a block of intragenic LD was observed in both African and non-African populations. For the African sample, 89% of all pairwise comparisons indicated significant levels of LD, whereas, for non-Africans, all D′ pairwise comparisons were significant. The magnitude of intragenic recombination (and/or gene conversion) of CD209L was slightly higher than for CD209. Nevertheless, considerable and significant levels of LD were observed between sites: 83% of all LD pairwise comparisons were significant in the African group, and 99% were in the non-African sample. Overall, CD209 exhibited a blocklike structure in both groups, whereas CD209L presented lower—although mostly significant—LD levels, in particular among the non-African sample. Figure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. The strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. 
To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in Phase (v.2.1.1) (fig. 4 ). This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to absence of recombination-rate variation, whereas λ values >1 indicate the presence of a hotspot. The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ>1), and of at least 10 times the background recombination rate, Pr(λ>10). Pr(λ>1) was 100% for all population groups, and Pr(λ>10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remains unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot. Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. 
Neutrality Tests The identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2). To evaluate the selective pressures at the protein level, we performed two interspecies tests: K_A/K_S, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the K_A/K_S test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively. For the McDonald-Kreitman test, the hypothesis of neutrality was rejected for only CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3).
Table 3 McDonald-Kreitman Test Results
No. of Substitutions and P Value for:
                         Exonic Region Only                  Entire Sequence [a]
Gene and Type of Site    Synonymous   Nonsynonymous   P      Synonymous   Nonsynonymous   P
CD209:                                                .04                                 .009
  Fixed                  4            5                      51           5
  Polymorphic            6            0                      86           0
CD209L:                                               .23                                 1
  Fixed                  5            6                      78           6
  Polymorphic            0            4                      65           4
Note: The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics.
[a] Mutations in introns are considered synonymous.
Neck-Region Length Variation in Worldwide Populations The identical genomic organization of CD209 and CD209L extends to the neck region, which, in both genes, encodes a tract of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers. Of the seven alleles observed (4- to 10-repeat allele size classes), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5). 
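To make the link between allele frequencies and heterozygosity concrete, here is a minimal Python sketch. It assumes that heterozygosity is measured as Nei's unbiased gene diversity, a common choice but an assumption here, since the text does not spell out the estimator. The function name and the allele counts are hypothetical and chosen only for illustration.

```python
def gene_diversity(allele_counts):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared allele frequencies)."""
    n = sum(allele_counts.values())
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    sum_sq = sum((count / n) ** 2 for count in allele_counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


if __name__ == "__main__":
    # Hypothetical sample of 100 chromosomes: repeat number -> chromosome count.
    nearly_fixed = {7: 99, 6: 1}      # one allele at 99%, as described for CD209
    print(gene_diversity(nearly_fixed))
```

A locus at which a single allele reaches 99% gives a diversity of about 2%, in line with the overall CD209 value quoted above. A locus with several common alleles, as described for CD209L, gives a much higher value.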
Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations.

Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site.

Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations
Columns: Location and Population; Geographic Origin; No. of Chromosomes; CD209 Relative Frequency (%) by No. of Repeats (10, 8, 7, 6, 5, 4, 3, 2) and HZa; CD209L Relative Frequency (%) by No. of Repeats (10, 9, 8, 7, 6, 5, 4) and HZb
Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50
Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47
Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47
Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83
San Namibia 14 100 35.71 64.29 .71
Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32
Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54
Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50
Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62
French France 58 100 48.28 12.07 36.21 3.45 .55
French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50
Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61
North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79
Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69
Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84
Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35
Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61
Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67
Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65
Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51
Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60
Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52
Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52
Uygur China 20 100 10.00 65.00 5.00 20.00 .50
East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47
Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36
Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45
Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58
Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48
Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72
Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65
NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77
Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45
Karitiana Brazil 48 100 4.17 56.25 39.58 .54
Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33
Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46
Pima Mexico 50 100 8.00 64.00 28.00 .36
Mayan Mexico 50 100 16.00 44.00 40.00 .56
Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54
a Heterozygosity values.
b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash.
c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Oroqen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi.

Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L
Findings for Neck Regions of
                       CD209                                 CD209L
                       Heterozygosity (%)    P               Heterozygosity (%)    P
Population             Observed  Expecteda   ISMb    SMMc    Observed  Expecteda   ISMb    SMMc
African                1.6       27.9        .030    .000    50        37          .328    .229
European               .6        15.3        .158    .094    62        44          .179    .304
Middle Eastern         5.6       43.1        .018    .000    61        49          .299    .095
Central/South Asian    1.4       35.1        .003    .000    52        43          .387    .098
East Asian             1.2       34.5        .003    .000    47        42          .472    .054
Oceanian               .0        …           …       …       72        53          .071    .337
American               2.8       16.3        .323    .205    45        29          .273    .440
Total sample           2.0       49.7        .002    .000    54        47          .405    .013
Note: We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics.
a Under the infinite-site model.
b Probability of the observed heterozygosity under the infinite-site model.
c Probability of the observed heterozygosity under the stepwise mutational model.
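As a rough cross-check of the contrast between the two genes, gene diversity can be computed from the pooled ("Total") allele frequencies of table 4 as H = 1 − Σp². The sketch below is illustrative only: it assumes this simple estimator applied to pooled frequencies, which is not necessarily how the heterozygosities in tables 4 and 5 were obtained, and the function and variable names are hypothetical.

```python
# Illustrative sketch: gene diversity H = 1 - sum(p_i^2) from the pooled
# ("Total") allele frequencies of table 4. Using this simple estimator on
# pooled frequencies is an assumption made here, not the authors' procedure.

def gene_diversity(freqs_percent):
    """Return 1 - sum(p^2) for allele frequencies given in percent."""
    total = sum(freqs_percent)
    # Renormalise so rounding in the published percentages does not matter.
    props = [f / total for f in freqs_percent]
    return 1.0 - sum(p * p for p in props)

# "Total" row of table 4 (relative frequencies, %).
CD209_TOTAL = [0.05, 0.14, 98.97, 0.47, 0.09, 0.09, 0.14, 0.05]  # 10,8,7,6,5,4,3,2 repeats
CD209L_TOTAL = [0.14, 5.73, 0.33, 57.42, 11.37, 23.92, 1.08]     # 10,9,8,7,6,5,4 repeats

h_cd209 = gene_diversity(CD209_TOTAL)
h_cd209l = gene_diversity(CD209L_TOTAL)
print(f"CD209  pooled gene diversity: {h_cd209:.3f}")
print(f"CD209L pooled gene diversity: {h_cd209l:.3f}")
```

With these inputs the pooled value for CD209 is close to the 2% reported in the text, whereas the pooled CD209L value comes out near 0.6, somewhat above the reported 54%; a statistic computed on frequencies pooled across populations need not reproduce the published figure.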
Time of the Most Recent Common Ancestor for CD209

The low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavare 1994) for estimation of the time scale of the origin and evolution of this gene. Since this method assumes an infinite-site model without recombination, the same analysis for CD209L was not conducted because of the substantial number of recombinant haplotypes observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6. The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed on both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples were restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ (θML) for CD209 was 8.4. On the basis of this θML value and the estimated mutation rate (1.54×10^−4 per gene per generation), the effective population size (Ne) was 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The TMRCA of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest TMRCA values estimated so far in the human genome (Excoffier 2002).

Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported.

Discussion

The CD209/CD209L region possesses a number of characteristics that make it a powerful tool for evolutionary inference.
These two genes are not in LD, despite their very close physical proximity (∼15 kb), and each of them behaves as an independent genetic entity. Moreover, our results suggest that the CD209/CD209L region is a uniform landscape of genomic forces, since the two lectin-coding genes present similar mutation rates, as well as high nucleotide identity and conserved exon-intron organization (fig. 1).

Contrasting Patterns of Diversity in the CD209/CD209L Region

Our diversity study revealed completely different patterns for the two genes. First, levels of nucleotide diversity (π) were found to be much lower for CD209 than for CD209L (table 2). On the basis of 1.42 million SNPs, the International SNP Map Working Group defined 7.5×10^−4 as the average value of nucleotide diversity for the human genome and showed that 95% of all bins presented π values varying from 2.0×10^−4 to 15.8×10^−4 (Sachidanandam et al. 2001). In addition, an independent study analyzed nucleotide and haplotype diversity for 313 genes and defined the average π value as 5.4×10^−4 (Stephens et al. 2001). In this context, the values observed for CD209 (3–7×10^−4) are in agreement with these genomewide estimates, with the exception of the African sample, which showed extreme levels of diversity (26.0×10^−4) because of the presence of cluster A. By contrast, the π values observed for CD209L (16–18×10^−4) are at least twofold higher than average genome estimates and fall into the upper limit of the 95% CI defined by the SNP Consortium (Sachidanandam et al. 2001). This contrast in nucleotide diversity between the two genes can be explained either by a disparity in local mutation rates or by actual differences in selective pressures. However, no major differences in mutation rates (1.57×10^−9 vs. 1.70×10^−9) were observed between the two homologues, nor was there substantial variation in GC content, which has been positively correlated with mutation rates and levels of polymorphisms (Sachidanandam et al. 2001; Smith et al.
2002; Waterston et al. 2002; Hellmann et al. 2003). Indeed, the GC content for CD209 (53.7%) was slightly higher than that observed for CD209L (50.9%), which reinforces the idea that different selective pressures may indeed have been the driving force behind the distinct patterns of diversity observed. Second, the patterns of repeat variation in the neck region also turned out to be strikingly different between the two genes. CD209 showed levels of heterozygosity of only 2%, whereas CD209L presented an extraordinarily high level of worldwide diversity, with an overall heterozygosity of 54% (table 5 and fig. 5). Although the neck regions of both genes share 92% of nucleotide identity, nonuniform mutation rates could, again, explain the patterns observed. However, this does not seem to be the case, since mutation-rate variation should influence the number of alleles observed rather than their frequencies, which are subject either to genetic drift or to natural selection. Indeed, we observed an even higher number of repeat alleles for CD209 (eight alleles) than for CD209L (seven alleles) (table 4 and fig. 5). Overall, differences in genomic forces seem to be insufficient to explain the contrasting patterns observed at both the sequence and neck-region length variation levels; therefore, the action of differential selective pressures acting on these genes becomes the most plausible scenario. CD209: The Signature of a Functional Constraint For CD209, not only nucleotide diversity but also F ST intercontinental values (0.15) were in conformity with previous worldwide estimations (Harpending and Rogers 2000; Akey et al. 2002; Cavalli-Sforza and Feldman 2003). For frequency-spectrum–based tests, only Fay and Wu's H test detected an excess of highly frequently derived alleles for the African and East Asian samples, a picture that may be interpreted as the result of a selective sweep. 
However, the significantly negative value observed in Africa is, again, exclusively due to the presence of cluster A, since 22 of the 35 fixed SNPs distinguishing it from cluster B corresponded to the derived allelic status in the latter cluster. Because cluster B accounts for 85% of the African variability, a clear excess of frequently derived alleles was observed. The extent to which the presence of this cluster is due to either natural selection or population structure will be discussed in detail below. For East Asia, the significance of the H test is also questionable when accounting for the confounding effects of demography. Indeed, when we plotted our H value against the empirical distribution of 132 H values from non-African populations (Akey et al. 2004), the East Asian P value became nonsignificant (P=.36). This observation reinforces the idea that the H test is particularly sensitive to past bottlenecks and/or population subdivision (Przeworski 2002). Thus, regarding the global levels of sequence diversity, the CD209 locus seems to evolve under evolutionary neutrality. Nevertheless, when we focused our analyses at the protein level, signs of natural selection were uncovered. Indeed, the McDonald-Kreitman test rejected neutrality for this gene because of a clear excess of polymorphic synonymous sites (i.e., a lack of nonsynonymous variants). In addition, when the number of synonymous sites (146) versus nonsynonymous sites (499) was compared with the observed number of synonymous (5) versus nonsynonymous (0) mutations, we detected a significant lack of nonsynonymous mutations (two-tailed Fisher exact test, P=6.3×10−4). These observations point to a strong selective constraint acting on CD209 that prevents the accumulation of amino acid replacements over time. Further support for a functional constraint in CD209 comes from the patterns of diversity observed in the neck region. In contrast to CD209L, virtually no variation was observed at CD209 (fig. 
5A), with the 7-repeat allele accounting for 99% of the total variability. Moreover, the low levels of heterozygosity observed resulted in a consistent rejection of mutation-drift equilibrium in almost all geographical regions (table 5). The probability of finding such a low heterozygosity value, given the overall number of alleles observed, was estimated to be <0.2%, independent of the mutational model considered (table 5). Thus, the fact that no alleles other than the 7-repeat allele have increased in frequency, together with recent studies addressing the functional consequences of repeat-number variation in this region (Bernhard et al. 2004; Feinberg et al. 2005), strongly suggests a clear reduced fitness of any allele other than the 7-repeat allele. Interestingly, it has been recently shown that a protein with two fewer repeats (a 5-repeat allele) results in a partial dissociation of the final tetramer, whereas a protein with <5 repeats exhibits a dramatic reduction in overall stability (Feinberg et al. 2005), with all these differences having a direct impact on the quality of ligand-binding functions (Bernhard et al. 2004). Taken together, the patterns of diversity observed at CD209 clearly point to a strong functional constraint acting on this gene and further support the proposed crucial role of this lectin in pathogen recognition and in the early steps of immune response (Geijtenbeek et al. 2000b, 2004). CD209L: Relaxation of the Functional Constraint or Balancing Selection? In clear contrast to its homologue, CD209L presented extremely elevated nucleotide-diversity levels. High levels of diversity can result either from a relaxation of the functional constraint, which allows the stochastic accumulation of new mutations, or from the action of balancing selection, which maintains over time two or more functionally different alleles (and all linked variation) at intermediate frequencies. Several lines of evidence lend support to the selective hypothesis. 
First, if CD209L nucleotide diversity has been driven by the action of balancing selection, population-genetics relationships would have been altered accordingly. In this context, diversity studies in neutral, or assumedly neutral, regions of the genome, such as the Y chromosome (Underhill et al. 2000; Hammer et al. 2001; Jobling and Tyler-Smith 2003), mtDNA (Wallace et al. 1999; Ingman et al. 2000; Mishmar et al. 2003), Alu insertions (Watkins et al. 2001), and some autosomal genes (Stephens et al. 2001; Akey et al. 2004), showed that African populations are genetically more diverse than are non-Africans, an observation generally interpreted as support for the “Out of Africa” model for the origin of modern humans (Lewin 1987). For CD209L, even though we observed 1.5 times more segregating sites in African than in non-African populations, as indicated by the higher θw value found in Africa, similar values of nucleotide diversity were detected in the three groups, with Europeans presenting even higher π values than do Africans. This unusual scenario, which is at odds with neutral expectations, has already been described for other regions of the genome, such as the β-globin gene and the 5′ cis-regulatory region of CCR5, for which the action of balancing selection has been convincingly proposed (Harding et al. 1997; Bamshad et al. 2002). Second, balancing selection tends to increase within-population diversity while decreasing FST, compared with neutrally evolving loci (Cavalli-Sforza 1966; Harpending and Rogers 2000; Akey et al. 2002; Bamshad and Wooding 2003; Cavalli-Sforza and Feldman 2003). Indeed, our data are compatible with these predictions, since the FST value of 5% observed for CD209L is threefold lower than that estimated for CD209 (15%) and is similar to that found, for example, for the bitter-taste receptor gene (5.6%), for which there is compelling evidence of balancing-selection action (Wooding et al. 2004).
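The contrast between θw, which depends only on the number of segregating sites, and π, which depends on allele frequencies, is what Tajima's D summarises: D is positive when π exceeds θw (an excess of intermediate-frequency alleles) and negative in the opposite case. The following is a minimal sketch of the standard textbook formulas; the function names are hypothetical and the numbers in the usage lines are made-up inputs, not values from table 2.

```python
import math

def watterson_theta(num_seg_sites, n):
    """Watterson's estimator per locus: S divided by a1 = sum_{i<n} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_seg_sites / a1

def tajimas_d(pi, num_seg_sites, n):
    """Tajima's D from pi (mean pairwise differences per locus),
    the number of segregating sites S, and the sample size n."""
    if num_seg_sites < 1:
        raise ValueError("D is undefined when there are no segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    s = num_seg_sites
    # Numerator: difference between the two estimators of theta;
    # denominator: its estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical inputs: 40 chromosomes, 20 segregating sites.
theta_w = watterson_theta(20, 40)
d_excess_intermediate = tajimas_d(8.0, 20, 40)  # pi above theta_w
d_excess_rare = tajimas_d(3.0, 20, 40)          # pi below theta_w
print(theta_w, d_excess_intermediate, d_excess_rare)
```

Significance still has to be assessed against a null distribution, which is where the demographic assumptions discussed in the text enter.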
Third, results of our Tajima's D analysis were significantly positive for European and East Asian populations, because of the skew of CD209L frequency spectrum toward an excess of intermediate-frequency alleles (table 2), a pattern that further supports the action of balancing selection. However, since the null model used to assess significance makes unrealistic assumptions about past population demography (i.e., constant population sizes), the rejection of the standard neutral model cannot be interpreted as unambiguous evidence of selection. Indeed, the observation that only non-African populations showed a significant departure from neutrality raises the question of whether these patterns could have resulted instead from the bottleneck that occurred during the Out of Africa exodus. A way to circumvent this conundrum is to analytically integrate the fact that demography affects all the genome equally, whereas selection directs its effects toward specific loci. Thus, to correct for the confounding effects of demography, we plotted our results against the empirical distributions of Akey et al. (2004) for Tajima's D statistics. Our values remained significant for CD209L, which therefore reinforces the idea that the pattern observed is unlikely to be the sole result of demography. Last, if the patterns of variation in CD209L represent the molecular signature of balancing selection, at least in non-Africans, then a functional target of such selective regime is needed. In this context, the neck region constitutes an excellent candidate, since it plays a major mediating role in the orientation and flexibility of the carbohydrate-recognition domain. Since this domain is directly involved in pathogen recognition, neck-region length variation has important consequences for the pathogen-binding properties of these lectins (Mitchell et al. 2001; Bernhard et al. 2004; Feinberg et al. 2005). 
In perfect agreement with the results of our sequence-based data set, higher diversity in repeat variation was observed in the neck region among non-African populations (Native Americans excepted). Out of Africa, at least three alleles account for most population diversity, whereas, in Africa, the 6- and 7-repeat alleles alone account for 96% of the global variability (fig. 5B). Again, the higher diversity observed out of Africa could be due to a higher level of relaxation of the functional constraint of the neck region in non-African compared with African populations, which would lead to a random accumulation of proteins with varying neck-region lengths among non-Africans. Conversely, these patterns could also be explained by the action of balancing selection in non-Africans and could therefore point to the neck region as the functional target of such selective regime. To evaluate the plausibility of these two conflicting scenarios, we compared the variation in the CD209L neck region with that inferred from 377 neutral autosomal microsatellites typed elsewhere for the same population panel (Rosenberg et al. 2002). We reasoned that if CD209L diversity has been shaped only by demography (i.e., bottleneck out of Africa), the distribution of genetic variance at different hierarchical levels should be comparable to that inferred through the neutral markers. On the other hand, if selection has driven the CD209L neck-region diversity, population-genetics distances would be influenced accordingly and would therefore differ from neutral expectations. Indeed, the AMOVA values inferred for CD209L fell systematically outside the 95% CI defined for the microsatellite data set (table 6 ). We observed that populations within Europe, Asia, the Middle East, and Oceania exhibited lower-than-expected diversity among populations within the same region. 
A reduction of genetic distances between populations is expected under balancing selection; therefore, the results from the CD209L neck region favor, once again, the action of this selective regime in most non-African populations, to the detriment of the neutral hypothesis. One may argue that the differences in the proportions of genetic variance between our data and those of Rosenberg et al. (2002) could be due to differences in the pace of mutation between microsatellite loci and our neck repeated region that could be considered a “coding minisatellite.” However, under neutrality, differences in mutation rate should have a similar and proportional effect in all population comparisons and should influence all values with a similar tendency (i.e., higher or lower values). Indeed, this is not the case: populations within Europe, the Middle East, Central/South Asia, East Asia, and Oceania turned out to be genetically closer than expected, whereas populations within Africa and the Americas exhibited the opposite pattern (table 6), which makes it highly unlikely that mutation-rate differences influenced our conclusions.

Table 6 AMOVA for the Neck Region of CD209L
AMOVA Value (95% CI) Inferred for CD209Lb
Samplea             No. of Regions  No. of Populations  Within Populations  Among Populations within Regions  Among Regions
World               7               52                  90.4 (93.8–94.3)    2.1 (2.3–2.5)                     7.57 (3.3–3.9)
Africa              1               6                   93.9 (96.7–97.1)    6.1 (2.9–3.3)
Eurasia             3               21                  97.0 (98.2–98.4)    .2 (1.1–1.3)                      2.8 (.4–.6)
Europe              1               8                   99.5 (99.1–99.4)    .5 (0.6–0.9)
Middle-East         1               4                   100 (98.6–98.8)     0 (1.2–1.4)
Central/South Asia  1               9                   99.5 (98.5–98.8)    .5 (1.2–1.5)
East Asia           1               18                  99.3 (98.6–98.9)    .7 (1.1–1.4)
Oceania             1               2                   96.0 (92.8–94.3)    4.0 (5.7–7.2)
America             1               5                   86.7 (87.7–89)      13.3 (11.0–12.3)
Note: No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus.
a Populations are grouped as described by Rosenberg et al. (2002).
b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002).

Taken together, the integration of the results from levels of nucleotide and amino acid diversity, neutrality tests, population-genetics distances, and neck-region length variation in CD209 and CD209L clearly points to a situation in which CD209 has been under a strong selective constraint that prevents accumulation of any amino acid changes over time, whereas CD209L variability has most likely been driven by the action of balancing selection, at least in non-African populations.

The Footprints of Ancestral Population Diversity

In apparent dichotomy with the strong selective constraint described for CD209, we observed an unusual excess of diversity, with 35 fixed differences separating the two basal branches of the gene tree (fig. 6). In addition, we estimated a TMRCA of 2.8±0.22 MYA, a time that places the most recent common ancestor of CD209 back in the Pliocene epoch, before the estimated time for the origins of the genus Homo (∼1.9 MYA) (Wood 1996; Wood and Collard 1999). A number of studies have already reported loci that present unusually deep coalescent times (Harris and Hey 1999; Zhao et al. 2000; Webster et al. 2003; Garrigan et al. 2005a, 2005b), but our estimation for CD209 remains one of the deepest TMRCA values yet reported (Excoffier 2002). The probability of finding such a deep coalescence time under a scenario of a random-mating population was estimated, through a coalescent process (Laval and Excoffier 2004), to be very low (P=.018) (see fig. 7). In addition to the unexpected antiquity of the CD209 locus, we observed a peculiar tree topology made of two highly divergent and frequency-unbalanced lineages, cluster A embracing only 2 internal haplotypes and cluster B comprising the remaining 23 (fig. 6).

Figure 7 Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209.
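To make the time scale concrete, the effective size quoted in the Results follows from θ = 4Neμ, and the chance of a deep coalescence under random mating can be approximated by simulating the standard neutral coalescent. The sketch below is a simplified stand-in for the cited simulation procedure, not a reproduction of it: it assumes a constant-size panmictic population, the generation time in the usage lines is an assumption (a hypothetical value), and all function names are hypothetical. The sample size of 225 corresponds to the 254 chromosomes minus the 29 excluded from the gene tree.

```python
import random

def effective_size(theta, mu_per_gene_per_generation):
    """Solve theta = 4 * Ne * mu for Ne (diploid autosomal locus)."""
    return theta / (4.0 * mu_per_gene_per_generation)

def simulate_tmrca(n, rng):
    """One TMRCA draw under the standard coalescent,
    expressed in units of 2*Ne generations."""
    t = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the waiting time to the next
        # coalescence is exponential with rate k*(k-1)/2.
        t += rng.expovariate(k * (k - 1) / 2.0)
    return t

def tail_probability(t_obs_years, n, ne, generation_years, reps=20000, seed=1):
    """Fraction of simulated genealogies at least as deep as t_obs_years."""
    rng = random.Random(seed)
    # Convert years to coalescent units of 2*Ne generations.
    t_obs = t_obs_years / (2.0 * ne * generation_years)
    hits = sum(simulate_tmrca(n, rng) >= t_obs for _ in range(reps))
    return hits / reps

ne = effective_size(8.4, 1.54e-4)  # theta_ML and mutation rate from the Results
print(round(ne))                   # 13636

# Assumed inputs: 25-year generations (hypothetical) and fewer replicates
# than the 2x10^4 of figure 7, to keep the example fast.
p = tail_probability(2.8e6, n=225, ne=ne, generation_years=25.0, reps=5000)
print(p)
```

The resulting tail probability is sensitive to the assumed generation time and to the demographic model, so it should be read as an order-of-magnitude illustration rather than a recomputation of the reported P value.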
Different hypotheses can account for such elongated and divergent haplotype patterns. Indeed, the high levels of nucleotide identity between CD209 and CD209L could have led to gene conversion between the two genes, an event that would explain the outlier position of cluster A in the context of CD209 phylogeny. We reasoned that if gene conversion has occurred, we expect that the derived alleles distinguishing clusters A and B in CD209 would correspond to the allelic state observed in their homologous positions in CD209L. Of all positions, only four fit this criterion. In addition, these positions were not physically clustered, which therefore excludes a major gene-conversion event as the explanation of the divergent CD209 phylogeny. Two other circumstances may be responsible for the topology and the time depth of the CD209 gene tree: long-standing balancing selection or ancient population structure, with Africa, in both cases, being the arena of such events (i.e., cluster A is restricted to Africa). Several lines of evidence argue against the balancing-selection hypothesis. First, under this selective regime, one would expect that Tajima's D test would also point in this direction by yielding significantly positive values, which is not the case (table 2). Second, such a long-standing balancing selection in Africa would have entailed a number of recombinant haplotypes between clusters A and B, which, again, is not the case, as illustrated by the high LD levels at CD209 (fig. 3). Third, a claim of balancing selection at this locus must imply a functional difference between the two balanced alleles. Indeed, three nonsynonymous mutations, situated in the neck region, separate cluster A and B, and they could correspond to the alleles under selection. 
But, if the neck region is the target of selection, it is more likely that the balanced alleles would correspond to different numbers of repeats rather than point nucleotide variation within each track, as observed for CD209L and suggested by functional studies (Bernhard et al. 2004; Feinberg et al. 2005). Since no variation in the number of repeats was detected between the two clusters, we predict that there are no major functional differences between the two lineages. Taken together, maintenance of ancient lineages by balancing selection does not seem to be responsible for the observed haplotype divergence. In this view, the patterns observed are best explained by an ancestral population structure on the African continent. Indeed, several studies have already proposed that African populations must have been more strongly subdivided and isolated than non-African ones (Harris and Hey 1999; Labuda et al. 2000; Excoffier 2002; Goldstein and Chikhi 2002; Harding and McVean 2004; Satta and Takahata 2004; Garrigan et al. 2005a). In particular, a recent study of the Xp21.1 locus presented convincing statistical evidence that supports the hypothesis that our species does not descend from a single, historically panmictic population (Garrigan et al. 2005a). The divergent haplotype pattern observed at the Xp21.1 locus prompted those authors to explain their data under the isolation-and-admixture (IAA) model and/or a metapopulation model (Harding and McVean 2004; Wakeley 2004). Indeed, as observed for CD209, under an IAA model, the two basal branches are expected to be longer than those under a Wright-Fisher model, depending on the length of time subpopulations spent in isolation. The extent to which the IAA model fits the data depends on the number of mutations, referred to as “congruent sites,” occurring in the two basal branches of the genealogy. For Xp21.1, 10 congruent sites out of 24 polymorphisms were observed (i.e., ∼42% of the total number of sites).
We applied the same approach to CD209 and obtained a very similar percentage of ∼45%, in good accordance with the IAA model. Our observations, together with a number of autosomal diversity studies, show that modern human diversity appears to have kept genetic traces of admixture among archaic hominid populations. However, a number of questions remain unanswered, such as the time when these admixture events occurred (i.e., before or after the appearance of anatomically modern humans), the precise quantitative contribution of ancient genetic material to our modern gene pool, and the geographic provenance of these genetic vestiges. Conclusions The need of continuous evolution for both the human host and the pathogens is predicted by the Red Queen hypothesis (Van Valen 1973; Bell 1982), in reference to the remark of the Red Queen to Alice in Through the Looking Glass (Carroll 1872): “Now, here, you see, it takes all the running you can do, to keep in the same place.” This metaphor provides a conceptual framework for understanding how interactions between the two species lead to constant natural selection for adaptation and counteradaptation. In this context, one feature exploited by the host immunity genes to increase their defense potential is gene duplication by retention, through conservation of one duplicate, of the currently useful function of the encoded protein, while its twin is liberated to mutate and possibly acquire novel functions (Ohno 1970; Trowsdale and Parham 2004). The lectins CD209 and CD209L represent a prototypic model of a duplicated progeny of ancestral genes that interact with a vast spectrum of pathogens. Our results clearly indicate that these duplicated genes have evolved, and might still evolve, under completely different evolutionary pressures. Whereas one, CD209, shows signals of strong conservation, its paralogue, CD209L, exhibits an excess of sequence diversity compatible with the action of balancing selection. 
In addition, the strong contrast observed in length variation of the neck region between the two genes may have important consequences in medical genetics. In this context, association studies are now needed that correlate length variation of the neck region and susceptibility to infectious diseases whose etiological agents are known to interact with one (or both) of these lectins. More generally, our study has revealed that even a short segment of the human genome can help uncover an extraordinarily complex evolutionary history, including different pathogen pressures on host immunity genes, as well as traces of ancient population structure in the African continent. The coming years will certainly bring unprecedented large data sets of sequence diversity, genomewide and populationwide, with each genomic region possibly revealing a different aspect of human history. The integration of all these apparently independent pieces of the same reality will provide us with a much broader and more realistic view of the demographic history of the human species, as well as of human adaptation to the different environmental conditions imposed not only by pathogens but also by other major factors such as climate and nutritional resources. |
Introduction

Infectious diseases have been paramount among the threats to health and survival for most of human evolutionary history (Haldane 1949; Lederberg 1999; Harpending and Rogers 2000; Cooke and Hill 2001). The interaction of the human host with a wide variety of pathogens has been accompanied by genetic adaptations to spatially and temporally fluctuating selective pressures imposed by the infectious agents. Numerous studies have sought the genetic imprint of natural selection imposed by pathogen pressures in human genes involved in immune response or, more generally, in host-pathogen interactions (Vallender and Lahn 2004). For example, natural selection has acted on such genes as MHC, β-globin, G6PD, IL-2, IL-4, TNFSF5, the Duffy blood group genes, and CCR5 (Ohta 1991; Hughes et al. 1994; Flint et al. 1998; Hamblin and Di Rienzo 2000; Tishkoff et al. 2001; Bamshad et al. 2002; Sabeti et al. 2002; Verrelli et al. 2002). However, little is known about genetic variation of genes involved in direct recognition of pathogens, or pathogens' products, and virtually no studies have investigated the extent to which pathogens have exerted selective pressures on the innate immune system. The phylogenetically ancient innate immune system governs the initial detection of pathogens and stimulates the first line of host defense (Medzhitov and Janeway 1998a, 2000, 2002; Janeway and Medzhitov 2002). Recognition of pathogens is mediated by phagocytic cells through germline-encoded receptors, known as “pattern recognition receptors,” which detect pathogen-associated molecular patterns that are characteristic products of microbial physiology (Kimbrell and Beutler 2001; Janeway and Medzhitov 2002). This initial interaction is then translated into a set of endogenous signals that ultimately lead to the induction of the adaptive immune response (Medzhitov and Janeway 1998b).
In recent years, the C-type lectin receptors have received much attention in the area of innate immunology, yielding novel functional insights into the primary interface between host and pathogens (Medzhitov 2001; Cook et al. 2003; Fujita et al. 2004; Geijtenbeek et al. 2004; McGreal et al. 2004). In this context, two prototypic members of the C-type lectin–receptor family are particularly interesting, since they can act as both cell-adhesion receptors and pathogen-recognition receptors. These lectins are CD209 (DC-SIGN: dendritic cell–specific ICAM-3 grabbing nonintegrin [MIM 604672]) and its close relative CD209L (L-SIGN: liver/lymph node–specific ICAM-3 grabbing nonintegrin [MIM 605872]) (Curtis et al. 1992; Geijtenbeek et al. 2000b, 2004; Soilleux et al. 2000; Pohlmann et al. 2001). These lectin-coding genes are located on chromosome 19p13.2-3, within an ∼26-kb segment, and result from a duplication of an ancestral gene (Bashirova et al. 2003; Soilleux 2003). An additional characteristic of both CD209 and CD209L is the presence of a neck region, primarily made up of seven highly conserved 23-aa repeats, that separates the carbohydrate-recognition domain involved in pathogen binding from the transmembrane region. This neck region presents high nucleotide identity between repeats, both within each molecule and between CD209 and CD209L. It has been shown that this region plays a crucial role in the oligomerization and support of the carbohydrate-recognition domain; therefore, it influences the pathogen-binding properties of these two receptors (Soilleux et al. 2000, 2003; Feinberg et al. 2005). With regard to expression profiles, CD209 is expressed primarily on phagocytic cells, such as dendritic cells and macrophages, whereas CD209L expression is restricted to endothelial cells in liver and lymph nodes (Bashirova et al. 2001; Soilleux et al. 2001, 2002). 
As pathogen-recognition receptors, the two lectins have been shown to recognize a vast range of microbes, some of which are of major public health importance (Geijtenbeek et al. 2004). Indeed, CD209 captures bacteria such as Mycobacterium tuberculosis, Helicobacter pylori, and certain Klebsiella pneumoniae strains; viruses such as HIV-1, Ebola virus, cytomegalovirus, hepatitis C virus, Dengue virus, and SARS-coronavirus; and parasites like Leishmania pifanoi and Schistosoma mansoni (Geijtenbeek et al. 2000a, 2003; Alvarez et al. 2002; Colmenares et al. 2002; Halary et al. 2002; Appelmelk et al. 2003; Lozach et al. 2003; Tailleux et al. 2003; Tassaneetrithep et al. 2003; Bergman et al. 2004; Marzi et al. 2004). With regard to CD209L, studies to date have shown an interaction with a variety of viruses, including HIV, hepatitis C, Ebola, and coronavirus, as well as with the parasite Schistosoma mansoni (Bashirova et al. 2001; Alvarez et al. 2002; Gardner et al. 2003; Jeffers et al. 2004; Van Liempt et al. 2004). In this context, the efficiency of the two lectins in pathogen recognition and subsequent processing may have important consequences for the quality of host immune responses and consequent pathogen control and/or clearance. An important step forward in the understanding of human adaptation to pathogens and control of infectious diseases includes the description of quality and quantity of genetic variation in genes involved in host recognition of infectious agents. Given the direct interaction of CD209 and CD209L with a large variety of pathogens, the CD209/CD209L genomic region provides an excellent model system to illustrate the extent to which pathogens have exerted selective pressures on host immunity genes. An additional feature that makes these genes highly interesting in evolutionary studies is that they are likely to have been influenced by similar genomic forces (recombination, mutation rates, etc.) 
because of their close physical proximity (∼15 kb), high nucleotide (73%) and amino acid (77%) identity, and identical exon-intron organization (Soilleux 2003) (fig. 1). In addition, it has been proposed that gene duplication of immunity genes is a molecular strategy developed by the host to enlarge its defense potential (Ohno 1970; Trowsdale and Parham 2004). A number of immune-system gene families have evolved, by gene duplication followed by natural selection, to provide responses to a wider range of pathogens, with well-documented examples in immunoglobulin and MHC genes (Hughes et al. 1994; Ota et al. 2000). In this context, duplicated genes in cis, like CD209 and CD209L, may have undergone differential selective pressures to enlarge the defense role of these lectins. To address these complex issues, we performed a sequence-based survey of the entire CD209/CD209L region in a panel of individuals of different ethnic origins. Here, we report evidence showing that these two closely related innate immunity genes have gone through completely different evolutionary processes that are reflected in their current patterns of diversity. In addition, our study provides novel insights into how pathogens have shaped the patterns of variability of immunity genes resulting from gene duplication. Figure 1 Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. |
title | Introduction |
p | Infectious diseases have been paramount among the threats to health and survival for most of human evolutionary history (Haldane 1949; Lederberg 1999; Harpending and Rogers 2000; Cooke and Hill 2001). The interaction of the human host with a wide variety of pathogens has been accompanied by genetic adaptations to spatially and temporally fluctuating selective pressures imposed by the infectious agents. Numerous studies have sought the genetic imprint of natural selection imposed by pathogen pressures in human genes involved in immune response or, more generally, in host-pathogen interactions (Vallender and Lahn 2004). For example, natural selection has acted on such genes as MHC, β-globin, G6PD, IL-2, IL-4, TNFSF5, the Duffy blood group genes, and CCR5 (Ohta 1991; Hughes et al. 1994; Flint et al. 1998; Hamblin and Di Rienzo 2000; Tishkoff et al. 2001; Bamshad et al. 2002; Sabeti et al. 2002; Verrelli et al. 2002). However, little is known about genetic variation of genes involved in direct recognition of pathogens, or pathogens' products, and virtually no studies have investigated the extent to which pathogens have exerted selective pressures on the innate immune system. |
p | The phylogenetically ancient innate immune system governs the initial detection of pathogens and stimulates the first line of host defense (Medzhitov and Janeway 1998a, 2000, 2002; Janeway and Medzhitov 2002). Recognition of pathogens is mediated by phagocytic cells through germline-encoded receptors, known as “pattern recognition receptors,” which detect pathogen-associated molecular patterns that are characteristic products of microbial physiology (Kimbrell and Beutler 2001; Janeway and Medzhitov 2002). This initial interaction is then translated into a set of endogenous signals that ultimately lead to the induction of the adaptive immune response (Medzhitov and Janeway 1998b). |
p | In recent years, the C-type lectin receptors have received much attention in the area of innate immunology, yielding novel functional insights into the primary interface between host and pathogens (Medzhitov 2001; Cook et al. 2003; Fujita et al. 2004; Geijtenbeek et al. 2004; McGreal et al. 2004). In this context, two prototypic members of the C-type lectin–receptor family are particularly interesting, since they can act as both cell-adhesion receptors and pathogen-recognition receptors. These lectins are CD209 (DC-SIGN: dendritic cell–specific ICAM-3 grabbing nonintegrin [MIM 604672]) and its close relative CD209L (L-SIGN: liver/lymph node–specific ICAM-3 grabbing nonintegrin [MIM 605872]) (Curtis et al. 1992; Geijtenbeek et al. 2000b, 2004; Soilleux et al. 2000; Pohlmann et al. 2001). These lectin-coding genes are located on chromosome 19p13.2-3, within an ∼26-kb segment, and result from a duplication of an ancestral gene (Bashirova et al. 2003; Soilleux 2003). An additional characteristic of both CD209 and CD209L is the presence of a neck region, primarily made up of seven highly conserved 23-aa repeats, that separates the carbohydrate-recognition domain involved in pathogen binding from the transmembrane region. This neck region presents high nucleotide identity between repeats, both within each molecule and between CD209 and CD209L. It has been shown that this region plays a crucial role in the oligomerization and support of the carbohydrate-recognition domain; therefore, it influences the pathogen-binding properties of these two receptors (Soilleux et al. 2000, 2003; Feinberg et al. 2005). With regard to expression profiles, CD209 is expressed primarily on phagocytic cells, such as dendritic cells and macrophages, whereas CD209L expression is restricted to endothelial cells in liver and lymph nodes (Bashirova et al. 2001; Soilleux et al. 2001, 2002). 
As pathogen-recognition receptors, the two lectins have been shown to recognize a vast range of microbes, some of which are of major public health importance (Geijtenbeek et al. 2004). Indeed, CD209 captures bacteria such as Mycobacterium tuberculosis, Helicobacter pylori, and certain Klebsiella pneumoniae strains; viruses such as HIV-1, Ebola virus, cytomegalovirus, hepatitis C virus, Dengue virus, and SARS-coronavirus; and parasites like Leishmania pifanoi and Schistosoma mansoni (Geijtenbeek et al. 2000a, 2003; Alvarez et al. 2002; Colmenares et al. 2002; Halary et al. 2002; Appelmelk et al. 2003; Lozach et al. 2003; Tailleux et al. 2003; Tassaneetrithep et al. 2003; Bergman et al. 2004; Marzi et al. 2004). With regard to CD209L, studies to date have shown an interaction with a variety of viruses, including HIV, hepatitis C, Ebola, and coronavirus, as well as with the parasite Schistosoma mansoni (Bashirova et al. 2001; Alvarez et al. 2002; Gardner et al. 2003; Jeffers et al. 2004; Van Liempt et al. 2004). In this context, the efficiency of the two lectins in pathogen recognition and subsequent processing may have important consequences for the quality of host immune responses and consequent pathogen control and/or clearance. |
p | An important step forward in the understanding of human adaptation to pathogens and control of infectious diseases includes the description of quality and quantity of genetic variation in genes involved in host recognition of infectious agents. Given the direct interaction of CD209 and CD209L with a large variety of pathogens, the CD209/CD209L genomic region provides an excellent model system to illustrate the extent to which pathogens have exerted selective pressures on host immunity genes. An additional feature that makes these genes highly interesting in evolutionary studies is that they are likely to have been influenced by similar genomic forces (recombination, mutation rates, etc.) because of their close physical proximity (∼15 kb), high nucleotide (73%) and amino acid (77%) identity, and identical exon-intron organization (Soilleux 2003) (fig. 1). In addition, it has been proposed that gene duplication of immunity genes is a molecular strategy developed by the host to enlarge its defense potential (Ohno 1970; Trowsdale and Parham 2004). A number of immune-system gene families have evolved, by gene duplication followed by natural selection, to provide responses to a wider range of pathogens, with well-documented examples in immunoglobulin and MHC genes (Hughes et al. 1994; Ota et al. 2000). In this context, duplicated genes in cis, like CD209 and CD209L, may have undergone differential selective pressures to enlarge the defense role of these lectins. To address these complex issues, we performed a sequence-based survey of the entire CD209/CD209L region in a panel of individuals of different ethnic origins. Here, we report evidence showing that these two closely related innate immunity genes have gone through completely different evolutionary processes that are reflected in their current patterns of diversity. 
In addition, our study provides novel insights into how pathogens have shaped the patterns of variability of immunity genes resulting from gene duplication. Figure 1 Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. |
figure | Figure 1 Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. |
label | Figure 1 |
caption | Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. |
p | Scaled diagram of the CD209/CD209L genomic region. Sequenced regions are represented in gray. For CD209, we sequenced a total of 5,500 bp per chromosome, and, for CD209L, 5,391 bp per chromosome. The neck region corresponding to exon 4 and composed of seven coding repeats is also shown. |
sec | Material and Methods Population Samples Sequence variation of the CD209/CD209L region was determined in 41 sub-Saharan Africans, 43 Europeans, and 43 East Asians, in a total of 254 chromosomes from the Human Genome Diversity Panel (HGDP)–CEPH panel (Cann et al. 2002). More-detailed information about the composition of the three major ethnic groups can be found in table 1. The variation in the repeat number of the neck region of CD209 and CD209L was defined in the entire HGDP-CEPH panel, comprising 1,064 DNA samples from 52 worldwide populations. In addition, the orthologous regions for both genes were sequenced in four chimpanzees (Pan troglodytes). Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity Location and Population Geographic Origin No. of Chromosomes Africa: Biaka Pygmies Central African Republic 12 Mbuti Pygmies Democratic Republic of Congo 10 Bantu Fang^a Gabon 12 Bantu, northeastern Kenya 10 Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 Mandenka Senegal 12 San Namibia 8 Yoruba Nigeria 10 Europe: Adygei Russian Caucasus 4 French France 16 French Basque France 12 Sardinian Italy 12 Russian Russia 12 Orcadian Orkney Islands 14 North Italian Italy (Bergamo) 16 East Asia: Japanese Japan 42 Han China 4 Tujia China 4 Yizu China 4 Miaozu China 4 Orogen China 4 Daur China 4 Mongola China 4 Hezhen China 4 Xibo China 4 Uygur China 4 Dai China 4 a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. Molecular Analyses The sequenced fragments of the CD209/CD209L genomic region are shown in figure 1. The entire CD209 region (including exons, introns, and ∼1 kb of the 5′ UTR corresponding to the promoter region) was sequenced, for a total of 5.5 kb per individual. 
For CD209L, we sequenced a total of ∼5.4 kb per individual, following the same approach used for CD209, with the exception of the neck region. That region was genotyped for its number of repeats, since it turned out to be highly polymorphic, which prevented the sequencing process. Genotyping was performed by a single PCR amplification followed by migration in 2% agarose gels. Human primers were used to both amplify and sequence the orthologous regions in chimpanzees. However, because of polymorphisms specific to the chimpanzee lineage, we could not obtain the entirety of the sequence. Thus, 4.9 kb (90% of the total) of the chimpanzee CD209 sequence were obtained, and 5.3 kb (98% of the total) of CD209L. Detailed information on primer sequences and PCR amplification conditions is available on request. All nucleotide sequences were obtained using the Big Dye terminator kit and the 3100 automated sequencer from Applied Biosystems. Sequence files and chromatograms were inspected using the GENALYS software (Takahashi et al. 2003; Centre National de Genotypage). As a measure of quality control, when new mutations were identified in primer binding regions, new primers were designed and sequence reactions were repeated, to avoid allele-specific amplification. All singletons observed in our data set were systematically reamplified and resequenced. Statistical Analyses On the basis of the levels of diversity observed in the CD209/CD209L genomic region, we calculated the average number of pairwise differences (π) and Watterson's estimator (θw) (Watterson 1975). Under the standard neutral model of a randomly mating population of constant size, these are unbiased estimators of the population mutation rate θ = 4Neμ, where Ne is the diploid effective population size and μ is the mutation rate per generation per site. 
To test whether the frequency spectrum of mutations conformed to the expectations of this standard neutral model, we calculated Tajima's D (Tajima 1989) and Fay and Wu's H (Fay and Wu 2000). P values for the different tests were estimated from 10⁴ coalescent simulations under an infinite-site model, with use of a fixed number of segregating sites and the assumption of no recombination, which has been shown to be a conservative assumption (Gilad et al. 2002). In parallel, we estimated P values for all these tests, using the empirical distribution obtained from sequencing data of 132 genes in a panel of 24 African Americans and 23 European Americans (Akey et al. 2004). All these analyses, together with the interspecies McDonald-Kreitman (McDonald and Kreitman 1991) and KA/KS (Kimura 1968) tests, were performed using the DnaSP package (Rozas et al. 2003). Genetic distances between populations (FST) and heterozygosity values were estimated using the Arlequin package (Schneider et al. 2000). FST statistical significance was assessed using 10,000 bootstrap replications. To test for a deficit or an excess of heterozygosity in the neck region of CD209 and CD209L, we used BOTTLENECK (Cornuet and Luikart 1996) to compute, for each geographic region, the distribution of the heterozygosity expected from the observed number of alleles, given the sample size (n), under the assumption of mutation-drift equilibrium. This distribution was obtained through simulation of the coalescent process of n genes under two mutational models, the infinite-site model and the stepwise mutation model. In addition, to obtain information on the fraction of genetic variance in the neck region that is due to intra- and interpopulation differences, we performed an analysis of molecular variance (AMOVA), using the Arlequin package (Schneider et al. 2000). The AMOVA results were compared with those of 377 microsatellites analyzed in the same population panel (Rosenberg et al. 2002). 
Haplotype reconstruction was performed by use of the Bayesian statistical method implemented in Phase (v.2.1.1) (Stephens and Donnelly 2003). We applied the algorithm five times, using different randomly generated seeds, and consistent results were obtained across runs. After haplotype reconstruction, linkage disequilibrium (LD) between pairs of SNPs was computed using Lewontin's D′ index (Lewontin 1964). For this analysis, only markers presenting a minor allele frequency (MAF) of at least 10% were considered, since rare alleles have been shown to present a higher probability of being in significant LD than do common ones (Reich et al. 2001). The graphic display of the LD plots was constructed using GOLD (Abecasis and Cookson 2000; Center for Statistical Genetics). To test for the existence of a recombination hotspot in the region under study, we used the hotspot-recombination model implemented in Phase (v.2.1.1). Under this model, we assumed that there was, at most, one hotspot of unknown position. We then estimated the background population-recombination rate (ρ) and the relative intensity of any recombination hotspot. To obtain better estimates, we increased the number of iterations of the final run of the algorithm 10-fold. All our estimations were obtained by averaging results of five independent runs with use of different seed numbers. Since the model used is Bayesian, we could also estimate, for each population, the posterior probability of a hotspot of intensity >1 (λ>1) and >10 (λ>10). We obtained the gene tree and estimated the time of the most recent common ancestor (TMRCA) for CD209, using the maximum-likelihood coalescent method implemented in GENETREE (Griffiths and Tavare 1994). The mutation rate μ for each gene was estimated on the basis of the net divergence between humans and chimpanzees, under the assumptions that the species separation occurred 5 million years ago (MYA) and that the generation time is 20 years. 
Using this μ and θ maximum likelihood (θML), we estimated the effective population size parameter (Ne). With the assumption of a generation time of 20 years and the estimated Ne, the coalescence time, scaled in 2Ne units, was converted into years. The coalescence process implemented in SIMCOAL2 (Laval and Excoffier 2004) allowed us to estimate the probability of the TMRCA for CD209, through 2×10⁴ simulations, with use of both the number of observed segregating sites and the estimated Ne. |
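The summary statistics and the time conversion described in the Statistical Analyses can be illustrated with a minimal Python sketch. It is not the DnaSP or GENETREE implementation: the function names, the toy haplotypes, and the representation of haplotypes as equal-length strings are assumptions made for illustration only. The factor of 2 in the divergence relation (substitutions accumulating along both lineages since the split) is the standard textbook form and is assumed here rather than taken from the text; θ and μ must be expressed on the same scale (both per site or both per locus).

```python
import math
from collections import Counter


def segregating_sites(haps):
    """Number of alignment columns carrying more than one allele (S)."""
    return sum(1 for col in zip(*haps) if len(set(col)) > 1)


def pairwise_diversity(haps):
    """Average number of pairwise differences between sequences (pi)."""
    n = len(haps)
    n_pairs = n * (n - 1) / 2
    diff_pairs = 0.0
    for col in zip(*haps):
        # Pairs that share an allele at this site do not count as a difference.
        same = sum(c * (c - 1) / 2 for c in Counter(col).values())
        diff_pairs += n_pairs - same
    return diff_pairs / n_pairs


def watterson_theta(haps):
    """Watterson's estimator: S divided by the harmonic number a1."""
    a1 = sum(1 / i for i in range(1, len(haps)))
    return segregating_sites(haps) / a1


def tajimas_d(haps):
    """Tajima's D (Tajima 1989); returns NaN when there is no variation."""
    n, s = len(haps), segregating_sites(haps)
    if s == 0:
        return float("nan")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pairwise_diversity(haps) - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def mutation_rate_per_generation(net_divergence, split_years=5e6, generation_years=20):
    """mu from human-chimpanzee net divergence, assuming d = 2 * mu * (T / g)."""
    return net_divergence / (2 * split_years / generation_years)


def effective_size(theta, mu):
    """Ne from theta = 4 * Ne * mu."""
    return theta / (4 * mu)


def coalescent_units_to_years(t_scaled, ne, generation_years=20):
    """Convert a time scaled in units of 2Ne generations into years."""
    return t_scaled * 2 * ne * generation_years


# Toy haplotypes (hypothetical data, not from the study).
toy = ["ACGT", "ACGA", "ACTA", "ATTA"]
```

With real data, the haplotypes would be the phased sequences of one population sample, and the net divergence and θ would come from the same gene, so that the resulting Ne and the converted coalescence time refer to that locus.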
title | Material and Methods |
sec | Population Samples Sequence variation of the CD209/CD209L region was determined in 41 sub-Saharan Africans, 43 Europeans, and 43 East Asians, in a total of 254 chromosomes from the Human Genome Diversity Panel (HGDP)–CEPH panel (Cann et al. 2002). More-detailed information about the composition of the three major ethnic groups can be found in table 1. The variation in the repeat number of the neck region of CD209 and CD209L was defined in the entire HGDP-CEPH panel, comprising 1,064 DNA samples from 52 worldwide populations. In addition, the orthologous regions for both genes were sequenced in four chimpanzees (Pan troglodytes). Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity Location and Population Geographic Origin No. of Chromosomes Africa: Biaka Pygmies Central African Republic 12 Mbuti Pygmies Democratic Republic of Congo 10 Bantu Fang^a Gabon 12 Bantu, northeastern Kenya 10 Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 Mandenka Senegal 12 San Namibia 8 Yoruba Nigeria 10 Europe: Adygei Russian Caucasus 4 French France 16 French Basque France 12 Sardinian Italy 12 Russian Russia 12 Orcadian Orkney Islands 14 North Italian Italy (Bergamo) 16 East Asia: Japanese Japan 42 Han China 4 Tujia China 4 Yizu China 4 Miaozu China 4 Orogen China 4 Daur China 4 Mongola China 4 Hezhen China 4 Xibo China 4 Uygur China 4 Dai China 4 a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
title | Population Samples |
p | Sequence variation of the CD209/CD209L region was determined in 41 sub-Saharan Africans, 43 Europeans, and 43 East Asians, in a total of 254 chromosomes from the Human Genome Diversity Panel (HGDP)–CEPH panel (Cann et al. 2002). More-detailed information about the composition of the three major ethnic groups can be found in table 1. The variation in the repeat number of the neck region of CD209 and CD209L was defined in the entire HGDP-CEPH panel, comprising 1,064 DNA samples from 52 worldwide populations. In addition, the orthologous regions for both genes were sequenced in four chimpanzees (Pan troglodytes). Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity Location and Population Geographic Origin No. of Chromosomes Africa: Biaka Pygmies Central African Republic 12 Mbuti Pygmies Democratic Republic of Congo 10 Bantu Fang^a Gabon 12 Bantu, northeastern Kenya 10 Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 Mandenka Senegal 12 San Namibia 8 Yoruba Nigeria 10 Europe: Adygei Russian Caucasus 4 French France 16 French Basque France 12 Sardinian Italy 12 Russian Russia 12 Orcadian Orkney Islands 14 North Italian Italy (Bergamo) 16 East Asia: Japanese Japan 42 Han China 4 Tujia China 4 Yizu China 4 Miaozu China 4 Orogen China 4 Daur China 4 Mongola China 4 Hezhen China 4 Xibo China 4 Uygur China 4 Dai China 4 a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
table-wrap | Table 1 Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity Location and Population Geographic Origin No. of Chromosomes Africa: Biaka Pygmies Central African Republic 12 Mbuti Pygmies Democratic Republic of Congo 10 Bantu Fang^a Gabon 12 Bantu, northeastern Kenya 10 Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 Mandenka Senegal 12 San Namibia 8 Yoruba Nigeria 10 Europe: Adygei Russian Caucasus 4 French France 16 French Basque France 12 Sardinian Italy 12 Russian Russia 12 Orcadian Orkney Islands 14 North Italian Italy (Bergamo) 16 East Asia: Japanese Japan 42 Han China 4 Tujia China 4 Yizu China 4 Miaozu China 4 Orogen China 4 Daur China 4 Mongola China 4 Hezhen China 4 Xibo China 4 Uygur China 4 Dai China 4 a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
label | Table 1 |
caption | Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity |
p | Individual Composition of the Study Populations for CD209 and CD209L Sequence Diversity |
table | Location and Population Geographic Origin No. of Chromosomes Africa: Biaka Pygmies Central African Republic 12 Mbuti Pygmies Democratic Republic of Congo 10 Bantu Fang^a Gabon 12 Bantu, northeastern Kenya 10 Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 Mandenka Senegal 12 San Namibia 8 Yoruba Nigeria 10 Europe: Adygei Russian Caucasus 4 French France 16 French Basque France 12 Sardinian Italy 12 Russian Russia 12 Orcadian Orkney Islands 14 North Italian Italy (Bergamo) 16 East Asia: Japanese Japan 42 Han China 4 Tujia China 4 Yizu China 4 Miaozu China 4 Orogen China 4 Daur China 4 Mongola China 4 Hezhen China 4 Xibo China 4 Uygur China 4 Dai China 4 |
tr | Location and Population Geographic Origin No. of Chromosomes |
th | Location and Population |
th | Geographic Origin |
th | No. of Chromosomes |
tr | Africa: |
td | Africa: |
tr | Biaka Pygmies Central African Republic 12 |
td | Biaka Pygmies |
td | Central African Republic |
td | 12 |
tr | Mbuti Pygmies Democratic Republic of Congo 10 |
td | Mbuti Pygmies |
td | Democratic Republic of Congo |
td | 10 |
tr | Bantu Fang^a Gabon 12 |
td | Bantu Fang^a |
td | Gabon |
td | 12 |
tr | Bantu, northeastern Kenya 10 |
td | Bantu, northeastern |
td | Kenya |
td | 10 |
tr | Bantu, southeastern Tswana South Africa (Bantu, southeastern) 4 |
td | Bantu, southeastern Tswana |
td | South Africa (Bantu, southeastern) |
td | 4 |
tr | Bantu, southwestern Herero South Africa (Bantu, southwestern) 4 |
td | Bantu, southwestern Herero |
td | South Africa (Bantu, southwestern) |
td | 4 |
tr | Mandenka Senegal 12 |
td | Mandenka |
td | Senegal |
td | 12 |
tr | San Namibia 8 |
td | San |
td | Namibia |
td | 8 |
tr | Yoruba Nigeria 10 |
td | Yoruba |
td | Nigeria |
td | 10 |
tr | Europe: |
td | Europe: |
tr | Adygei Russian Caucasus 4 |
td | Adygei |
td | Russian Caucasus |
td | 4 |
tr | French France 16 |
td | French |
td | France |
td | 16 |
tr | French Basque France 12 |
td | French Basque |
td | France |
td | 12 |
tr | Sardinian Italy 12 |
td | Sardinian |
td | Italy |
td | 12 |
tr | Russian Russia 12 |
td | Russian |
td | Russia |
td | 12 |
tr | Orcadian Orkney Islands 14 |
td | Orcadian |
td | Orkney Islands |
td | 14 |
tr | North Italian Italy (Bergamo) 16 |
td | North Italian |
td | Italy (Bergamo) |
td | 16 |
tr | East Asia: |
td | East Asia: |
tr | Japanese Japan 42 |
td | Japanese |
td | Japan |
td | 42 |
tr | Han China 4 |
td | Han |
td | China |
td | 4 |
tr | Tujia China 4 |
td | Tujia |
td | China |
td | 4 |
tr | Yizu China 4 |
td | Yizu |
td | China |
td | 4 |
tr | Miaozu China 4 |
td | Miaozu |
td | China |
td | 4 |
tr | Orogen China 4 |
td | Orogen |
td | China |
td | 4 |
tr | Daur China 4 |
td | Daur |
td | China |
td | 4 |
tr | Mongola China 4 |
td | Mongola |
td | China |
td | 4 |
tr | Hezhen China 4 |
td | Hezhen |
td | China |
td | 4 |
tr | Xibo China 4 |
td | Xibo |
td | China |
td | 4 |
tr | Uygur China 4 |
td | Uygur |
td | China |
td | 4 |
tr | Dai China 4 |
td | Dai |
td | China |
td | 4 |
table-wrap-foot | a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
footnote | a These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
label | a |
p | These individuals are not included in the HGDP-CEPH panel but are part of the authors' collection. |
sec | Molecular Analyses The sequenced fragments of the CD209/CD209L genomic region are shown in figure 1. The entire CD209 region—including exons, introns, and ∼1 kb of the 5′ UTR corresponding to the promoter region—was sequenced, for a total of 5.5 kb per individual. For CD209L, we sequenced a total of ∼5.4 kb per individual, following the same approach used for CD209, with the exception of the neck region. That region was genotyped for its number of repeats, since it turned out to be highly polymorphic, which prevented the sequencing process. Genotyping was performed by a single PCR amplification followed by migration in 2% agarose gels. Human primers were used to both amplify and sequence the orthologous regions in chimpanzees. However, because of polymorphisms specific to the chimpanzee lineage, we could not obtain the entirety of the sequence. Thus, 4.9 kb (90% of the total) of the chimpanzee CD209 sequence were obtained, and 5.3 kb (98% of the total) of CD209L. Detailed information on primer sequences and PCR amplification conditions is available on request. All nucleotide sequences were obtained using the Big Dye terminator kit and the 3100 automated sequencer from Applied Biosystems. Sequence files and chromatograms were inspected using the GENALYS software (Takahashi et al. 2003; Centre National de Genotypage). As a measure of quality control, when new mutations were identified in primer binding regions, new primers were designed and sequence reactions were repeated, to avoid allele-specific amplification. All singletons observed in our data set were systematically reamplified and resequenced. |
title | Molecular Analyses |
p | The sequenced fragments of the CD209/CD209L genomic region are shown in figure 1. The entire CD209 region—including exons, introns, and ∼1 kb of the 5′ UTR corresponding to the promoter region—was sequenced, for a total of 5.5 kb per individual. For CD209L, we sequenced a total of ∼5.4 kb per individual, following the same approach used for CD209, with the exception of the neck region. That region was genotyped for its number of repeats, since it turned out to be highly polymorphic, which prevented the sequencing process. Genotyping was performed by a single PCR amplification followed by migration in 2% agarose gels. Human primers were used to both amplify and sequence the orthologous regions in chimpanzees. However, because of polymorphisms specific to the chimpanzee lineage, we could not obtain the entirety of the sequence. Thus, 4.9 kb (90% of the total) of the chimpanzee CD209 sequence were obtained, and 5.3 kb (98% of the total) of CD209L. Detailed information on primer sequences and PCR amplification conditions is available on request. All nucleotide sequences were obtained using the Big Dye terminator kit and the 3100 automated sequencer from Applied Biosystems. Sequence files and chromatograms were inspected using the GENALYS software (Takahashi et al. 2003; Centre National de Genotypage). As a measure of quality control, when new mutations were identified in primer binding regions, new primers were designed and sequence reactions were repeated, to avoid allele-specific amplification. All singletons observed in our data set were systematically reamplified and resequenced. |
title | Statistical Analyses |
p | On the basis of the levels of diversity observed in the CD209/CD209L genomic region, we calculated the average number of pairwise differences (π) and Watterson's estimator ($\theta_w$) (Watterson 1975). Under the standard neutral model of a randomly mating population of constant size, these are unbiased estimators of the population mutation rate $\theta = 4N_e\mu$, where $N_e$ is the diploid effective population size and μ is the mutation rate per generation per site. To test whether the frequency spectrum of mutations conformed to the expectations of this standard neutral model, we calculated Tajima's D (Tajima 1989) and Fay and Wu's H (Fay and Wu 2000). P values for the different tests were estimated from $10^4$ coalescent simulations under an infinite-site model, with use of a fixed number of segregating sites and the assumption of no recombination, which has been shown to be a conservative assumption (Gilad et al. 2002). In parallel, we estimated P values for all these tests, using the empirical distribution obtained from sequencing data of 132 genes in a panel of 24 African Americans and 23 European Americans (Akey et al. 2004). All these analyses, together with the interspecies McDonald-Kreitman (McDonald and Kreitman 1991) and $K_A/K_S$ (Kimura 1968) tests, were performed using the DnaSP package (Rozas et al. 2003). Genetic distances between populations ($F_{ST}$) and heterozygosity values were estimated using the Arlequin package (Schneider et al. 2000). The statistical significance of $F_{ST}$ was assessed using 10,000 bootstrap replications. To test for a deficit or an excess of heterozygosity in the neck region of CD209 and CD209L, we used BOTTLENECK (Cornuet and Luikart 1996) to compute, for each geographic region, the distribution of the heterozygosity expected from the observed number of alleles, given the sample size (n), under the assumption of mutation-drift equilibrium. This distribution was obtained through simulation of the coalescent process of n genes under two mutational models, the infinite-site model and the stepwise mutation model. In addition, to obtain information on the fraction of genetic variance in the neck region that is due to intra- and interpopulation differences, we performed an analysis of molecular variance (AMOVA), using the Arlequin package (Schneider et al. 2000). The AMOVA results were compared with those of 377 microsatellites analyzed in the same population panel (Rosenberg et al. 2002). |
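As an illustration of the summary statistics named above, the following minimal sketch computes the number of segregating sites, π, Watterson's estimator, and Tajima's D (with the variance constants of Tajima 1989) from a set of aligned haplotypes. It is not the DnaSP implementation used in the study; the function name `diversity_stats` and the toy input are assumptions made for this example, and the values returned are per locus (dividing π and θw by the sequence length gives the per-base-pair values).

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return (S, pi, theta_w, tajimas_d) for aligned, equal-length haplotype strings.

    Hypothetical helper for illustration; values are per locus, not per site.
    """
    n = len(haplotypes)
    length = len(haplotypes[0])

    # Segregating sites: alignment columns carrying more than one allele.
    seg_sites = [i for i in range(length) if len({h[i] for h in haplotypes}) > 1]
    s = len(seg_sites)

    # pi: average number of differences over all pairs of haplotypes.
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(sum(a[i] != b[i] for i in seg_sites) for a, b in pairs)
    pi = total_diffs / len(pairs)

    # Watterson's estimator: S divided by the harmonic number a1.
    a1 = sum(1.0 / i for i in range(1, n))
    theta_w = s / a1

    if s == 0:
        return s, pi, theta_w, None  # D is undefined without segregating sites

    # Variance constants from Tajima (1989).
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    tajimas_d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, theta_w, tajimas_d


# Toy data (not from the study): four haplotypes, four sites.
s, pi, theta_w, d = diversity_stats(["AAAA", "AAAT", "AATT", "ATTT"])
```

A positive D, as in this toy sample, reflects an excess of intermediate-frequency variants (π larger than θw); a negative D reflects an excess of rare variants.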
p | Haplotype reconstruction was performed by use of the Bayesian statistical method implemented in Phase (v.2.1.1) (Stephens and Donnelly 2003). We applied the algorithm five times, using different randomly generated seeds, and consistent results were obtained across runs. After haplotype reconstruction, linkage disequilibrium (LD) between pairs of SNPs was computed using Lewontin's D′ index (Lewontin 1964). For this analysis, only markers presenting a minor allele frequency (MAF) of at least 10% were considered, since rare alleles have been shown to present a higher probability of being in significant LD than do common ones (Reich et al. 2001). The graphic display of the LD plots was constructed using GOLD (Abecasis and Cookson 2000; Center for Statistical Genetics). To test for the existence of a recombination hotspot in the region under study, we used the hotspot-recombination model implemented in Phase (v.2.1.1). Under this model, we assumed that there was, at most, one hotspot of unknown position. We then estimated the background population-recombination rate (ρ) and the relative intensity of any recombination hotspot (λ). To obtain better estimates, we increased the number of iterations of the final run of the algorithm 10-fold. All our estimates were obtained by averaging the results of five independent runs that used different seed numbers. Since the model used is Bayesian, we could also estimate, for each population, the posterior probability of a hotspot of intensity >1 (λ>1) and >10 (λ>10). |
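The pairwise LD measure described above can be sketched as follows. This is a minimal illustration of Lewontin's D′ for two biallelic SNPs on phased haplotypes, together with the MAF filter; it is not the procedure run in GOLD, and the function names and the 0/1 allele coding are assumptions made for this example.

```python
def minor_allele_frequency(alleles):
    """Frequency of the rarer allele at a biallelic site coded 0/1."""
    p = sum(alleles) / len(alleles)
    return min(p, 1.0 - p)


def lewontin_d_prime(site_a, site_b):
    """Signed D' between two biallelic sites on the same phased haplotypes.

    site_a and site_b are equal-length lists of 0/1 alleles, one entry per
    chromosome. Returns None when either site is monomorphic.
    """
    n = len(site_a)
    p_a = sum(site_a) / n                      # frequency of allele 1 at site A
    p_b = sum(site_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                       # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None
    return d / d_max


# Toy data (not from the study): eight chromosomes, two sites in complete LD.
snp1 = [1, 1, 1, 0, 0, 0, 0, 0]
snp2 = [1, 1, 1, 0, 0, 0, 0, 0]
if min(minor_allele_frequency(snp1), minor_allele_frequency(snp2)) >= 0.10:
    d_prime = lewontin_d_prime(snp1, snp2)
```

The absolute value |D′| is what is usually plotted; it equals 1 whenever at least one of the four possible two-site haplotypes is absent from the sample.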
p | We obtained the gene tree and estimated the time of the most recent common ancestor ($T_{MRCA}$) for CD209, using the maximum-likelihood coalescent method implemented in GENETREE (Griffiths and Tavare 1994). The mutation rate μ for each gene was estimated on the basis of the net divergence between humans and chimpanzees, under the assumptions that the species separation occurred 5 million years ago (MYA) and that the generation time is 20 years. Using this μ and the maximum-likelihood estimate of θ ($\theta_{ML}$), we estimated the effective population size parameter ($N_e$). With the assumption of a generation time of 20 years and the estimated $N_e$, the coalescence time, scaled in units of $2N_e$ generations, was converted into years. The coalescence process implemented in SIMCOAL2 (Laval and Excoffier 2004) allowed us to estimate the probability of the $T_{MRCA}$ for CD209, through $2\times10^4$ simulations, with use of both the number of observed segregating sites and the estimated $N_e$. |
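The arithmetic behind these conversions can be sketched in a few lines. The input values below (a net divergence of 0.0157 substitutions per nucleotide, θML = 8.4, and a mutation rate of 1.54×10−4 per gene per generation) are those reported for CD209 in the Results; the function names are assumptions made for this example, and dividing the divergence by twice the split time (one branch per lineage) is an assumption that reproduces the reported per-site rate.

```python
def substitution_rate_per_site_per_year(net_divergence_per_site, split_time_years):
    """Per-year rate, assuming divergence accumulates along both lineages."""
    return net_divergence_per_site / (2.0 * split_time_years)


def effective_population_size(theta, mu_per_generation):
    """Solve theta = 4 * N_e * mu for N_e (diploid population)."""
    return theta / (4.0 * mu_per_generation)


def coalescent_time_to_years(t_scaled, n_e, generation_years=20.0):
    """Convert a time expressed in units of 2 * N_e generations into years."""
    return t_scaled * 2.0 * n_e * generation_years


# Values reported for CD209 in the Results section.
rate = substitution_rate_per_site_per_year(0.0157, 5e6)   # about 1.57e-9
n_e = effective_population_size(8.4, 1.54e-4)             # about 13,636
```

With these inputs the sketch recovers the reported substitution rate and effective population size; `coalescent_time_to_years` then rescales any GENETREE time estimate with the same N_e and a 20-year generation time.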
sec | Results We determined sequence diversity in the CD209 and CD209L genes (fig. 1) as well as length variation of the neck region in 254 chromosomes originating from three major ethnic groups: sub-Saharan Africans, Europeans, and East Asians. In addition, the orthologous sequences were obtained in four chimpanzees, to infer the ancestral state at each site, to estimate the divergence between humans and chimpanzees, and to perform a number of interspecies neutrality tests. Patterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region For CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) achieved a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East-Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing. The allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups is illustrated in figure 2 , along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2 ). 
Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or even absent (H3) in Africa. However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions. Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. 
For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion.
Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations
Gene and Population; No. of Chromosomes; No. of Segregating Sites; No. of Haplotypes; HD(a) ± SD; π(b) ± SD; θw(c); Tajima's D; Fay and Wu's H
CD209:
African; 82; 70; 26; 91.8 ± 1.6; 26 ± 3.8; 25.3; −.05; −19.45(d)
European; 86; 18; 14; 79.6 ± 3.0; 6.4 ± .6; 6.5; −.04; −.26
East Asian; 86; 12; 11; 56.7 ± 5.5; 3.3 ± .5; 4.3; −.65; −3.82(d)
Total; 254; 79; 42; 84.5 ± 1.6; 13 ± 1.7; 23.3
CD209L:
African; 82; 51; 40; 94.9 ± 1.2; 16.1 ± .9; 18.7; −.49; −1.52
European; 86; 29; 23; 88.8 ± 1.9; 17.7 ± 1.0; 10.5; 2.01(e); −.61
East Asian; 86; 27; 19; 86.4 ± 1.8; 16.0 ± .5; 9.8; 1.85(d); −.43
Total; 254; 63; 74; 93.6 ± .7; 17.7 ± .5; 18.8
Note: The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively.
(a) HD = haplotype diversity (%).
(b) Nucleotide diversity per base pair ($\times 10^{-4}$).
(c) Watterson's estimator per base pair ($\times 10^{-4}$).
(d) .02<P≤.05.
(e) P≤.02.
To assess the degree of population differentiation, if any, we computed Wright's $F_{ST}$ (Wright 1931), using haplotype frequencies. $F_{ST}$ estimates were significant (P<.0001) for all population comparisons, indicating continental differentiation for both CD209 and CD209L. 
However, substantial differences were observed between the two genes: the overall F ST for CD209 among Africans, Europeans, and East-Asians was 0.15, whereas CD209L presented a threefold lower F ST value of 0.05. For both genes, the larger F ST values were observed between African and East Asian populations, with F ST values of 0.22 for CD209 and 0.07 for CD209L. Levels of Polymorphism and Divergence between Humans and Chimpanzees The average nucleotide diversity (π) was strikingly different, both between the two genes and among populations (table 2). Globally, π values were three- to fivefold lower for CD209 (3–7×10−4) than for CD209L (∼16 × 10−4), except for African populations, for whom the CD209 π value was unusually high (26×10−4) because of the presence of the highly divergent cluster A. Indeed, when cluster A was excluded from the analysis, the African π value dropped to 8×10−4. To estimate the substitution rate of each region and evince possible mutational differences that could explain the strong contrast observed in nucleotide-diversity patterns, we determined the human-chimpanzee divergence for both genes. The average net number of differences between the two species was 77.3 substitutions (or 0.0157 substitutions per nucleotide) for CD209 and 90.6 substitutions (or 0.0171 substitutions per nucleotide) for CD209L. Since the human-chimpanzee speciation occurred 5 MYA, we obtained similar nucleotide-substitution rates per site per year (CD209, 1.57×10−9; CD209L, 1.70×10−9). LD To assess the patterns of LD in the CD209/CD209L region, haplotypes for the entire genomic region were reconstructed using markers with an MAF of 10%. D′ measures among these markers were estimated for African and non-African populations independently; the graphical representation of LD levels is illustrated in figure 3 . Two distinct regions, which correspond to either CD209 or CD209L, showed strong LD and are separated by a boundary that corresponds to the intergenic region. 
For CD209, a block of intragenic LD was observed in both African and non-African populations. For the African sample, 89% of all pairwise comparisons indicated significant levels of LD, whereas, for non-Africans, all D′ pairwise comparisons were significant. The magnitude of intragenic recombination (and/or gene conversion) of CD209L was slightly higher than for CD209. Nevertheless, considerable and significant levels of LD were observed between sites: 83% of all LD pairwise comparisons were significant in the African group, and 99% were in the non-African sample. Overall, CD209 exhibited a blocklike structure in both groups, whereas CD209L presented lower—although mostly significant—LD levels, in particular among the non-African sample. Figure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. The strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in Phase (v.2.1.1) (fig. 4 ). 
This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to absence of recombination-rate variation, whereas λ values >1 indicate the presence of a hotspot. The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ>1), and of at least 10 times the background recombination rate, Pr(λ>10). Pr(λ>1) was 100% for all population groups, and Pr(λ>10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remains unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot. Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. Neutrality Tests The identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. 
We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2). To evaluate the selective pressures at the protein level, we performed two interspecies tests: $K_A/K_S$, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the $K_A/K_S$ test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively. For the McDonald-Kreitman test, the hypothesis of neutrality was rejected only for CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3).
Table 3 McDonald-Kreitman Test Results
No. of Substitutions and P Value for Exonic Region Only and for Entire Sequence(a)
Gene and Type of Site; Exonic Synonymous; Exonic Nonsynonymous; Exonic P; Entire Synonymous; Entire Nonsynonymous; Entire P
CD209: ; ; ; .04; ; ; .009
Fixed; 4; 5; ; 51; 5;
Polymorphic; 6; 0; ; 86; 0;
CD209L: ; ; ; .23; ; ; 1
Fixed; 5; 6; ; 78; 6;
Polymorphic; 0; 4; ; 65; 4;
Note: The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics.
(a) Mutations in introns are considered synonymous.
Neck-Region Length Variation in Worldwide Populations The identical genomic organization of CD209 and CD209L is extended to the neck region, which, in both genes, encodes a track of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers. Of the seven alleles observed (from 4–10-repeat allele size classes), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5 table 5). 
Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations. Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations CD209 CD209L Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats Location and Population Geographic Origin No. 
of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 San Namibia 14 100 35.71 64.29 .71 Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 French France 58 100 48.28 12.07 36.21 3.45 .55 French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 Uygur China 20 100 10.00 65.00 5.00 20.00 .50 East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45 Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48 Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 NAN Melanesian Bougainville 44 100 6.82 
15.91 31.82 29.55 15.91 .77 Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 Karitiana Brazil 48 100 4.17 56.25 39.58 .54 Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 Pima Mexico 50 100 8.00 64.00 28.00 .36 Mayan Mexico 50 100 16.00 44.00 40.00 .56 Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 a Heterozygosity values. b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L Findings for Neck Regions of CD209 CD209L Heterozygosity P Heterozygosity P Population Observed Expecteda ISMb SMMc Observed Expecteda ISMb SMMc African 1.6 27.9 .030 .000 50 37 .328 .229 European .6 15.3 .158 .094 62 44 .179 .304 Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 East Asian 1.2 34.5 .003 .000 47 42 .472 .054 Oceanian .0 … … … 72 53 .071 .337 American 2.8 16.3 .323 .205 45 29 .273 .440 Total sample 2.0 49.7 .002 .000 54 47 .405 .013 Note.— We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations were observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. a Under the infinite-site model. b Probability of the observed heterozygosity under the infinite-site model. c Probability of the observed heterozygosity under the stepwise mutational model. 
Time of the Most Recent Common Ancestor for CD209 The low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavare 1994) for estimation of the time scale of the origin and evolution of this gene. Because this method assumes an infinite-site model without recombination, the same analysis was not conducted for CD209L, in which a substantial number of recombinant haplotypes were observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6. The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed on both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples are restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ ($\theta_{ML}$) for CD209 was 8.4. On the basis of this $\theta_{ML}$ value and the estimated mutation rate ($1.54\times10^{-4}$ per gene per generation), the effective population size ($N_e$) was 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The $T_{MRCA}$ of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest $T_{MRCA}$ values estimated so far in the human genome (Excoffier 2002). Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
title | Results |
p | We determined sequence diversity in the CD209 and CD209L genes (fig. 1) as well as length variation of the neck region in 254 chromosomes originating from three major ethnic groups: sub-Saharan Africans, Europeans, and East Asians. In addition, the orthologous sequences were obtained in four chimpanzees, to infer the ancestral state at each site, to estimate the divergence between humans and chimpanzees, and to perform a number of interspecies neutrality tests. |
sec | Patterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region For CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) reached a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These four mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing. The allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups are illustrated in figure 2, along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2). Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or were even absent (H3) in Africa. 
However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions. Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. 
Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion. Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations Gene and Population No. of Chromosomes No. of Segregating Sites No. of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H CD209: African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26 East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3 CD209L: African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52 European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61 East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43 Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8 Note.— The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. a HD = haplotype diversity (%). b Nucleotide diversity per base pair (×10−4). c Watterson's estimator per base pair (×10−4). d .02<P≤.05. e P≤.02. To assess the degree of population differentiation, if any, we computed Wright's FST (Wright 1931), using haplotype frequencies. FST estimates were significant (P<.0001) for all population comparisons, indicating continental differentiation for both CD209 and CD209L. However, substantial differences were observed between the two genes: the overall FST for CD209 among Africans, Europeans, and East Asians was 0.15, whereas CD209L presented a threefold lower FST value of 0.05. For both genes, the largest FST values were observed between African and East Asian populations, with FST values of 0.22 for CD209 and 0.07 for CD209L. |
title | Patterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region |
p | For CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) reached a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These four mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing. |
p | The allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups are illustrated in figure 2, along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2). Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or were even absent (H3) in Africa. However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 
2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions. Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion. Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations Gene and Population No. of Chromosomes No. of Segregating Sites No. of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H CD209: African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26 East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3 CD209L: African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52 European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61 East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43 Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8 Note.— The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. a HD = haplotype diversity (%). b Nucleotide diversity per base pair (×10−4). c Watterson's estimator per base pair (×10−4). d .02<P≤.05. e P≤.02. |
figure | Figure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion. |
label | Figure 2 |
caption | Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion. |
p | Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion. |
table-wrap | Table 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations Gene and Population No. of Chromosomes No. of Segregating Sites No. of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H CD209: African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26 East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3 CD209L: African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52 European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61 East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43 Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8 Note.— The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. a HD = haplotype diversity (%). b Nucleotide diversity per base pair (×10−4). c Watterson's estimator per base pair (×10−4). d .02<P≤.05. e P≤.02. |
label | Table 2 |
caption | Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations |
p | Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations |
table | Gene and Population No. of Chromosomes No. of Segregating Sites No. of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H CD209: African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26 East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3 CD209L: African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52 European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61 East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43 Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8 |
tr | Gene and Population No. of Chromosomes No. of Segregating Sites No. of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H |
th | Gene and Population |
th | No. of Chromosomes |
th | No. of Segregating Sites |
th | No. of Haplotypes |
th | HDa±SD |
th | πb ±SD |
th | θwc |
th | Tajima's D |
th | Fay and Wu's H |
tr | CD209: |
td | CD209: |
tr | African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d |
td | African |
td | 82 |
td | 70 |
td | 26 |
td | 91.8 ± 1.6 |
td | 26 ± 3.8 |
td | 25.3 |
td | −.05 |
td | −19.45d |
tr | European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26 |
td | European |
td | 86 |
td | 18 |
td | 14 |
td | 79.6 ± 3.0 |
td | 6.4 ± .6 |
td | 6.5 |
td | −.04 |
td | −.26 |
tr | East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d |
td | East Asian |
td | 86 |
td | 12 |
td | 11 |
td | 56.7 ± 5.5 |
td | 3.3 ± .5 |
td | 4.3 |
td | −.65 |
td | −3.82d |
tr | Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3 |
td | Total |
td | 254 |
td | 79 |
td | 42 |
td | 84.5 ± 1.6 |
td | 13 ± 1.7 |
td | 23.3 |
tr | CD209L: |
td | CD209L: |
tr | African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52 |
td | African |
td | 82 |
td | 51 |
td | 40 |
td | 94.9 ± 1.2 |
td | 16.1 ± .9 |
td | 18.7 |
td | −.49 |
td | −1.52 |
tr | European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61 |
td | European |
td | 86 |
td | 29 |
td | 23 |
td | 88.8 ± 1.9 |
td | 17.7 ± 1.0 |
td | 10.5 |
td | 2.01e |
td | −.61 |
tr | East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43 |
td | East Asian |
td | 86 |
td | 27 |
td | 19 |
td | 86.4 ± 1.8 |
td | 16.0 ± .5 |
td | 9.8 |
td | 1.85d |
td | −.43 |
tr | Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8 |
td | Total |
td | 254 |
td | 63 |
td | 74 |
td | 93.6 ± .7 |
td | 17.7 ± .5 |
td | 18.8 |
table-wrap-foot | Note.— The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. |
footnote | Note.— |
p | Note.— |
footnote | The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. |
p | The values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively. |
table-wrap-foot | a HD = haplotype diversity (%). |
footnote | a HD = haplotype diversity (%). |
label | a |
p | HD = haplotype diversity (%). |
table-wrap-foot | b Nucleotide diversity per base pair (×10−4). |
footnote | b Nucleotide diversity per base pair (×10−4). |
label | b |
p | Nucleotide diversity per base pair (×10−4). |
table-wrap-foot | c Watterson's estimator per base pair (×10−4). |
footnote | c Watterson's estimator per base pair (×10−4). |
label | c |
p | Watterson's estimator per base pair (×10−4). |
table-wrap-foot | d .02<P≤.05. |
footnote | d .02<P≤.05. |
label | d |
p | .02<P≤.05. |
table-wrap-foot | e P≤.02. |
footnote | e P≤.02. |
label | e |
p | P≤.02. |
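The θw column of Table 2 can be approximated from the number of segregating sites, the number of chromosomes, and the sequence length. The following is a minimal sketch assuming the conventional form of Watterson's estimator, θw = S / (a_n · L) with a_n = Σ 1/i for i = 1 … n−1; the source does not spell out the formula or how indels and excluded sites were handled, so small departures from the tabulated values are to be expected. Function names are illustrative.

```python
def harmonic_sum(n_chromosomes: int) -> float:
    """a_n = sum of 1/i for i = 1 .. n-1 (n = number of sampled chromosomes)."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta_per_site(segregating_sites: int,
                             n_chromosomes: int,
                             sequence_length: int) -> float:
    """Watterson's estimator of theta per base pair."""
    return segregating_sites / (harmonic_sum(n_chromosomes) * sequence_length)


# CD209 rows of Table 2 (5,500 nucleotides analyzed, per the table note).
for label, s, n in [("European", 18, 86), ("East Asian", 12, 86)]:
    theta = watterson_theta_per_site(s, n, 5500)
    print(f"{label}: theta_w = {theta * 1e4:.2f} x 10^-4 per bp")
```

For these two rows the sketch lands close to the tabulated 6.5 and 4.3 (×10−4); other rows differ slightly more, which may reflect site-counting choices not described here.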
p | To assess the degree of population differentiation, if any, we computed Wright's FST (Wright 1931), using haplotype frequencies. FST estimates were significant (P<.0001) for all population comparisons, indicating continental differentiation for both CD209 and CD209L. However, substantial differences were observed between the two genes: the overall FST for CD209 among Africans, Europeans, and East Asians was 0.15, whereas CD209L presented a threefold lower FST value of 0.05. For both genes, the largest FST values were observed between African and East Asian populations, with FST values of 0.22 for CD209 and 0.07 for CD209L. |
sec | Levels of Polymorphism and Divergence between Humans and Chimpanzees The average nucleotide diversity (π) was strikingly different, both between the two genes and among populations (table 2). Globally, π values were three- to fivefold lower for CD209 (3–7×10−4) than for CD209L (∼16×10−4), except for African populations, for which the CD209 π value was unusually high (26×10−4) because of the presence of the highly divergent cluster A. Indeed, when cluster A was excluded from the analysis, the African π value dropped to 8×10−4. To estimate the substitution rate of each region and reveal possible mutational differences that could explain the strong contrast observed in nucleotide-diversity patterns, we determined the human-chimpanzee divergence for both genes. The average net number of differences between the two species was 77.3 substitutions (or 0.0157 substitutions per nucleotide) for CD209 and 90.6 substitutions (or 0.0171 substitutions per nucleotide) for CD209L. Assuming that the human-chimpanzee speciation occurred 5 MYA, we obtained similar nucleotide-substitution rates per site per year (CD209, 1.57×10−9; CD209L, 1.70×10−9). 
title | Levels of Polymorphism and Divergence between Humans and Chimpanzees |
p | The average nucleotide diversity (π) was strikingly different, both between the two genes and among populations (table 2). Globally, π values were three- to fivefold lower for CD209 (3–7×10−4) than for CD209L (∼16×10−4), except for African populations, for which the CD209 π value was unusually high (26×10−4) because of the presence of the highly divergent cluster A. Indeed, when cluster A was excluded from the analysis, the African π value dropped to 8×10−4. To estimate the substitution rate of each region and reveal possible mutational differences that could explain the strong contrast observed in nucleotide-diversity patterns, we determined the human-chimpanzee divergence for both genes. The average net number of differences between the two species was 77.3 substitutions (or 0.0157 substitutions per nucleotide) for CD209 and 90.6 substitutions (or 0.0171 substitutions per nucleotide) for CD209L. Assuming that the human-chimpanzee speciation occurred 5 MYA, we obtained similar nucleotide-substitution rates per site per year (CD209, 1.57×10−9; CD209L, 1.70×10−9). |
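The conversion from divergence to a per-year substitution rate can be sketched as follows. This assumes the usual two-lineage relation, rate = d / (2T), where d is the per-site divergence and T the time since speciation; the text does not state the formula, but it is consistent with the CD209 figure. The function name is illustrative.

```python
def substitution_rate_per_site_per_year(divergence_per_site: float,
                                        split_time_years: float) -> float:
    """Per-site, per-year substitution rate from pairwise divergence.

    The factor of 2 accounts for substitutions accumulating independently
    along both the human and the chimpanzee lineages since the split.
    """
    return divergence_per_site / (2.0 * split_time_years)


SPLIT_TIME_YEARS = 5e6  # human-chimpanzee speciation time used in the text

for gene, divergence in [("CD209", 0.0157), ("CD209L", 0.0171)]:
    rate = substitution_rate_per_site_per_year(divergence, SPLIT_TIME_YEARS)
    print(f"{gene}: {rate:.2e} substitutions per site per year")
```

With the rounded divergences quoted in the text, CD209 gives 1.57×10−9 and CD209L about 1.71×10−9; the reported 1.70×10−9 presumably derives from an unrounded divergence value.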
sec | LD To assess the patterns of LD in the CD209/CD209L region, haplotypes for the entire genomic region were reconstructed using markers with an MAF >10%. D′ measures among these markers were estimated for African and non-African populations independently; the graphical representation of LD levels is illustrated in figure 3. Two distinct regions, which correspond to either CD209 or CD209L, showed strong LD and are separated by a boundary that corresponds to the intergenic region. For CD209, a block of intragenic LD was observed in both African and non-African populations. For the African sample, 89% of all pairwise comparisons indicated significant levels of LD, whereas, for non-Africans, all D′ pairwise comparisons were significant. The magnitude of intragenic recombination (and/or gene conversion) in CD209L was slightly higher than in CD209. Nevertheless, considerable and significant levels of LD were observed between sites: 83% of all LD pairwise comparisons were significant in the African group, and 99% were in the non-African sample. Overall, CD209 exhibited a blocklike structure in both groups, whereas CD209L presented lower, although mostly significant, LD levels, in particular among the non-African sample. Figure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. 
The strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in PHASE (v.2.1.1) (fig. 4). This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to the absence of recombination-rate variation, whereas λ values >1 indicate the presence of a hotspot. The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ>1), and of at least 10 times the background recombination rate, Pr(λ>10). Pr(λ>1) was 100% for all population groups, and Pr(λ>10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remain unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot. Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. |
title | LD |
p | To assess the patterns of LD in the CD209/CD209L region, haplotypes for the entire genomic region were reconstructed using markers with an MAF >10%. D′ measures among these markers were estimated for African and non-African populations independently; the graphical representation of LD levels is illustrated in figure 3. Two distinct regions, which correspond to either CD209 or CD209L, showed strong LD and are separated by a boundary that corresponds to the intergenic region. For CD209, a block of intragenic LD was observed in both African and non-African populations. For the African sample, 89% of all pairwise comparisons indicated significant levels of LD, whereas, for non-Africans, all D′ pairwise comparisons were significant. The magnitude of intragenic recombination (and/or gene conversion) in CD209L was slightly higher than in CD209. Nevertheless, considerable and significant levels of LD were observed between sites: 83% of all LD pairwise comparisons were significant in the African group, and 99% were in the non-African sample. Overall, CD209 exhibited a blocklike structure in both groups, whereas CD209L presented lower, although mostly significant, LD levels, in particular among the non-African sample. Figure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. |
figure | Figure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. |
label | Figure 3 |
caption | Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. |
p | Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF>10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF>10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF>10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF>10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%. |
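For readers unfamiliar with the D′ statistic plotted in figure 3, the sketch below shows how a single pairwise value is conventionally obtained from haplotype and allele frequencies (Lewontin's normalization, reported here as an absolute value). The function name and the example frequencies are hypothetical and are not taken from the study data; the source does not describe its own implementation.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Absolute Lewontin's D' for two biallelic sites.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first site
    p_b:  frequency of allele B at the second site
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one site is monomorphic; D' is undefined
    return abs(d) / d_max


# Hypothetical frequencies: the two alleles always occur together.
print(d_prime(p_ab=0.5, p_a=0.5, p_b=0.5))
# Hypothetical frequencies: the two sites are statistically independent.
print(d_prime(p_ab=0.25, p_a=0.5, p_b=0.5))
```

Normalizing D by its maximum attainable value given the allele frequencies makes pairs of SNPs with different MAFs comparable, which is why plots such as figure 3 are typically drawn in terms of D′ rather than raw D.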
p | The strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in PHASE (v.2.1.1) (fig. 4). This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to the absence of recombination-rate variation, whereas λ values >1 indicate the presence of a hotspot. The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ>1), and of at least 10 times the background recombination rate, Pr(λ>10). Pr(λ>1) was 100% for all population groups, and Pr(λ>10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remain unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot. Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. |
figure | Figure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. |
label | Figure 4 |
caption | Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. |
p | Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ>1 and λ>10 are also reported in the key. |
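p | As an illustration of how posterior probabilities such as Pr(λ>1) and Pr(λ>10) can be summarized, the following minimal sketch computes the fraction of posterior draws of λ that exceed a threshold. It assumes that draws of λ are available as a plain list; the function name and the simulated draws are hypothetical and are not output from Phase or data from this study. |

```python
import random


def posterior_exceedance(samples, threshold):
    """Return the fraction of posterior draws strictly above `threshold`."""
    if not samples:
        raise ValueError("need at least one posterior draw")
    return sum(1 for x in samples if x > threshold) / len(samples)


# Hypothetical posterior draws of the hotspot intensity (lambda).
# These are simulated for illustration only; they are NOT Phase output
# and do not reproduce the values reported in the text.
rng = random.Random(0)
lambda_draws = [rng.lognormvariate(3.0, 0.8) for _ in range(10_000)]

pr_gt_1 = posterior_exceedance(lambda_draws, 1.0)    # hotspot of any kind
pr_gt_10 = posterior_exceedance(lambda_draws, 10.0)  # at least 10x background

print(f"Pr(lambda > 1)  = {pr_gt_1:.3f}")
print(f"Pr(lambda > 10) = {pr_gt_10:.3f}")
```

p | Because every draw above 10 is also above 1, Pr(λ>10) can never exceed Pr(λ>1), which is consistent with the pattern of values reported for the three populations. |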
sec | Neutrality Tests The identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2). To evaluate the selective pressures at the protein level, we performed two interspecies tests: KA/KS, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the KA/KS test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively. For the McDonald-Kreitman test, the hypothesis of neutrality was rejected for only CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3). Table 3 McDonald-Kreitman Test Results No. of Substitutions and P Value for Exonic Region Only Entire Sequencea Gene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P CD209: .04 .009 Fixed 4 5 51 5 Polymorphic 6 0 86 0 CD209L: .23 1 Fixed 5 6 78 6 Polymorphic 0 4 65 4 Note.— The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. 
Significant P values are shown in bold italics. a Mutations in introns are considered synonymous. |
title | Neutrality Tests |
p | The identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2). |
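p | For readers unfamiliar with the statistic, the following minimal sketch computes Tajima's D from a set of aligned haplotypes by use of the standard formula (Tajima 1989). It yields the statistic only; it does not perform the coalescent simulations or the comparison with an empirical distribution used here to assess significance. The function name and the toy haplotypes are hypothetical and are not data from this study. |

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length haplotype strings."""
    n = len(haplotypes)
    if n < 4:
        # The variance term is zero for n = 2 or 3, so D is undefined.
        raise ValueError("need at least four haplotypes")
    if len({len(h) for h in haplotypes}) != 1:
        raise ValueError("haplotypes must have equal length")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("D is undefined without segregating sites")

    # pi: mean number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    ) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy data: only singletons (excess of rare variants) versus two
# equally frequent haplotypes (excess of intermediate frequencies).
d_rare = tajimas_d(["0000", "1000", "0100", "0010"])
d_balanced = tajimas_d(["11", "11", "00", "00"])
print(d_rare, d_balanced)
```

p | In the toy data, an excess of rare variants gives a negative D and an excess of intermediate-frequency variants gives a positive D, which are the two tendencies contrasted for CD209 and CD209L. |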
p | To evaluate the selective pressures at the protein level, we performed two interspecies tests: KA/KS, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the KA/KS test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively. For the McDonald-Kreitman test, the hypothesis of neutrality was rejected for only CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3). Table 3 McDonald-Kreitman Test Results No. of Substitutions and P Value for Exonic Region Only Entire Sequencea Gene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P CD209: .04 .009 Fixed 4 5 51 5 Polymorphic 6 0 86 0 CD209L: .23 1 Fixed 5 6 78 6 Polymorphic 0 4 65 4 Note.— The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics. a Mutations in introns are considered synonymous. |
table-wrap | Table 3 McDonald-Kreitman Test Results No. of Substitutions and P Value for Exonic Region Only Entire Sequencea Gene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P CD209: .04 .009 Fixed 4 5 51 5 Polymorphic 6 0 86 0 CD209L: .23 1 Fixed 5 6 78 6 Polymorphic 0 4 65 4 Note.— The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics. a Mutations in introns are considered synonymous. |
label | Table 3 |
caption | McDonald-Kreitman Test Results |
p | McDonald-Kreitman Test Results |
table | No. of Substitutions and P Value for Exonic Region Only Entire Sequencea Gene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P CD209: .04 .009 Fixed 4 5 51 5 Polymorphic 6 0 86 0 CD209L: .23 1 Fixed 5 6 78 6 Polymorphic 0 4 65 4 |
tr | No. of Substitutions and P Value for |
th | No. of Substitutions and P Value for |
tr | Exonic Region Only Entire Sequencea |
th | Exonic Region Only |
th | Entire Sequencea |
tr | Gene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P |
th | Gene and Type of Site |
th | Synonymous |
th | Nonsynonymous |
th | P |
th | Synonymous |
th | Nonsynonymous |
th | P |
tr | CD209: .04 .009 |
td | CD209: |
td | .04 |
td | .009 |
tr | Fixed 4 5 51 5 |
td | Fixed |
td | 4 |
td | 5 |
td | 51 |
td | 5 |
tr | Polymorphic 6 0 86 0 |
td | Polymorphic |
td | 6 |
td | 0 |
td | 86 |
td | 0 |
tr | CD209L: .23 1 |
td | CD209L: |
td | .23 |
td | 1 |
tr | Fixed 5 6 78 6 |
td | Fixed |
td | 5 |
td | 6 |
td | 78 |
td | 6 |
tr | Polymorphic 0 4 65 4 |
td | Polymorphic |
td | 0 |
td | 4 |
td | 65 |
td | 4 |
table-wrap-foot | Note.— The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics. |
footnote | Note.— |
p | Note.— |
footnote | The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics. |
p | The highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics. |
table-wrap-foot | a Mutations in introns are considered synonymous. |
footnote | a Mutations in introns are considered synonymous. |
label | a |
p | Mutations in introns are considered synonymous. |
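p | The 2×2 comparison underlying the McDonald-Kreitman test can be illustrated with the exonic counts of table 3. The text does not state which test produced the P values; the sketch below assumes a two-sided Fisher's exact test, which is a common choice for this contingency table, and the function name is hypothetical. |

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: fixed, polymorphic. Columns: synonymous, nonsynonymous.
    The P value sums the probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Exonic counts from table 3 (fixed syn, fixed nonsyn, poly syn, poly nonsyn).
p_cd209 = fisher_exact_two_sided(4, 5, 6, 0)
p_cd209l = fisher_exact_two_sided(5, 6, 0, 4)

print(f"CD209  exonic: P = {p_cd209:.3f}")
print(f"CD209L exonic: P = {p_cd209l:.3f}")
```

p | Under this assumption, the exonic P values round to .04 for CD209 and .23 for CD209L, in agreement with table 3. |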
sec | Neck-Region Length Variation in Worldwide Populations The identical genomic organization of CD209 and CD209L is extended to the neck region, which, in both genes, encodes a tract of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers. Of the seven alleles observed (from 4–10-repeat allele size classes), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5).
Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations. Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations CD209 CD209L Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats Location and Population Geographic Origin No. 
of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 San Namibia 14 100 35.71 64.29 .71 Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 French France 58 100 48.28 12.07 36.21 3.45 .55 French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 Uygur China 20 100 10.00 65.00 5.00 20.00 .50 East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45 Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48 Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 NAN Melanesian Bougainville 44 100 6.82 
15.91 31.82 29.55 15.91 .77 Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 Karitiana Brazil 48 100 4.17 56.25 39.58 .54 Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 Pima Mexico 50 100 8.00 64.00 28.00 .36 Mayan Mexico 50 100 16.00 44.00 40.00 .56 Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 a Heterozygosity values. b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L Findings for Neck Regions of CD209 CD209L Heterozygosity P Heterozygosity P Population Observed Expecteda ISMb SMMc Observed Expecteda ISMb SMMc African 1.6 27.9 .030 .000 50 37 .328 .229 European .6 15.3 .158 .094 62 44 .179 .304 Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 East Asian 1.2 34.5 .003 .000 47 42 .472 .054 Oceanian .0 … … … 72 53 .071 .337 American 2.8 16.3 .323 .205 45 29 .273 .440 Total sample 2.0 49.7 .002 .000 54 47 .405 .013 Note.— We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. a Under the infinite-site model. b Probability of the observed heterozygosity under the infinite-site model. c Probability of the observed heterozygosity under the stepwise mutational model. |
title | Neck-Region Length Variation in Worldwide Populations |
p | The identical genomic organization of CD209 and CD209L is extended to the neck region, which, in both genes, encodes a tract of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers. Of the seven alleles observed (from 4–10-repeat allele size classes), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5).
Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations. Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations CD209 CD209L Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats Location and Population Geographic Origin No. 
of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 San Namibia 14 100 35.71 64.29 .71 Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 French France 58 100 48.28 12.07 36.21 3.45 .55 French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 Uygur China 20 100 10.00 65.00 5.00 20.00 .50 East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45 Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48 Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 NAN Melanesian Bougainville 44 100 6.82 
15.91 31.82 29.55 15.91 .77 Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 Karitiana Brazil 48 100 4.17 56.25 39.58 .54 Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 Pima Mexico 50 100 8.00 64.00 28.00 .36 Mayan Mexico 50 100 16.00 44.00 40.00 .56 Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 a Heterozygosity values. b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L Findings for Neck Regions of CD209 CD209L Heterozygosity P Heterozygosity P Population Observed Expecteda ISMb SMMc Observed Expecteda ISMb SMMc African 1.6 27.9 .030 .000 50 37 .328 .229 European .6 15.3 .158 .094 62 44 .179 .304 Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 East Asian 1.2 34.5 .003 .000 47 42 .472 .054 Oceanian .0 … … … 72 53 .071 .337 American 2.8 16.3 .323 .205 45 29 .273 .440 Total sample 2.0 49.7 .002 .000 54 47 .405 .013 Note.— We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. a Under the infinite-site model. b Probability of the observed heterozygosity under the infinite-site model. c Probability of the observed heterozygosity under the stepwise mutational model. |
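p | To make the contrast in allelic diversity between the two genes concrete, the following minimal sketch computes Nei's unbiased gene diversity from the pooled allele frequencies in the "Total" row of table 4. This is an assumption-laden illustration: gene diversity derived from allele frequencies is a related but distinct quantity from the observed heterozygosity and from the equilibrium expectations under the infinite-site or stepwise models reported in tables 4 and 5, so the numbers need not coincide with those tables. The function name is hypothetical. |

```python
def gene_diversity(freqs_percent, n_chromosomes):
    """Nei's unbiased gene diversity from allele frequencies given in %."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    p = [f / 100 for f in freqs_percent]
    total = sum(p)
    if abs(total - 1.0) > 0.01:
        raise ValueError("allele frequencies should sum to about 100%")
    # Renormalise to absorb rounding error in the published frequencies.
    p = [x / total for x in p]
    h = 1.0 - sum(x * x for x in p)
    # Small-sample correction n / (n - 1).
    return h * n_chromosomes / (n_chromosomes - 1)


# Pooled frequencies (%) from the "Total" row of table 4 (2,128 chromosomes).
cd209_total = [0.05, 0.14, 98.97, 0.47, 0.09, 0.09, 0.14, 0.05]
cd209l_total = [0.14, 5.73, 0.33, 57.42, 11.37, 23.92, 1.08]

h_cd209 = gene_diversity(cd209_total, 2128)
h_cd209l = gene_diversity(cd209l_total, 2128)

print(f"CD209  gene diversity: {h_cd209:.3f}")
print(f"CD209L gene diversity: {h_cd209l:.3f}")
```

p | The near-fixation of the 7-repeat allele at CD209 drives its gene diversity toward zero, whereas the three common alleles at CD209L yield a value more than an order of magnitude higher. |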
figure | Figure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. |
label | Figure 5 |
caption | Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. |
p | Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site. |
table-wrap | Table 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations CD209 CD209L Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats Location and Population Geographic Origin No. of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 San Namibia 14 100 35.71 64.29 .71 Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 French France 58 100 48.28 12.07 36.21 3.45 .55 French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 Uygur China 20 100 10.00 65.00 5.00 20.00 .50 East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 Chinesec China 348 99.43 .29 .29 .01 12.07 .29 
71.26 2.30 14.08 .45 Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48 Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77 Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 Karitiana Brazil 48 100 4.17 56.25 39.58 .54 Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 Pima Mexico 50 100 8.00 64.00 28.00 .36 Mayan Mexico 50 100 16.00 44.00 40.00 .56 Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 a Heterozygosity values. b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. |
label | Table 4 |
caption | Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations |
p | Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations |
table | CD209 CD209L Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats Location and Population Geographic Origin No. of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 San Namibia 14 100 35.71 64.29 .71 Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 French France 58 100 48.28 12.07 36.21 3.45 .55 French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 Uygur China 20 100 10.00 65.00 5.00 20.00 .50 East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45 Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 Yakut Siberia 50 100 14.00 72.00 2.00 
12.00 .48 Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77 Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 Karitiana Brazil 48 100 4.17 56.25 39.58 .54 Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 Pima Mexico 50 100 8.00 64.00 28.00 .36 Mayan Mexico 50 100 16.00 44.00 40.00 .56 Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 |
tr | CD209 CD209L |
th | CD209 |
th | CD209L |
tr | Relative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats |
th | Relative Frequency (%) by No. of Repeats |
th | Relative Frequency (%) by No. of Repeats |
tr | Location and Population Geographic Origin No. of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb |
th | Location and Population |
th | Geographic Origin |
th | No. of Chromosomes |
th | 10 |
th | 8 |
th | 7 |
th | 6 |
th | 5 |
th | 4 |
th | 3 |
th | 2 |
th | HZa |
th | 10 |
th | 9 |
th | 8 |
th | 7 |
th | 6 |
th | 5 |
th | 4 |
th | HZb |
tr | Africa: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50 |
td | Africa: |
td | 254 |
td | .39 |
td | 99.21 |
td | .39 |
td | .02 |
td | .39 |
td | 62.20 |
td | 33.86 |
td | 3.54 |
td | .50 |
tr | Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47 |
td | Biaka Pygmies |
td | Central African Republic |
td | 72 |
td | 100 |
td | 65.28 |
td | 30.56 |
td | 4.17 |
td | .47 |
tr | Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47 |
td | Mbuti Pygmies |
td | Democratic Republic of Congo |
td | 30 |
td | 100 |
td | 43.33 |
td | 56.67 |
td | .47 |
tr | Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83 |
td | Bantu, northeastern |
td | Kenya |
td | 24 |
td | 100 |
td | 50.00 |
td | 37.50 |
td | 12.50 |
td | .83 |
tr | San Namibia 14 100 35.71 64.29 .71 |
td | San |
td | Namibia |
td | 14 |
td | 100 |
td | 35.71 |
td | 64.29 |
td | .71 |
tr | Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32 |
td | Yoruban |
td | Nigeria |
td | 50 |
td | 2.00 |
td | 98.00 |
td | .04 |
td | 2.00 |
td | 78.00 |
td | 20.00 |
td | .32 |
tr | Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54 |
td | Mandenkan |
td | Senegal |
td | 48 |
td | 97.92 |
td | 2.08 |
td | .04 |
td | 66.67 |
td | 29.17 |
td | 4.17 |
td | .54 |
tr | Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50 |
td | Bantu, southeastern/southwestern |
td | South Africa |
td | 16 |
td | 100 |
td | 62.50 |
td | 31.25 |
td | 6.25 |
td | .50 |
tr | Europe: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62 |
td | Europe: |
td | 322 |
td | 99.69 |
td | .31 |
td | .01 |
td | 1.86 |
td | 43.17 |
td | 14.91 |
td | 33.54 |
td | 6.52 |
td | .62 |
tr | French France 58 100 48.28 12.07 36.21 3.45 .55 |
td | French |
td | France |
td | 58 |
td | 100 |
td | 48.28 |
td | 12.07 |
td | 36.21 |
td | 3.45 |
td | .55 |
tr | French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50 |
td | French (Basque) |
td | France |
td | 48 |
td | 100 |
td | 39.58 |
td | 8.33 |
td | 39.58 |
td | 12.50 |
td | .50 |
tr | Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61 |
td | Sardinian |
td | Italy |
td | 72 |
td | 100 |
td | 1.39 |
td | 31.94 |
td | 22.22 |
td | 34.72 |
td | 9.72 |
td | .61 |
tr | North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79 |
td | North Italian |
td | Italy (Bergamo) |
td | 28 |
td | 100 |
td | .00 |
td | 46.43 |
td | 21.43 |
td | 28.57 |
td | 3.57 |
td | .79 |
tr | Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69 |
td | Orcadian |
td | Orkney Islands |
td | 32 |
td | 100 |
td | 9.38 |
td | 46.88 |
td | 9.38 |
td | 28.13 |
td | 6.25 |
td | .69 |
tr | Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84 |
td | Russian |
td | Russia |
td | 50 |
td | 100 |
td | 2.00 |
td | 48.00 |
td | 12.00 |
td | 34.00 |
td | 4.00 |
td | .84 |
tr | Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35 |
td | Adygei |
td | Russian Caucasus |
td | 34 |
td | 97.06 |
td | 2.94 |
td | .06 |
td | 2.94 |
td | 50.00 |
td | 17.65 |
td | 26.47 |
td | 2.94 |
td | .35 |
tr | Middle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61 |
td | Middle East: |
td | 356 |
td | .28 |
td | 97.19 |
td | 1.97 |
td | .28 |
td | .28 |
td | .06 |
td | .84 |
td | .28 |
td | 56.46 |
td | 17.13 |
td | 24.72 |
td | .56 |
td | .61 |
tr | Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67 |
td | Druze |
td | Israel (Carmel) |
td | 96 |
td | 96.88 |
td | 3.13 |
td | .06 |
td | 1.04 |
td | 1.04 |
td | 53.13 |
td | 21.88 |
td | 22.92 |
td | .67 |
tr | Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65 |
td | Palestinian |
td | Israel (Central) |
td | 102 |
td | .98 |
td | 99.02 |
td | .02 |
td | .98 |
td | 56.86 |
td | 14.71 |
td | 27.45 |
td | .65 |
tr | Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51 |
td | Bedouin |
td | Israel (Negev) |
td | 98 |
td | 96.94 |
td | 3.06 |
td | .06 |
td | 1.02 |
td | 58.16 |
td | 14.29 |
td | 24.49 |
td | 2.04 |
td | .51 |
tr | Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60 |
td | Mozabite |
td | Algeria (Mzab) |
td | 60 |
td | 95.00 |
td | 1.67 |
td | 1.67 |
td | 1.67 |
td | .1 |
td | 58.33 |
td | 18.33 |
td | 23.33 |
td | .60 |
tr | Central/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52 |
td | Central/South Asia: |
td | 420 |
td | .24 |
td | 99.29 |
td | .24 |
td | .24 |
td | .01 |
td | 3.81 |
td | .95 |
td | 63.57 |
td | 4.29 |
td | 27.38 |
td | .52 |
tr | Pakistani^b Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52 |
td | Pakistani^b |
td | Pakistan |
td | 400 |
td | .25 |
td | 99.25 |
td | .25 |
td | .25 |
td | .02 |
td | 3.50 |
td | 1.00 |
td | 63.50 |
td | 4.25 |
td | 27.75 |
td | .52 |
tr | Uygur China 20 100 10.00 65.00 5.00 20.00 .50 |
td | Uygur |
td | China |
td | 20 |
td | 100 |
td | 10.00 |
td | 65.00 |
td | 5.00 |
td | 20.00 |
td | .50 |
tr | East Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47 |
td | East Asia: |
td | 482 |
td | .21 |
td | 99.38 |
td | .21 |
td | .21 |
td | .01 |
td | 11.83 |
td | .21 |
td | 70.12 |
td | 2.49 |
td | 15.35 |
td | .47 |
tr | Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36 |
td | Cambodian |
td | Cambodia |
td | 22 |
td | 100 |
td | 18.18 |
td | 68.18 |
td | 4.55 |
td | 9.09 |
td | .36 |
tr | Chinese^c China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45 |
td | Chinese^c |
td | China |
td | 348 |
td | 99.43 |
td | .29 |
td | .29 |
td | .01 |
td | 12.07 |
td | .29 |
td | 71.26 |
td | 2.30 |
td | 14.08 |
td | .45 |
tr | Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58 |
td | Japanese |
td | Japan |
td | 62 |
td | 1.61 |
td | 98.39 |
td | .03 |
td | 6.45 |
td | 62.90 |
td | 3.23 |
td | 27.42 |
td | .58 |
tr | Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48 |
td | Yakut |
td | Siberia |
td | 50 |
td | 100 |
td | 14.00 |
td | 72.00 |
td | 2.00 |
td | 12.00 |
td | .48 |
tr | Oceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72 |
td | Oceania: |
td | 78 |
td | 100 |
td | 3.85 |
td | 26.92 |
td | 30.77 |
td | 21.79 |
td | 16.67 |
td | .72 |
tr | Papuan New Guinea 34 100 41.18 29.41 11.76 17.65 .65 |
td | Papuan |
td | New Guinea |
td | 34 |
td | 100 |
td | 41.18 |
td | 29.41 |
td | 11.76 |
td | 17.65 |
td | .65 |
tr | NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77 |
td | NAN Melanesian |
td | Bougainville |
td | 44 |
td | 100 |
td | 6.82 |
td | 15.91 |
td | 31.82 |
td | 29.55 |
td | 15.91 |
td | .77 |
tr | Americas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45 |
td | Americas: |
td | 216 |
td | 98.61 |
td | 1.39 |
td | .03 |
td | 8.80 |
td | 43.98 |
td | 47.22 |
td | .45 |
tr | Karitiana Brazil 48 100 4.17 56.25 39.58 .54 |
td | Karitiana |
td | Brazil |
td | 48 |
td | 100 |
td | 4.17 |
td | 56.25 |
td | 39.58 |
td | .54 |
tr | Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33 |
td | Surui |
td | Brazil |
td | 42 |
td | 92.86 |
td | 7.14 |
td | .14 |
td | 16.67 |
td | 83.33 |
td | .33 |
tr | Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46 |
td | Piapoco and Curripaco |
td | Colombia |
td | 26 |
td | 100 |
td | 19.23 |
td | 26.92 |
td | 53.85 |
td | .46 |
tr | Pima Mexico 50 100 8.00 64.00 28.00 .36 |
td | Pima |
td | Mexico |
td | 50 |
td | 100 |
td | 8.00 |
td | 64.00 |
td | 28.00 |
td | .36 |
tr | Mayan Mexico 50 100 16.00 44.00 40.00 .56 |
td | Mayan |
td | Mexico |
td | 50 |
td | 100 |
td | 16.00 |
td | 44.00 |
td | 40.00 |
td | .56 |
tr | Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54 |
td | Total |
td | 2,128 |
td | .05 |
td | .14 |
td | 98.97 |
td | .47 |
td | .09 |
td | .09 |
td | .14 |
td | .05 |
td | .02 |
td | .14 |
td | 5.73 |
td | .33 |
td | 57.42 |
td | 11.37 |
td | 23.92 |
td | 1.08 |
td | .54 |
table-wrap-foot | a Heterozygosity values. |
footnote | a Heterozygosity values. |
label | a |
p | Heterozygosity values. |
table-wrap-foot | b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. |
footnote | b Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. |
label | b |
p | Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash. |
table-wrap-foot | c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Oroqen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. |
footnote | c Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Oroqen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. |
label | c |
p | Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Oroqen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi. |
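As a point of comparison for the heterozygosity columns of the table above, the expected heterozygosity (Nei's gene diversity) can be computed from a row of allele frequencies. The sketch below is illustrative only: the function name is hypothetical, the small-sample correction n/(n−1) is an assumption, and the source does not state how its HZ values were obtained, so the result need not coincide with the tabulated value for every population.

```python
def gene_diversity(freqs_percent, n_chromosomes):
    """Expected heterozygosity (Nei's gene diversity) from allele frequencies.

    freqs_percent: allele frequencies in percent, as printed in the table.
    n_chromosomes: number of chromosomes sampled (used for the n/(n-1) correction,
                   which is an assumption of this sketch).
    """
    p = [f / 100.0 for f in freqs_percent]
    total = sum(p)
    p = [x / total for x in p]  # renormalise to absorb rounding in the printed values
    homozygosity = sum(x * x for x in p)
    n = n_chromosomes
    return (n / (n - 1)) * (1.0 - homozygosity)


# CD209L repeat-allele frequencies for the pooled African sample (254 chromosomes)
africa_cd209l = [0.39, 62.20, 33.86, 3.54]
h = gene_diversity(africa_cd209l, 254)
print(round(h, 3))
```

For the pooled African sample this gives a value of roughly one half, but population-level rows can differ from such an estimate if the tabulated heterozygosity was obtained by direct counting of heterozygous individuals.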
table-wrap | Table 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L Findings for Neck Regions of CD209 CD209L Heterozygosity P Heterozygosity P Population Observed Expected^a ISM^b SMM^c Observed Expected^a ISM^b SMM^c African 1.6 27.9 .030 .000 50 37 .328 .229 European .6 15.3 .158 .094 62 44 .179 .304 Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 East Asian 1.2 34.5 .003 .000 47 42 .472 .054 Oceanian .0 … … … 72 53 .071 .337 American 2.8 16.3 .323 .205 45 29 .273 .440 Total sample 2.0 49.7 .002 .000 54 47 .405 .013 Note.— We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. a Under the infinite-site model. b Probability of the observed heterozygosity under the infinite-site model. c Probability of the observed heterozygosity under the stepwise mutational model. |
label | Table 5 |
caption | Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L |
p | Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L |
table | Findings for Neck Regions of CD209 CD209L Heterozygosity P Heterozygosity P Population Observed Expected^a ISM^b SMM^c Observed Expected^a ISM^b SMM^c African 1.6 27.9 .030 .000 50 37 .328 .229 European .6 15.3 .158 .094 62 44 .179 .304 Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 East Asian 1.2 34.5 .003 .000 47 42 .472 .054 Oceanian .0 … … … 72 53 .071 .337 American 2.8 16.3 .323 .205 45 29 .273 .440 Total sample 2.0 49.7 .002 .000 54 47 .405 .013 |
tr | Findings for Neck Regions of |
th | Findings for Neck Regions of |
tr | CD209 CD209L |
th | CD209 |
th | CD209L |
tr | Heterozygosity P Heterozygosity P |
th | Heterozygosity |
th | P |
th | Heterozygosity |
th | P |
tr | Population Observed Expected^a ISM^b SMM^c Observed Expected^a ISM^b SMM^c |
th | Population |
th | Observed |
th | Expected^a |
th | ISM^b |
th | SMM^c |
th | Observed |
th | Expected^a |
th | ISM^b |
th | SMM^c |
tr | African 1.6 27.9 .030 .000 50 37 .328 .229 |
td | African |
td | 1.6 |
td | 27.9 |
td | .030 |
td | .000 |
td | 50 |
td | 37 |
td | .328 |
td | .229 |
tr | European .6 15.3 .158 .094 62 44 .179 .304 |
td | European |
td | .6 |
td | 15.3 |
td | .158 |
td | .094 |
td | 62 |
td | 44 |
td | .179 |
td | .304 |
tr | Middle Eastern 5.6 43.1 .018 .000 61 49 .299 .095 |
td | Middle Eastern |
td | 5.6 |
td | 43.1 |
td | .018 |
td | .000 |
td | 61 |
td | 49 |
td | .299 |
td | .095 |
tr | Central/South Asian 1.4 35.1 .003 .000 52 43 .387 .098 |
td | Central/South Asian |
td | 1.4 |
td | 35.1 |
td | .003 |
td | .000 |
td | 52 |
td | 43 |
td | .387 |
td | .098 |
tr | East Asian 1.2 34.5 .003 .000 47 42 .472 .054 |
td | East Asian |
td | 1.2 |
td | 34.5 |
td | .003 |
td | .000 |
td | 47 |
td | 42 |
td | .472 |
td | .054 |
tr | Oceanian .0 … … … 72 53 .071 .337 |
td | Oceanian |
td | .0 |
td | … |
td | … |
td | … |
td | 72 |
td | 53 |
td | .071 |
td | .337 |
tr | American 2.8 16.3 .323 .205 45 29 .273 .440 |
td | American |
td | 2.8 |
td | 16.3 |
td | .323 |
td | .205 |
td | 45 |
td | 29 |
td | .273 |
td | .440 |
tr | Total sample 2.0 49.7 .002 .000 54 47 .405 .013 |
td | Total sample |
td | 2.0 |
td | 49.7 |
td | .002 |
td | .000 |
td | 54 |
td | 47 |
td | .405 |
td | .013 |
table-wrap-foot | Note.— We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. |
footnote | Note.— |
p | Note.— |
footnote | We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. |
p | We presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations was observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). Significant P values are shown in bold italics. |
table-wrap-foot | a Under the infinite-site model. |
footnote | a Under the infinite-site model. |
label | a |
p | Under the infinite-site model. |
table-wrap-foot | b Probability of the observed heterozygosity under the infinite-site model. |
footnote | b Probability of the observed heterozygosity under the infinite-site model. |
label | b |
p | Probability of the observed heterozygosity under the infinite-site model. |
table-wrap-foot | c Probability of the observed heterozygosity under the stepwise mutational model. |
footnote | c Probability of the observed heterozygosity under the stepwise mutational model. |
label | c |
p | Probability of the observed heterozygosity under the stepwise mutational model. |
sec | Time of the Most Recent Common Ancestor for CD209 The low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavaré 1994) for estimation of the time scale of the origin and evolution of this gene. Since this method assumes an infinite-site model without recombination, the same analysis was not conducted for CD209L because of the substantial number of recombinant haplotypes observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6. The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed on both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples were restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ (θML) for CD209 was 8.4. On the basis of this θML value and the estimated mutation rate (1.54×10^−4 per gene per generation), the effective population size (Ne, from θ=4Neμ) was estimated at 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The TMRCA of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest TMRCA values estimated so far in the human genome (Excoffier 2002). Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
title | Time of the Most Recent Common Ancestor for CD209 |
p | The low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavaré 1994) for estimation of the time scale of the origin and evolution of this gene. Since this method assumes an infinite-site model without recombination, the same analysis was not conducted for CD209L because of the substantial number of recombinant haplotypes observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6. The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed on both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples were restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ (θML) for CD209 was 8.4. On the basis of this θML value and the estimated mutation rate (1.54×10^−4 per gene per generation), the effective population size (Ne, from θ=4Neμ) was estimated at 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The TMRCA of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest TMRCA values estimated so far in the human genome (Excoffier 2002). Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
figure | Figure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
label | Figure 6 |
caption | CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
p | CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. Lineage absolute frequencies in Africa, Europe, and East Asia are reported. |
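The effective population size reported in this section can be recovered from the two quoted quantities, assuming the standard diploid relation θ = 4Neμ. The following is a minimal sketch of that arithmetic; the function and variable names are hypothetical and are not taken from the authors' software.

```python
def effective_population_size(theta, mu_per_gene_per_generation):
    """Solve theta = 4 * Ne * mu for Ne (diploid autosomal locus assumed)."""
    return theta / (4.0 * mu_per_gene_per_generation)


theta_ml = 8.4   # maximum-likelihood estimate of theta for CD209 (from the text)
mu = 1.54e-4     # mutation rate per gene per generation (from the text)

ne = effective_population_size(theta_ml, mu)
print(round(ne))  # 13636
```

The result agrees with the value of 13,636 given in the text.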
sec | Discussion The CD209/CD209L region possesses a number of characteristics that make it a powerful tool for evolutionary inference. These two genes are not in LD, despite their very close physical vicinity (∼15 kb), and each of them behaves as an independent genetic entity. Moreover, our results suggest that the CD209/CD209L region is a uniform landscape of genomic forces, since the two lectin-coding genes present similar mutation rates, as well as high nucleotide identity and conserved exon-intron organization (fig. 1). Contrasting Patterns of Diversity in the CD209/CD209L Region Our diversity study revealed completely different patterns for the two genes. First, levels of nucleotide diversity (π) were found to be much lower for CD209 than for CD209L (table 2). On the basis of 1.42 million SNPs, the International SNP Map Working Group defined 7.5×10^−4 as the average value of nucleotide diversity for the human genome and showed that 95% of all bins presented π values varying from 2.0×10^−4 to 15.8×10^−4 (Sachidanandam et al. 2001). In addition, an independent study analyzed nucleotide and haplotype diversity for 313 genes and defined the average π value as 5.4×10^−4 (Stephens et al. 2001). In this context, the values observed for CD209 (3–7×10^−4) are in agreement with these genome estimations, with the exception of the African sample, which showed extreme levels of diversity (26.0×10^−4) because of the presence of cluster A. By contrast, the π values observed for CD209L (16–18×10^−4) are at least twofold higher than average genome estimates and fall at the upper limit of the 95% CI defined by the SNP Consortium (Sachidanandam et al. 2001). This contrast in nucleotide diversity between the two genes can be explained either by a disparity in local mutation rates or by actual differences in selective pressures. However, no major differences in mutation rates (1.57×10^−9 vs. 
1.70×10^−9) were observed between the two homologues, nor was there substantial variation in GC content, which has been positively correlated with mutation rates and levels of polymorphism (Sachidanandam et al. 2001; Smith et al. 2002; Waterston et al. 2002; Hellmann et al. 2003). Indeed, the GC content for CD209 (53.7%) was slightly higher than that observed for CD209L (50.9%), which reinforces the idea that different selective pressures may have been the driving force behind the distinct patterns of diversity observed. Second, the patterns of repeat variation in the neck region also turned out to be strikingly different between the two genes. CD209 showed levels of heterozygosity of only 2%, whereas CD209L presented an extraordinarily high level of worldwide diversity, with an overall heterozygosity of 54% (table 5 and fig. 5). Although the neck regions of both genes share 92% nucleotide identity, nonuniform mutation rates could, again, explain the patterns observed. However, this does not seem to be the case, since mutation-rate variation should influence the number of alleles observed rather than their frequencies, which are subject either to genetic drift or to natural selection. Indeed, we observed an even higher number of repeat alleles for CD209 (eight alleles) than for CD209L (seven alleles) (table 4 and fig. 5). Overall, differences in genomic forces seem to be insufficient to explain the contrasting patterns observed at both the sequence and neck-region length variation levels; therefore, differential selective pressures acting on these genes become the most plausible scenario. CD209: The Signature of a Functional Constraint For CD209, not only nucleotide diversity but also FST intercontinental values (0.15) were in conformity with previous worldwide estimations (Harpending and Rogers 2000; Akey et al. 2002; Cavalli-Sforza and Feldman 2003). 
For frequency-spectrum–based tests, only Fay and Wu's H test detected an excess of high-frequency derived alleles for the African and East Asian samples, a picture that may be interpreted as the result of a selective sweep. However, the significantly negative value observed in Africa is, again, exclusively due to the presence of cluster A, since 22 of the 35 fixed SNPs distinguishing it from cluster B corresponded to the derived allelic status in the latter cluster. Because cluster B accounts for 85% of the African variability, a clear excess of high-frequency derived alleles was observed. The extent to which the presence of this cluster is due to either natural selection or population structure will be discussed in detail below. For East Asia, the significance of the H test is also questionable when accounting for the confounding effects of demography. Indeed, when we plotted our H value against the empirical distribution of 132 H values from non-African populations (Akey et al. 2004), the East Asian P value became nonsignificant (P=.36). This observation reinforces the idea that the H test is particularly sensitive to past bottlenecks and/or population subdivision (Przeworski 2002). Thus, regarding the global levels of sequence diversity, the CD209 locus seems to evolve neutrally. Nevertheless, when we focused our analyses at the protein level, signs of natural selection were uncovered. Indeed, the McDonald-Kreitman test rejected neutrality for this gene because of a clear excess of polymorphic synonymous sites (i.e., a lack of nonsynonymous variants). In addition, when the number of synonymous sites (146) versus nonsynonymous sites (499) was compared with the observed number of synonymous (5) versus nonsynonymous (0) mutations, we detected a significant lack of nonsynonymous mutations (two-tailed Fisher exact test, P=6.3×10^−4). 
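The Fisher exact comparison just described can be sketched with the four counts quoted in the text. This is an illustration under stated assumptions rather than the authors' procedure: the exact layout of the 2×2 table (mutations in one row, sites in the other) is assumed, the function name is hypothetical, and the two-sided P value is taken as the sum of all table probabilities no larger than that of the observed table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: observed mutations, available sites; columns: synonymous, nonsynonymous.
# This table layout is an assumption of the sketch.
p_value = fisher_exact_two_sided(5, 0, 146, 499)
print(f"{p_value:.2e}")
```

Under this assumed layout the sketch returns a P value of the same order as the one reported (a few parts in ten thousand); a slightly different table construction would shift the exact figure.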
These observations point to a strong selective constraint acting on CD209 that prevents the accumulation of amino acid replacements over time. Further support for a functional constraint in CD209 comes from the patterns of diversity observed in the neck region. In contrast to CD209L, virtually no variation was observed at CD209 (fig. 5A), with the 7-repeat allele accounting for 99% of the total variability. Moreover, the low levels of heterozygosity observed resulted in a consistent rejection of mutation-drift equilibrium in almost all geographical regions (table 5). The probability of finding such a low heterozygosity value, given the overall number of alleles observed, was estimated to be <0.2%, independent of the mutational model considered (table 5). Thus, the fact that no alleles other than the 7-repeat allele have increased in frequency, together with recent studies addressing the functional consequences of repeat-number variation in this region (Bernhard et al. 2004; Feinberg et al. 2005), strongly suggests a clearly reduced fitness of any allele other than the 7-repeat allele. Interestingly, it has recently been shown that a protein with two fewer repeats (a 5-repeat allele) results in a partial dissociation of the final tetramer, whereas a protein with <5 repeats exhibits a dramatic reduction in overall stability (Feinberg et al. 2005), with all these differences having a direct impact on the quality of ligand-binding functions (Bernhard et al. 2004). Taken together, the patterns of diversity observed at CD209 clearly point to a strong functional constraint acting on this gene and further support the proposed crucial role of this lectin in pathogen recognition and in the early steps of immune response (Geijtenbeek et al. 2000b, 2004). CD209L: Relaxation of the Functional Constraint or Balancing Selection? In clear contrast to its homologue, CD209L presented extremely elevated nucleotide-diversity levels. 
High levels of diversity can result either from a relaxation of the functional constraint, which allows the stochastic accumulation of new mutations, or from the action of balancing selection, which maintains over time two or more functionally different alleles (and all linked variation) at intermediate frequencies. Several lines of evidence lend support to the selective hypothesis. First, if CD209L nucleotide diversity has been driven by the action of balancing selection, population-genetics relationships would have been altered accordingly. In this context, diversity studies in neutral, or presumably neutral, regions of the genome, such as the Y chromosome (Underhill et al. 2000; Hammer et al. 2001; Jobling and Tyler-Smith 2003), mtDNA (Wallace et al. 1999; Ingman et al. 2000; Mishmar et al. 2003), Alu insertions (Watkins et al. 2001), and some autosomal genes (Stephens et al. 2001; Akey et al. 2004), showed that African populations are genetically more diverse than are non-Africans, an observation generally interpreted as support for the “Out of Africa” model for the origin of modern humans (Lewin 1987). For CD209L, even though we observed 1.5 times more segregating sites in African than in non-African populations, as indicated by the higher θw value found in Africa, similar values of nucleotide diversity were detected in the three groups, with Europeans presenting even higher π values than do Africans. This unusual scenario, which is at odds with neutral expectations, has already been described for other regions of the genome, such as the β-globin gene and the 5′ cis-regulatory region of CCR5, for which the action of balancing selection has been convincingly proposed (Harding et al. 1997; Bamshad et al. 2002). Second, balancing selection tends to increase within-population diversity while decreasing FST, compared with neutrally evolving loci (Cavalli-Sforza 1966; Harpending and Rogers 2000; Akey et al. 
2002; Bamshad and Wooding 2003; Cavalli-Sforza and Feldman 2003). Indeed, our data are compatible with these predictions, since the 5% FST value observed for CD209L is threefold lower than that estimated for CD209 (15%) and is similar to that found, for example, for the bitter-taste receptor gene (5.6%), for which there is compelling evidence of balancing-selection action (Wooding et al. 2004). Third, results of our Tajima's D analysis were significantly positive for European and East Asian populations, because of the skew of the CD209L frequency spectrum toward an excess of intermediate-frequency alleles (table 2), a pattern that further supports the action of balancing selection. However, since the null model used to assess significance makes unrealistic assumptions about past population demography (i.e., constant population sizes), the rejection of the standard neutral model cannot be interpreted as unambiguous evidence of selection. Indeed, the observation that only non-African populations showed a significant departure from neutrality raises the question of whether these patterns could have resulted instead from the bottleneck that occurred during the Out of Africa exodus. A way to circumvent this conundrum is to take into account the fact that demography affects the whole genome equally, whereas selection directs its effects toward specific loci. Thus, to correct for the confounding effects of demography, we plotted our results against the empirical distributions of Akey et al. (2004) for Tajima's D statistics. Our values remained significant for CD209L, which therefore reinforces the idea that the pattern observed is unlikely to be the sole result of demography. Last, if the patterns of variation in CD209L represent the molecular signature of balancing selection, at least in non-Africans, then a functional target of such a selective regime is needed. 
In this context, the neck region constitutes an excellent candidate, since it plays a major mediating role in the orientation and flexibility of the carbohydrate-recognition domain. Since this domain is directly involved in pathogen recognition, neck-region length variation has important consequences for the pathogen-binding properties of these lectins (Mitchell et al. 2001; Bernhard et al. 2004; Feinberg et al. 2005). In perfect agreement with the results of our sequence-based data set, higher diversity in repeat variation was observed in the neck region among non-African populations (Native Americans excepted). Out of Africa, at least three alleles account for most population diversity, whereas, in Africa, the 6- and 7-repeat alleles alone account for 96% of the global variability (fig. 5B). Again, the higher diversity observed out of Africa could be due to a higher level of relaxation of the functional constraint of the neck region in non-African compared with African populations, which would lead to a random accumulation of proteins with varying neck-region lengths among non-Africans. Conversely, these patterns could also be explained by the action of balancing selection in non-Africans and could therefore point to the neck region as the functional target of such a selective regime. To evaluate the plausibility of these two conflicting scenarios, we compared the variation in the CD209L neck region with that inferred from 377 neutral autosomal microsatellites typed elsewhere for the same population panel (Rosenberg et al. 2002). We reasoned that if CD209L diversity has been shaped only by demography (i.e., bottleneck out of Africa), the distribution of genetic variance at different hierarchical levels should be comparable to that inferred through the neutral markers. On the other hand, if selection has driven the CD209L neck-region diversity, population-genetics distances would be influenced accordingly and would therefore differ from neutral expectations. 
Indeed, the AMOVA values inferred for CD209L fell systematically outside the 95% CI defined for the microsatellite data set (table 6). We observed that populations within Europe, Asia, the Middle East, and Oceania exhibited lower-than-expected diversity among populations within the same region. A reduction of genetic distances between populations is expected under balancing selection; therefore, the results from the CD209L neck region favor, once again, the action of this selective regime in most non-African populations, to the detriment of the neutral hypothesis. One may argue that the differences in the proportions of genetic variance between our data and those of Rosenberg et al. (2002) could be due to differences in the pace of mutation between microsatellite loci and our neck repeated region, which could be considered a “coding minisatellite.” However, under neutrality, differences in mutation rate should have a similar and proportional effect in all population comparisons and should influence all values with a similar tendency (i.e., higher or lower values). This is not the case: populations within Europe, the Middle East, Central/South Asia, East Asia, and Oceania turned out to be genetically closer than expected, whereas populations within Africa and the Americas exhibited the opposite pattern (table 6), which makes it highly unlikely that mutation-rate differences influenced our conclusions. Table 6 AMOVA for the Neck Region of CD209L AMOVA Value (95% CI) Inferred for CD209L^b Sample^a No. of Regions No. 
of Populations Within Populations Among Populations within Regions Among Regions World 7 52 90.4 (93.8–94.3) 2.1 (2.3–2.5) 7.57 (3.3–3.9) Africa 1 6 93.9 (96.7–97.1) 6.1 (2.9–3.3) Eurasia 3 21 97.0 (98.2–98.4) .2 (1.1–1.3) 2.8 (.4–.6) Europe 1 8 99.5 (99.1–99.4) .5 (0.6–0.9) Middle East 1 4 100 (98.6–98.8) 0 (1.2–1.4) Central/South Asia 1 9 99.5 (98.5–98.8) .5 (1.2–1.5) East Asia 1 18 99.3 (98.6–98.9) .7 (1.1–1.4) Oceania 1 2 96.0 (92.8–94.3) 4.0 (5.7–7.2) America 1 5 86.7 (87.7–89) 13.3 (11.0–12.3) Note.— No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. a Populations are grouped as described by Rosenberg et al. (2002). b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). Taken together, the integration of the results from levels of nucleotide and amino acid diversity, neutrality tests, population-genetics distances, and neck-region length variation in CD209 and CD209L clearly points to a situation in which CD209 has been under a strong selective constraint that prevents accumulation of any amino acid changes over time, whereas CD209L variability has most likely been driven by the action of balancing selection, at least in non-African populations. The Footprints of Ancestral Population Diversity In apparent contrast with the strong selective constraint described for CD209, we observed an unusual excess of diversity, with 35 fixed differences separating the two basal branches of the gene tree (fig. 6). In addition, we estimated a TMRCA of 2.8±0.22 MYA, a time that places the most recent common ancestor of CD209 back in the Pliocene epoch, before the estimated time for the origins of the genus Homo ∼1.9 MYA (Wood 1996; Wood and Collard 1999). A number of studies have already reported loci that present unusually deep coalescent times (Harris and Hey 1999; Zhao et al. 2000; Webster et al. 
2003; Garrigan et al. 2005a, 2005b), but our estimate for CD209 remains one of the deepest TMRCA values yet reported (Excoffier 2002). The probability of finding such a deep coalescence time under a scenario of a random-mating population was estimated, through a coalescent process (Laval and Excoffier 2004), to be very low (P=.018) (see fig. 7). In addition to the unexpected antiquity of the CD209 locus, we observed a peculiar tree topology made of two highly divergent and frequency-unbalanced lineages, cluster A comprising only 2 internal haplotypes and cluster B comprising the remaining 23 (fig. 6).

Figure 7. Coalescent-based simulations (2×10⁴) of the expected TMRCA distribution of CD209.

Different hypotheses can account for such long and divergent haplotype patterns. For instance, the high levels of nucleotide identity between CD209 and CD209L could have led to gene conversion between the two genes, an event that would explain the outlier position of cluster A in the context of CD209 phylogeny. We reasoned that, if gene conversion had occurred, the derived alleles distinguishing clusters A and B in CD209 would correspond to the allelic state observed at their homologous positions in CD209L. Of all positions, only four fit this criterion. In addition, these positions were not physically clustered, which excludes a major gene-conversion event as the explanation of the divergent CD209 phylogeny. Two other circumstances may be responsible for the topology and the time depth of the CD209 gene tree: long-standing balancing selection or ancient population structure, with Africa, in both cases, being the arena of such events (i.e., cluster A is restricted to Africa). Several lines of evidence argue against the balancing-selection hypothesis. First, under this selective regime, one would expect Tajima's D test to point in the same direction by yielding significantly positive values, which is not the case (table 2).
Second, such long-standing balancing selection in Africa would have entailed a number of recombinant haplotypes between clusters A and B, which, again, is not the case, as illustrated by the high LD levels at CD209 (fig. 3). Third, a claim of balancing selection at this locus must imply a functional difference between the two balanced alleles. Indeed, three nonsynonymous mutations, situated in the neck region, separate clusters A and B, and they could correspond to the alleles under selection. But, if the neck region is the target of selection, it is more likely that the balanced alleles would correspond to different numbers of repeats rather than to point nucleotide variation within each repeat tract, as observed for CD209L and suggested by functional studies (Bernhard et al. 2004; Feinberg et al. 2005). Since no variation in the number of repeats was detected between the two clusters, we predict that there are no major functional differences between the two lineages. Taken together, these observations indicate that maintenance of ancient lineages by balancing selection does not seem to be responsible for the observed haplotype divergence. In this view, the patterns observed are best explained by an ancestral population structure on the African continent. Indeed, several studies have already proposed that African populations must have been more strongly subdivided and isolated than non-African ones (Harris and Hey 1999; Labuda et al. 2000; Excoffier 2002; Goldstein and Chikhi 2002; Harding and McVean 2004; Satta and Takahata 2004; Garrigan et al. 2005a). In particular, a recent study of the Xp21.1 locus presented convincing statistical evidence that supports the hypothesis that our species does not descend from a single, historically panmictic population (Garrigan et al. 2005a). The divergent haplotype pattern observed at the Xp21.1 locus prompted those authors to explain their data under the isolation-and-admixture (IAA) model and/or a metapopulation model (Harding and McVean 2004; Wakeley 2004).
Indeed, as observed for CD209, under an IAA model, the two basal branches are expected to be longer than those under a Wright-Fisher model, depending on the length of time subpopulations spent in isolation. The extent to which the IAA model fits the data depends on the number of mutations, referred to as “congruent sites,” occurring in the two basal branches of the genealogy. For Xp21.1, 10 congruent sites out of 24 polymorphisms were observed (i.e., ∼42% of the total number of sites). We applied the same approach to CD209 and obtained a very similar percentage of ∼45%, in good accordance with the IAA model. Our observations, together with a number of autosomal diversity studies, suggest that modern human diversity has retained genetic traces of admixture among archaic hominid populations. However, a number of questions remain unanswered, such as the time when these admixture events occurred (i.e., before or after the appearance of anatomically modern humans), the precise quantitative contribution of ancient genetic material to our modern gene pool, and the geographic provenance of these genetic vestiges. |
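Returning to the hierarchical comparison summarized in table 6, the direction of each departure from the microsatellite-based 95% CI can be tabulated with a short sketch. The values are the among-populations-within-regions components copied from table 6; the names `TABLE6`, `classify`, and `summarize` are illustrative and not part of the original analysis, which was an AMOVA rather than this simple interval check.

```python
# Among-populations-within-regions variance component (%) for the CD209L neck
# region, paired with the 95% CI from the 377 microsatellites (table 6).
TABLE6 = {
    "Africa": (6.1, (2.9, 3.3)),
    "Europe": (0.5, (0.6, 0.9)),
    "Middle East": (0.0, (1.2, 1.4)),
    "Central/South Asia": (0.5, (1.2, 1.5)),
    "East Asia": (0.7, (1.1, 1.4)),
    "Oceania": (4.0, (5.7, 7.2)),
    "America": (13.3, (11.0, 12.3)),
}


def classify(observed, ci):
    """Return 'below', 'above', or 'within' relative to the interval ci."""
    low, high = ci
    if observed < low:
        return "below"
    if observed > high:
        return "above"
    return "within"


def summarize(table):
    """Classify every region's observed value against its neutral CI."""
    return {region: classify(obs, ci) for region, (obs, ci) in table.items()}


if __name__ == "__main__":
    for region, verdict in summarize(TABLE6).items():
        print(f"{region}: {verdict} the neutral 95% CI")
```

If mutation-rate differences alone were responsible, all regions would be expected to depart in the same direction; the mixed pattern of "below" and "above" is the point made in the text.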
title | Discussion |
p | The CD209/CD209L region possesses a number of characteristics that make it a powerful tool for evolutionary inference. These two genes are not in LD, despite their very close physical vicinity (∼15 kb), and each of them behaves as an independent genetic entity. Moreover, our results suggest that the CD209/CD209L region is a uniform landscape of genomic forces, since the two lectin-coding genes present similar mutation rates, as well as high nucleotide identity and conserved exon-intron organization (fig. 1). |
title | Contrasting Patterns of Diversity in the CD209/CD209L Region |
p | Our diversity study revealed completely different patterns for the two genes. First, levels of nucleotide diversity (π) were found to be much lower for CD209 than for CD209L (table 2). On the basis of 1.42 million SNPs, the International SNP Map Working Group reported 7.5×10⁻⁴ as the average value of nucleotide diversity for the human genome and showed that 95% of all bins presented π values varying from 2.0×10⁻⁴ to 15.8×10⁻⁴ (Sachidanandam et al. 2001). In addition, an independent study analyzed nucleotide and haplotype diversity for 313 genes and reported an average π value of 5.4×10⁻⁴ (Stephens et al. 2001). In this context, the values observed for CD209 (3–7×10⁻⁴) are in agreement with these genome estimates, with the exception of the African sample, which showed extreme levels of diversity (26.0×10⁻⁴) because of the presence of cluster A. By contrast, the π values observed for CD209L (16–18×10⁻⁴) are at least twofold higher than average genome estimates and fall at the upper limit of the 95% interval defined by the SNP Consortium (Sachidanandam et al. 2001). This contrast in nucleotide diversity between the two genes can be explained either by a disparity in local mutation rates or by actual differences in selective pressures. However, no major differences in mutation rates (1.57×10⁻⁹ vs. 1.70×10⁻⁹) were observed between the two homologues, nor was there substantial variation in GC content, which has been positively correlated with mutation rates and levels of polymorphism (Sachidanandam et al. 2001; Smith et al. 2002; Waterston et al. 2002; Hellmann et al. 2003). Indeed, the GC content for CD209 (53.7%) was slightly higher than that observed for CD209L (50.9%), which reinforces the idea that different selective pressures may have been the driving force behind the distinct patterns of diversity observed. Second, the patterns of repeat variation in the neck region also turned out to be strikingly different between the two genes.
CD209 showed levels of heterozygosity of only 2%, whereas CD209L presented an extraordinarily high level of worldwide diversity, with an overall heterozygosity of 54% (table 5 and fig. 5). Although the neck regions of both genes share 92% of nucleotide identity, nonuniform mutation rates could, again, explain the patterns observed. However, this does not seem to be the case, since mutation-rate variation should influence the number of alleles observed rather than their frequencies, which are subject either to genetic drift or to natural selection. Indeed, we observed an even higher number of repeat alleles for CD209 (eight alleles) than for CD209L (seven alleles) (table 4 and fig. 5). Overall, differences in genomic forces seem to be insufficient to explain the contrasting patterns observed at both the sequence and neck-region length variation levels; therefore, the action of differential selective pressures acting on these genes becomes the most plausible scenario. |
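The argument that heterozygosity reflects allele frequencies rather than the number of alleles can be illustrated with a minimal sketch of expected heterozygosity, H = 1 − Σp². Both frequency vectors below are hypothetical: the first assumes eight alleles with one allele at 99% (the share reported for the CD209 7-repeat allele) and the remaining 1% split evenly, which is an assumption; the second assumes seven alleles at arbitrary intermediate frequencies and is not the observed CD209L distribution.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) for allele frequencies p_i."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Hypothetical: eight alleles, one at 99%, the other seven sharing 1% evenly.
skewed_eight_alleles = [0.99] + [0.01 / 7] * 7

# Hypothetical: seven alleles, several at intermediate frequencies.
spread_seven_alleles = [0.55, 0.30, 0.10, 0.02, 0.01, 0.01, 0.01]

if __name__ == "__main__":
    print(f"8 alleles, skewed: H = {expected_heterozygosity(skewed_eight_alleles):.3f}")
    print(f"7 alleles, spread: H = {expected_heterozygosity(spread_seven_alleles):.3f}")
```

Under these assumed frequencies, the locus with more alleles has the far lower heterozygosity, which is why the allele count alone cannot account for the contrast between the two genes.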
title | CD209: The Signature of a Functional Constraint |
p | For CD209, not only nucleotide diversity but also intercontinental FST values (0.15) were in conformity with previous worldwide estimates (Harpending and Rogers 2000; Akey et al. 2002; Cavalli-Sforza and Feldman 2003). Among frequency-spectrum–based tests, only Fay and Wu's H test detected an excess of high-frequency derived alleles for the African and East Asian samples, a picture that may be interpreted as the result of a selective sweep. However, the significantly negative value observed in Africa is, again, exclusively due to the presence of cluster A, since 22 of the 35 fixed SNPs distinguishing it from cluster B corresponded to the derived allelic state in the latter cluster. Because cluster B accounts for 85% of the African variability, a clear excess of high-frequency derived alleles was observed. The extent to which the presence of this cluster is due to either natural selection or population structure will be discussed in detail below. For East Asia, the significance of the H test is also questionable when accounting for the confounding effects of demography. Indeed, when we plotted our H value against the empirical distribution of 132 H values from non-African populations (Akey et al. 2004), the East Asian P value became nonsignificant (P=.36). This observation reinforces the idea that the H test is particularly sensitive to past bottlenecks and/or population subdivision (Przeworski 2002). Thus, regarding the global levels of sequence diversity, the CD209 locus appears to evolve neutrally. Nevertheless, when we focused our analyses at the protein level, signs of natural selection were uncovered. Indeed, the McDonald-Kreitman test rejected neutrality for this gene because of a clear excess of polymorphic synonymous sites (i.e., a lack of nonsynonymous variants).
In addition, when the number of synonymous sites (146) versus nonsynonymous sites (499) was compared with the observed number of synonymous (5) versus nonsynonymous (0) mutations, we detected a significant lack of nonsynonymous mutations (two-tailed Fisher exact test, P=6.3×10⁻⁴). These observations point to a strong selective constraint acting on CD209 that prevents the accumulation of amino acid replacements over time. |
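The Fisher exact comparison described above can be reproduced approximately with a minimal sketch. The arrangement of the 2×2 table (observed mutations in one row, site counts in the other) is an assumption, since the exact layout is not stated; under that assumption the function returns a P value of the same order as the one reported. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def pmf(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = pmf(a)
    return sum(
        pmf(x)
        for x in range(lowest, highest + 1)
        if pmf(x) <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Assumed layout: row 1 = observed mutations (synonymous, nonsynonymous),
    # row 2 = number of sites (synonymous, nonsynonymous).
    p_value = fisher_exact_two_tailed(5, 0, 146, 499)
    print(f"two-tailed P = {p_value:.2e}")
```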
p | Further support for a functional constraint in CD209 comes from the patterns of diversity observed in the neck region. In contrast to CD209L, virtually no variation was observed at CD209 (fig. 5A), with the 7-repeat allele accounting for 99% of the total variability. Moreover, the low levels of heterozygosity observed resulted in a consistent rejection of mutation-drift equilibrium in almost all geographical regions (table 5). The probability of finding such a low heterozygosity value, given the overall number of alleles observed, was estimated to be <0.2%, independent of the mutational model considered (table 5). Thus, the fact that no alleles other than the 7-repeat allele have increased in frequency, together with recent studies addressing the functional consequences of repeat-number variation in this region (Bernhard et al. 2004; Feinberg et al. 2005), strongly suggests a clearly reduced fitness of any allele other than the 7-repeat allele. Interestingly, it has recently been shown that a protein with two fewer repeats (a 5-repeat allele) undergoes partial dissociation of the final tetramer, whereas a protein with <5 repeats exhibits a dramatic reduction in overall stability (Feinberg et al. 2005), with all these differences having a direct impact on the quality of ligand-binding functions (Bernhard et al. 2004). Taken together, the patterns of diversity observed at CD209 clearly point to a strong functional constraint acting on this gene and further support the proposed crucial role of this lectin in pathogen recognition and in the early steps of the immune response (Geijtenbeek et al. 2000b, 2004). |
title | CD209L: Relaxation of the Functional Constraint or Balancing Selection? |
p | In clear contrast to its homologue, CD209L presented extremely elevated nucleotide-diversity levels. High levels of diversity can result either from a relaxation of the functional constraint, which allows the stochastic accumulation of new mutations, or from the action of balancing selection, which maintains over time two or more functionally different alleles (and all linked variation) at intermediate frequencies. Several lines of evidence lend support to the selective hypothesis. First, if CD209L nucleotide diversity has been driven by the action of balancing selection, population-genetics relationships would have been altered accordingly. In this context, diversity studies in neutral, or presumably neutral, regions of the genome, such as the Y chromosome (Underhill et al. 2000; Hammer et al. 2001; Jobling and Tyler-Smith 2003), mtDNA (Wallace et al. 1999; Ingman et al. 2000; Mishmar et al. 2003), Alu insertions (Watkins et al. 2001), and some autosomal genes (Stephens et al. 2001; Akey et al. 2004), showed that African populations are genetically more diverse than are non-Africans, an observation generally interpreted as support for the “Out of Africa” model for the origin of modern humans (Lewin 1987). For CD209L, even though we observed 1.5 times more segregating sites in African than in non-African populations, as indicated by the higher θW value found in Africa, similar values of nucleotide diversity were detected in the three groups, with Europeans presenting even higher π values than do Africans. This unusual scenario, which is at odds with neutral expectations, has already been described for other regions of the genome, such as the β-globin gene and the 5′ cis-regulatory region of CCR5, for which the action of balancing selection has been convincingly proposed (Harding et al. 1997; Bamshad et al. 2002).
Second, balancing selection tends to increase within-population diversity while decreasing FST, compared with neutrally evolving loci (Cavalli-Sforza 1966; Harpending and Rogers 2000; Akey et al. 2002; Bamshad and Wooding 2003; Cavalli-Sforza and Feldman 2003). Indeed, our data are compatible with these predictions, since the 5% FST value observed for CD209L is threefold lower than that estimated for CD209 (15%) and is similar to that found, for example, for the bitter-taste receptor gene (5.6%), for which there is compelling evidence of balancing selection (Wooding et al. 2004). Third, results of our Tajima's D analysis were significantly positive for European and East Asian populations, because of the skew of the CD209L frequency spectrum toward an excess of intermediate-frequency alleles (table 2), a pattern that further supports the action of balancing selection. However, since the null model used to assess significance makes unrealistic assumptions about past population demography (i.e., constant population sizes), the rejection of the standard neutral model cannot be interpreted as unambiguous evidence of selection. Indeed, the observation that only non-African populations showed a significant departure from neutrality raises the question of whether these patterns could have resulted instead from the bottleneck that occurred during the Out of Africa exodus. A way to circumvent this conundrum is to take into account the fact that demography affects the whole genome equally, whereas selection directs its effects toward specific loci. Thus, to correct for the confounding effects of demography, we plotted our results against the empirical distributions of Akey et al. (2004) for Tajima's D statistics. Our values remained significant for CD209L, which therefore reinforces the idea that the pattern observed is unlikely to be the sole result of demography. |
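p | The contrast drawn above between θw (based on the number of segregating sites) and π (based on pairwise differences) is what Tajima's D summarizes. The following is a minimal sketch of the standard estimator, not the software used in the study; the function name `tajimas_d` and the toy haplotypes are hypothetical, and θw and π are computed per locus rather than per site. |

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Return (theta_w, pi, D) for aligned, equal-length haplotype strings.

    theta_w and pi are per-locus values (not divided by sequence length).
    """
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 sequences for a defined variance")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    # pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    theta_w = seg_sites / a1  # Watterson's estimator
    d = (pi - theta_w) / sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
    return theta_w, pi, d


# Toy data (hypothetical): two balanced lineages versus singleton-only variation.
balanced = ["AAAA", "AAAA", "TTTT", "TTTT"]
singletons = ["TAAA", "ATAA", "AATA", "AAAT"]

if __name__ == "__main__":
    print("balanced:", tajimas_d(balanced))
    print("singletons:", tajimas_d(singletons))
```

p | In the toy data, alleles at intermediate frequency push π above θw and give a positive D, whereas an excess of singletons gives a negative D. Significance would still have to be assessed against a null distribution, as discussed in the text. |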
p | Last, if the patterns of variation in CD209L represent the molecular signature of balancing selection, at least in non-Africans, then a functional target of such a selective regime is needed. In this context, the neck region constitutes an excellent candidate, since it plays a major mediating role in the orientation and flexibility of the carbohydrate-recognition domain. Since this domain is directly involved in pathogen recognition, neck-region length variation has important consequences for the pathogen-binding properties of these lectins (Mitchell et al. 2001; Bernhard et al. 2004; Feinberg et al. 2005). In perfect agreement with the results of our sequence-based data set, higher diversity in repeat variation was observed in the neck region among non-African populations (Native Americans excepted). Out of Africa, at least three alleles account for most population diversity, whereas, in Africa, the 6- and 7-repeat alleles alone account for 96% of the global variability (fig. 5B). Again, the higher diversity observed out of Africa could be due to a higher level of relaxation of the functional constraint of the neck region in non-African compared with African populations, which would lead to a random accumulation of proteins with varying neck-region lengths among non-Africans. Conversely, these patterns could also be explained by the action of balancing selection in non-Africans and could therefore point to the neck region as the functional target of such a selective regime. To evaluate the plausibility of these two conflicting scenarios, we compared the variation in the CD209L neck region with that inferred from 377 neutral autosomal microsatellites typed elsewhere for the same population panel (Rosenberg et al. 2002). We reasoned that if CD209L diversity has been shaped only by demography (i.e., bottleneck out of Africa), the distribution of genetic variance at different hierarchical levels should be comparable to that inferred through the neutral markers. 
On the other hand, if selection has driven the CD209L neck-region diversity, population-genetics distances would be influenced accordingly and would therefore differ from neutral expectations. Indeed, the AMOVA values inferred for CD209L fell systematically outside the 95% CI defined for the microsatellite data set (table 6). We observed that populations within Europe, Asia, the Middle East, and Oceania exhibited lower-than-expected diversity among populations within the same region. A reduction of genetic distances between populations is expected under balancing selection; therefore, the results from the CD209L neck region favor, once again, the action of this selective regime in most non-African populations, to the detriment of the neutral hypothesis. One may argue that the differences in the proportions of genetic variance between our data and those of Rosenberg et al. (2002) could be due to differences in the pace of mutation between microsatellite loci and our neck repeat region, which could be considered a “coding minisatellite.” However, under neutrality, differences in mutation rate should have a similar and proportional effect in all population comparisons and should influence all values with a similar tendency (i.e., higher or lower values). Indeed, this is not the case: populations within Europe, the Middle East, Central/South Asia, East Asia, and Oceania turned out to be genetically closer than expected, whereas populations within Africa and the Americas exhibited the opposite pattern (table 6), which makes it highly unlikely that mutation-rate differences influenced our conclusions. Table 6 AMOVA for the Neck Region of CD209L AMOVA Value (95% CI) Inferred for CD209L^b Sample^a No. of Regions No. 
of Populations Within Populations Among Populations within Regions Among Regions World 7 52 90.4 (93.8–94.3) 2.1 (2.3–2.5) 7.57 (3.3–3.9) Africa 1 6 93.9 (96.7–97.1) 6.1 (2.9–3.3) Eurasia 3 21 97.0 (98.2–98.4) .2 (1.1–1.3) 2.8 (.4–.6) Europe 1 8 99.5 (99.1–99.4) .5 (0.6–0.9) Middle-East 1 4 100 (98.6–98.8) 0 (1.2–1.4) Central/South Asia 1 9 99.5 (98.5–98.8) .5 (1.2–1.5) East Asia 1 18 99.3 (98.6–98.9) .7 (1.1–1.4) Oceania 1 2 96.0 (92.8–94.3) 4.0 (5.7–7.2) America 1 5 86.7 (87.7–89) 13.3 (11.0–12.3) Note.— No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. a Populations are grouped as described by Rosenberg et al. (2002). b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). |
table-wrap | Table 6 AMOVA for the Neck Region of CD209L AMOVA Value (95% CI) Inferred forCD209Lb Samplea No. of Regions No. of Populations Within Populations Among Populations within Regions Among Regions World 7 52 90.4 (93.8–94.3) 2.1 (2.3–2.5) 7.57 (3.3–3.9) Africa 1 6 93.9 (96.7–97.1) 6.1 (2.9–3.3) Eurasia 3 21 97.0 (98.2–98.4) .2 (1.1–1.3) 2.8 (.4–.6) Europe 1 8 99.5 (99.1–99.4) .5 (0.6–0.9) Middle-East 1 4 100 (98.6–98.8) 0 (1.2–1.4) Central/South Asia 1 9 99.5 (98.5–98.8) .5 (1.2–1.5) East Asia 1 18 99.3 (98.6–98.9) .7 (1.1–1.4) Oceania 1 2 96.0 (92.8–94.3) 4.0 (5.7–7.2) America 1 5 86.7 (87.7–89) 13.3 (11.0–12.3) Note.— No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. a Populations are grouped as described by Rosenberg et al. (2002). b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). |
label | Table 6 |
caption | AMOVA for the Neck Region of CD209L |
p | AMOVA for the Neck Region of CD209L |
table | AMOVA Value (95% CI) Inferred forCD209Lb Samplea No. of Regions No. of Populations Within Populations Among Populations within Regions Among Regions World 7 52 90.4 (93.8–94.3) 2.1 (2.3–2.5) 7.57 (3.3–3.9) Africa 1 6 93.9 (96.7–97.1) 6.1 (2.9–3.3) Eurasia 3 21 97.0 (98.2–98.4) .2 (1.1–1.3) 2.8 (.4–.6) Europe 1 8 99.5 (99.1–99.4) .5 (0.6–0.9) Middle-East 1 4 100 (98.6–98.8) 0 (1.2–1.4) Central/South Asia 1 9 99.5 (98.5–98.8) .5 (1.2–1.5) East Asia 1 18 99.3 (98.6–98.9) .7 (1.1–1.4) Oceania 1 2 96.0 (92.8–94.3) 4.0 (5.7–7.2) America 1 5 86.7 (87.7–89) 13.3 (11.0–12.3) |
tr | AMOVA Value (95% CI) Inferred for CD209L^b |
th | AMOVA Value (95% CI) Inferred for CD209L^b |
tr | Sample^a No. of Regions No. of Populations Within Populations Among Populations within Regions Among Regions |
th | Sample^a |
th | No. of Regions |
th | No. of Populations |
th | Within Populations |
th | Among Populations within Regions |
th | Among Regions |
tr | World 7 52 90.4 (93.8–94.3) 2.1 (2.3–2.5) 7.57 (3.3–3.9) |
td | World |
td | 7 |
td | 52 |
td | 90.4 (93.8–94.3) |
td | 2.1 (2.3–2.5) |
td | 7.57 (3.3–3.9) |
tr | Africa 1 6 93.9 (96.7–97.1) 6.1 (2.9–3.3) |
td | Africa |
td | 1 |
td | 6 |
td | 93.9 (96.7–97.1) |
td | 6.1 (2.9–3.3) |
tr | Eurasia 3 21 97.0 (98.2–98.4) .2 (1.1–1.3) 2.8 (.4–.6) |
td | Eurasia |
td | 3 |
td | 21 |
td | 97.0 (98.2–98.4) |
td | .2 (1.1–1.3) |
td | 2.8 (.4–.6) |
tr | Europe 1 8 99.5 (99.1–99.4) .5 (0.6–0.9) |
td | Europe |
td | 1 |
td | 8 |
td | 99.5 (99.1–99.4) |
td | .5 (0.6–0.9) |
tr | Middle-East 1 4 100 (98.6–98.8) 0 (1.2–1.4) |
td | Middle-East |
td | 1 |
td | 4 |
td | 100 (98.6–98.8) |
td | 0 (1.2–1.4) |
tr | Central/South Asia 1 9 99.5 (98.5–98.8) .5 (1.2–1.5) |
td | Central/South Asia |
td | 1 |
td | 9 |
td | 99.5 (98.5–98.8) |
td | .5 (1.2–1.5) |
tr | East Asia 1 18 99.3 (98.6–98.9) .7 (1.1–1.4) |
td | East Asia |
td | 1 |
td | 18 |
td | 99.3 (98.6–98.9) |
td | .7 (1.1–1.4) |
tr | Oceania 1 2 96.0 (92.8–94.3) 4.0 (5.7–7.2) |
td | Oceania |
td | 1 |
td | 2 |
td | 96.0 (92.8–94.3) |
td | 4.0 (5.7–7.2) |
tr | America 1 5 86.7 (87.7–89) 13.3 (11.0–12.3) |
td | America |
td | 1 |
td | 5 |
td | 86.7 (87.7–89) |
td | 13.3 (11.0–12.3) |
table-wrap-foot | Note.— No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. |
footnote | Note.— |
p | Note.— |
footnote | No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. |
p | No comparisons were performed for the CD209 neck region, because virtually no variation was observed at that locus. |
table-wrap-foot | a Populations are grouped as described by Rosenberg et al. (2002). |
footnote | a Populations are grouped as described by Rosenberg et al. (2002). |
label | a |
p | Populations are grouped as described by Rosenberg et al. (2002). |
table-wrap-foot | b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). |
footnote | b AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). |
label | b |
p | AMOVA values are from our CD209L study; 95% CIs are defined from 377 autosomal microsatellites in the same population panel (Rosenberg et al. 2002). |
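p | The comparison summarized in table 6 can be sketched as a simple check of whether each observed CD209L value falls below, within, or above the microsatellite-based 95% CI. This is only an illustration: the numbers are copied from the "Among Populations within Regions" column for the single-region samples, and the names `AMONG_POPULATIONS` and `classify` are hypothetical. |

```python
# (observed CD209L value, CI lower bound, CI upper bound), in percent,
# taken from the "Among Populations within Regions" column of table 6.
AMONG_POPULATIONS = {
    "Africa": (6.1, 2.9, 3.3),
    "Europe": (0.5, 0.6, 0.9),
    "Middle-East": (0.0, 1.2, 1.4),
    "Central/South Asia": (0.5, 1.2, 1.5),
    "East Asia": (0.7, 1.1, 1.4),
    "Oceania": (4.0, 5.7, 7.2),
    "America": (13.3, 11.0, 12.3),
}


def classify(observed, lower, upper):
    """Place an observed variance component relative to a neutral CI."""
    if observed < lower:
        return "below"
    if observed > upper:
        return "above"
    return "within"


# Direction of the departure from the neutral interval, per region.
RESULTS = {
    region: classify(*values) for region, values in AMONG_POPULATIONS.items()
}

if __name__ == "__main__":
    for region, status in RESULTS.items():
        print(f"{region}: {status} the microsatellite 95% CI")
```

p | Under a pure mutation-rate difference, all regions would be expected to shift in the same direction; the mixed "below" and "above" outcomes are the pattern the text relies on. |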
p | Taken together, the integration of the results from levels of nucleotide and amino acid diversity, neutrality tests, population-genetics distances, and neck-region length variation in CD209 and CD209L clearly points to a situation in which CD209 has been under a strong selective constraint that prevents accumulation of any amino acid changes over time, whereas CD209L variability has most likely been driven by the action of balancing selection, at least in non-African populations. |
sec | The Footprints of Ancestral Population Diversity In apparent contrast to the strong selective constraint described for CD209, we observed an unusual excess of diversity, with 35 fixed differences separating the two basal branches of the gene tree (fig. 6). In addition, we estimated a TMRCA of 2.8±0.22 MYA, a time that places the most recent common ancestor of CD209 back in the Pliocene epoch, before the estimated time for the origins of the genus Homo ∼1.9 MYA (Wood 1996; Wood and Collard 1999). A number of studies have already reported loci that present unusually deep coalescent times (Harris and Hey 1999; Zhao et al. 2000; Webster et al. 2003; Garrigan et al. 2005a, 2005b), but our estimation for CD209 remains one of the deepest TMRCA values yet reported (Excoffier 2002). The probability of finding such a deep coalescence time under a scenario of a random-mating population was estimated, through a coalescent process (Laval and Excoffier 2004), to be very low (P=.018) (see fig. 7). In addition to the unexpected antiquity of the CD209 locus, we observed a peculiar tree topology made of two highly divergent and frequency-unbalanced lineages, cluster A containing only 2 internal haplotypes and cluster B comprising the remaining 23 (fig. 6). Figure 7 Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209. Different hypotheses can account for such elongated and divergent haplotype patterns. Indeed, the high levels of nucleotide identity between CD209 and CD209L could have led to gene conversion between the two genes, an event that would explain the outlier position of cluster A in the context of CD209 phylogeny. We reasoned that if gene conversion has occurred, we expect that the derived alleles distinguishing clusters A and B in CD209 would correspond to the allelic state observed in their homologous positions in CD209L. Of all positions, only four fit this criterion. 
In addition, these positions were not physically clustered, which therefore excludes a major gene-conversion event as the explanation of the divergent CD209 phylogeny. Two other circumstances may be responsible for the topology and the time depth of the CD209 gene tree: long-standing balancing selection or ancient population structure, with Africa, in both cases, being the arena of such events (i.e., cluster A is restricted to Africa). Several lines of evidence argue against the balancing-selection hypothesis. First, under this selective regime, one would expect that Tajima's D test would also point in this direction by yielding significantly positive values, which is not the case (table 2). Second, such a long-standing balancing selection in Africa would have entailed a number of recombinant haplotypes between clusters A and B, which, again, is not the case, as illustrated by the high LD levels at CD209 (fig. 3). Third, a claim of balancing selection at this locus must imply a functional difference between the two balanced alleles. Indeed, three nonsynonymous mutations, situated in the neck region, separate clusters A and B, and they could correspond to the alleles under selection. But, if the neck region is the target of selection, it is more likely that the balanced alleles would correspond to different numbers of repeats rather than to point nucleotide variation within each tract, as observed for CD209L and suggested by functional studies (Bernhard et al. 2004; Feinberg et al. 2005). Since no variation in the number of repeats was detected between the two clusters, we predict that there are no major functional differences between the two lineages. Taken together, maintenance of ancient lineages by balancing selection does not seem to be responsible for the observed haplotype divergence. In this view, the patterns observed are best explained by an ancestral population structure on the African continent. 
Indeed, several studies have already proposed that African populations must have been more strongly subdivided and isolated than non-African ones (Harris and Hey 1999; Labuda et al. 2000; Excoffier 2002; Goldstein and Chikhi 2002; Harding and McVean 2004; Satta and Takahata 2004; Garrigan et al. 2005a). In particular, a recent study of the Xp21.1 locus presented convincing statistical evidence that supports the hypothesis that our species does not descend from a single, historically panmictic population (Garrigan et al. 2005a). The divergent haplotype pattern observed at the Xp21.1 locus prompted those authors to explain their data under the isolation-and-admixture (IAA) model and/or a metapopulation model (Harding and McVean 2004; Wakeley 2004). Indeed, as observed for CD209, under an IAA model, the two basal branches are expected to be longer than those under a Wright-Fisher model, depending on the length of time subpopulations spent in isolation. The extent to which the IAA model fits the data depends on the number of mutations, referred to as “congruent sites,” occurring in the two basal branches of the genealogy. For Xp21.1, 10 congruent sites over 24 polymorphisms were observed (i.e., ∼42% of the total number of sites). We applied the same approach to CD209 and obtained a very similar percentage of ∼45%, in good accordance with the IAA model. Our observations, together with a number of autosomal diversity studies, show that modern human diversity appears to have kept genetic traces of admixture among archaic hominid populations. However, a number of questions remain unanswered, such as the time when these admixture events occurred (i.e., before or after the appearance of anatomically modern humans), the precise quantitative contribution of ancient genetic material to our modern gene pool, and the geographic provenance of these genetic vestiges. |
title | The Footprints of Ancestral Population Diversity |
p | In apparent contrast to the strong selective constraint described for CD209, we observed an unusual excess of diversity, with 35 fixed differences separating the two basal branches of the gene tree (fig. 6). In addition, we estimated a TMRCA of 2.8±0.22 MYA, a time that places the most recent common ancestor of CD209 back in the Pliocene epoch, before the estimated time for the origins of the genus Homo ∼1.9 MYA (Wood 1996; Wood and Collard 1999). A number of studies have already reported loci that present unusually deep coalescent times (Harris and Hey 1999; Zhao et al. 2000; Webster et al. 2003; Garrigan et al. 2005a, 2005b), but our estimation for CD209 remains one of the deepest TMRCA values yet reported (Excoffier 2002). The probability of finding such a deep coalescence time under a scenario of a random-mating population was estimated, through a coalescent process (Laval and Excoffier 2004), to be very low (P=.018) (see fig. 7). In addition to the unexpected antiquity of the CD209 locus, we observed a peculiar tree topology made of two highly divergent and frequency-unbalanced lineages, cluster A containing only 2 internal haplotypes and cluster B comprising the remaining 23 (fig. 6). Figure 7 Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209. |
figure | Figure 7 Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209. |
label | Figure 7 |
caption | Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209. |
p | Coalescent-based simulations (2×10^4) of the expected TMRCA distribution of CD209. |
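p | The idea behind such a test can be sketched with a plain Kingman coalescent for a single random-mating population of constant size. This is not the SIMCOAL2 procedure used for figure 7, and it is not expected to reproduce P=.018; the sample size, effective population size, and generation time below are hypothetical placeholders chosen only to make the example run. |

```python
import random


def simulate_tmrca(n_samples, n_reps, rng):
    """Simulate TMRCA values in coalescent units of 2*Ne generations.

    With k lineages remaining, the waiting time to the next coalescence
    is exponential with rate k*(k-1)/2 (standard Kingman coalescent).
    """
    times = []
    for _ in range(n_reps):
        t = 0.0
        for k in range(n_samples, 1, -1):
            t += rng.expovariate(k * (k - 1) / 2.0)
        times.append(t)
    return times


def coalescent_units_to_years(t, ne, generation_years):
    """Convert coalescent units (2*Ne generations, diploid Ne) to years."""
    return t * 2 * ne * generation_years


def tail_probability(times, threshold):
    """Fraction of simulated TMRCA values at least as deep as threshold."""
    return sum(1 for t in times if t >= threshold) / len(times)


if __name__ == "__main__":
    # All parameter values are assumptions for illustration only.
    ne = 10_000            # hypothetical effective population size
    generation_years = 25  # hypothetical generation time
    n_samples = 50         # hypothetical number of sampled chromosomes
    observed_years = 2.8e6  # TMRCA estimate quoted in the text

    rng = random.Random(42)
    sims = simulate_tmrca(n_samples, 20_000, rng)
    threshold = observed_years / (2 * ne * generation_years)
    print("empirical P(TMRCA >= observed):", tail_probability(sims, threshold))
```

p | The resulting tail probability depends strongly on the assumed effective population size and generation time, which is why the values here should be read as placeholders rather than as estimates. |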
p | Different hypotheses can account for such elongated and divergent haplotype patterns. Indeed, the high levels of nucleotide identity between CD209 and CD209L could have led to gene conversion between the two genes, an event that would explain the outlier position of cluster A in the context of CD209 phylogeny. We reasoned that if gene conversion has occurred, we expect that the derived alleles distinguishing clusters A and B in CD209 would correspond to the allelic state observed in their homologous positions in CD209L. Of all positions, only four fit this criterion. In addition, these positions were not physically clustered, which therefore excludes a major gene-conversion event as the explanation of the divergent CD209 phylogeny. |
p | Two other circumstances may be responsible for the topology and the time depth of the CD209 gene tree: long-standing balancing selection or ancient population structure, with Africa, in both cases, being the arena of such events (i.e., cluster A is restricted to Africa). Several lines of evidence argue against the balancing-selection hypothesis. First, under this selective regime, one would expect that Tajima's D test would also point in this direction by yielding significantly positive values, which is not the case (table 2). Second, such a long-standing balancing selection in Africa would have entailed a number of recombinant haplotypes between clusters A and B, which, again, is not the case, as illustrated by the high LD levels at CD209 (fig. 3). Third, a claim of balancing selection at this locus must imply a functional difference between the two balanced alleles. Indeed, three nonsynonymous mutations, situated in the neck region, separate clusters A and B, and they could correspond to the alleles under selection. But, if the neck region is the target of selection, it is more likely that the balanced alleles would correspond to different numbers of repeats rather than to point nucleotide variation within each tract, as observed for CD209L and suggested by functional studies (Bernhard et al. 2004; Feinberg et al. 2005). Since no variation in the number of repeats was detected between the two clusters, we predict that there are no major functional differences between the two lineages. Taken together, maintenance of ancient lineages by balancing selection does not seem to be responsible for the observed haplotype divergence. In this view, the patterns observed are best explained by an ancestral population structure on the African continent. Indeed, several studies have already proposed that African populations must have been more strongly subdivided and isolated than non-African ones (Harris and Hey 1999; Labuda et al. 
2000; Excoffier 2002; Goldstein and Chikhi 2002; Harding and McVean 2004; Satta and Takahata 2004; Garrigan et al. 2005a). In particular, a recent study of the Xp21.1 locus presented convincing statistical evidence that supports the hypothesis that our species does not descend from a single, historically panmictic population (Garrigan et al. 2005a). The divergent haplotype pattern observed at the Xp21.1 locus prompted those authors to explain their data under the isolation-and-admixture (IAA) model and/or a metapopulation model (Harding and McVean 2004; Wakeley 2004). Indeed, as observed for CD209, under an IAA model, the two basal branches are expected to be longer than those under a Wright-Fisher model, depending on the length of time subpopulations spent in isolation. The extent to which the IAA model fits the data depends on the number of mutations, referred to as “congruent sites,” occurring in the two basal branches of the genealogy. For Xp21.1, 10 congruent sites over 24 polymorphisms were observed (i.e., ∼42% of the total number of sites). We applied the same approach to CD209 and obtained a very similar percentage of ∼45%, in good accordance with the IAA model. Our observations, together with a number of autosomal diversity studies, show that modern human diversity appears to have kept genetic traces of admixture among archaic hominid populations. However, a number of questions remain unanswered, such as the time when these admixture events occurred (i.e., before or after the appearance of anatomically modern humans), the precise quantitative contribution of ancient genetic material to our modern gene pool, and the geographic provenance of these genetic vestiges. |
sec | Conclusions The need for continuous evolution in both the human host and the pathogens is predicted by the Red Queen hypothesis (Van Valen 1973; Bell 1982), in reference to the remark of the Red Queen to Alice in Through the Looking Glass (Carroll 1872): “Now, here, you see, it takes all the running you can do, to keep in the same place.” This metaphor provides a conceptual framework for understanding how interactions between the two species lead to constant natural selection for adaptation and counteradaptation. In this context, one feature exploited by the host immunity genes to increase their defense potential is gene duplication: one duplicate retains the currently useful function of the encoded protein, while its twin is liberated to mutate and possibly acquire novel functions (Ohno 1970; Trowsdale and Parham 2004). The lectins CD209 and CD209L represent a prototypic model of a duplicated progeny of ancestral genes that interact with a vast spectrum of pathogens. Our results clearly indicate that these duplicated genes have evolved, and may still be evolving, under completely different evolutionary pressures. Whereas one, CD209, shows signals of strong conservation, its paralogue, CD209L, exhibits an excess of sequence diversity compatible with the action of balancing selection. In addition, the strong contrast observed in length variation of the neck region between the two genes may have important consequences in medical genetics. In this context, association studies are now needed that correlate length variation of the neck region and susceptibility to infectious diseases whose etiological agents are known to interact with one (or both) of these lectins. 
More generally, our study has revealed that even a short segment of the human genome can help uncover an extraordinarily complex evolutionary history, including different pathogen pressures on host immunity genes, as well as traces of ancient population structure in the African continent. The coming years will certainly bring unprecedentedly large data sets of sequence diversity, genomewide and populationwide, with each genomic region possibly revealing a different aspect of human history. The integration of all these apparently independent pieces of the same reality will provide us with a much broader and more realistic view of the demographic history of the human species, as well as of human adaptation to the different environmental conditions imposed not only by pathogens but also by other major factors such as climate and nutritional resources. |
title | Conclusions |
p | The need for continuous evolution in both the human host and the pathogens is predicted by the Red Queen hypothesis (Van Valen 1973; Bell 1982), in reference to the remark of the Red Queen to Alice in Through the Looking Glass (Carroll 1872): “Now, here, you see, it takes all the running you can do, to keep in the same place.” This metaphor provides a conceptual framework for understanding how interactions between the two species lead to constant natural selection for adaptation and counteradaptation. In this context, one feature exploited by the host immunity genes to increase their defense potential is gene duplication: one duplicate retains the currently useful function of the encoded protein, while its twin is liberated to mutate and possibly acquire novel functions (Ohno 1970; Trowsdale and Parham 2004). The lectins CD209 and CD209L represent a prototypic model of a duplicated progeny of ancestral genes that interact with a vast spectrum of pathogens. Our results clearly indicate that these duplicated genes have evolved, and may still be evolving, under completely different evolutionary pressures. Whereas one, CD209, shows signals of strong conservation, its paralogue, CD209L, exhibits an excess of sequence diversity compatible with the action of balancing selection. In addition, the strong contrast observed in length variation of the neck region between the two genes may have important consequences in medical genetics. In this context, association studies are now needed that correlate length variation of the neck region and susceptibility to infectious diseases whose etiological agents are known to interact with one (or both) of these lectins. |
p | More generally, our study has revealed that even a short segment of the human genome can help uncover an extraordinarily complex evolutionary history, including different pathogen pressures on host immunity genes, as well as traces of ancient population structure in the African continent. The coming years will certainly bring unprecedentedly large data sets of sequence diversity, genomewide and populationwide, with each genomic region possibly revealing a different aspect of human history. The integration of all these apparently independent pieces of the same reality will provide us with a much broader and more realistic view of the demographic history of the human species, as well as of human adaptation to the different environmental conditions imposed not only by pathogens but also by other major factors such as climate and nutritional resources. |
back | Acknowledgments We warmly acknowledge Guillaume Laval for useful suggestions on the use of SIMCOAL software, Laurent Excoffier and Francesca Luca for stimulating discussions, and two reviewers for constructive comments on the first version of the manuscript. L.B.B. was supported by Fundação para a Ciência e a Tecnologia fellowship SFRH/BD/18580/2004. The URLs for data presented herein are as follows: Arlequin, http://lgb.unige.ch/arlequin/ BOTTLENECK, http://www.montpellier.inra.fr/CBGP/softwares/bottleneck/bottleneck.html Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/GOLD/ (for GOLD software) Centre National de Genotypage, http://software.cng.fr/ (for GENALYS software) DnaSP, http://www.ub.es/dnasp/ GENETREE Software, http://www.stats.ox.ac.uk/∼griff/software.html HGDP-CEPH Human Genome Diversity Cell Line Panel, http://www.cephb.fr/HGDP-CEPH-Panel/ Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for dendritic cell–specific ICAM-3 grabbing nonintegrin and liver/lymph node–specific ICAM-3 grabbing nonintegrin) Phase, http://www.stat.washington.edu/stephens/phase.html SIMCOAL2, http://cmpg.unibe.ch/software/simcoal2/ |
ack | Acknowledgments We warmly acknowledge Guillaume Laval for useful suggestions on the use of SIMCOAL software, Laurent Excoffier and Francesca Luca for stimulating discussions, and two reviewers for constructive comments on the first version of the manuscript. L.B.B. was supported by Fundação para a Ciência e a Tecnologia fellowship SFRH/BD/18580/2004. |
title | Acknowledgments |
p | We warmly acknowledge Guillaume Laval for useful suggestions on the use of SIMCOAL software, Laurent Excoffier and Francesca Luca for stimulating discussions, and two reviewers for constructive comments on the first version of the manuscript. L.B.B. was supported by Fundação para a Ciência e a Tecnologia fellowship SFRH/BD/18580/2004. |
appendix | The URLs for data presented herein are as follows: Arlequin, http://lgb.unige.ch/arlequin/ BOTTLENECK, http://www.montpellier.inra.fr/CBGP/softwares/bottleneck/bottleneck.html Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/GOLD/ (for GOLD software) Centre National de Genotypage, http://software.cng.fr/ (for GENALYS software) DnaSP, http://www.ub.es/dnasp/ GENETREE Software, http://www.stats.ox.ac.uk/∼griff/software.html HGDP-CEPH Human Genome Diversity Cell Line Panel, http://www.cephb.fr/HGDP-CEPH-Panel/ Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for dendritic cell–specific ICAM-3 grabbing nonintegrin and liver/lymph node–specific ICAM-3 grabbing nonintegrin) Phase, http://www.stat.washington.edu/stephens/phase.html SIMCOAL2, http://cmpg.unibe.ch/software/simcoal2/ |
sec | The URLs for data presented herein are as follows: Arlequin, http://lgb.unige.ch/arlequin/ BOTTLENECK, http://www.montpellier.inra.fr/CBGP/softwares/bottleneck/bottleneck.html Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/GOLD/ (for GOLD software) Centre National de Genotypage, http://software.cng.fr/ (for GENALYS software) DnaSP, http://www.ub.es/dnasp/ GENETREE Software, http://www.stats.ox.ac.uk/∼griff/software.html HGDP-CEPH Human Genome Diversity Cell Line Panel, http://www.cephb.fr/HGDP-CEPH-Panel/ Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for dendritic cell–specific ICAM-3 grabbing nonintegrin and liver/lymph node–specific ICAM-3 grabbing nonintegrin) Phase, http://www.stat.washington.edu/stephens/phase.html SIMCOAL2, http://cmpg.unibe.ch/software/simcoal2/ |
p | The URLs for data presented herein are as follows: |
p | Arlequin, http://lgb.unige.ch/arlequin/ |
p | BOTTLENECK, http://www.montpellier.inra.fr/CBGP/softwares/bottleneck/bottleneck.html |
p | Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/GOLD/ (for GOLD software) |
p | Centre National de Genotypage, http://software.cng.fr/ (for GENALYS software) |
p | DnaSP, http://www.ub.es/dnasp/ |
p | GENETREE Software, http://www.stats.ox.ac.uk/~griff/software.html
p | HGDP-CEPH Human Genome Diversity Cell Line Panel, http://www.cephb.fr/HGDP-CEPH-Panel/ |
p | Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for dendritic cell–specific ICAM-3 grabbing nonintegrin and liver/lymph node–specific ICAM-3 grabbing nonintegrin) |
p | Phase, http://www.stat.washington.edu/stephens/phase.html |
p | SIMCOAL2, http://cmpg.unibe.ch/software/simcoal2/ |