PMC:1271393 / 17587-40036
Annotations
2_test
{"project":"2_test","denotations":[{"id":"16252244-14574645-2049631","span":{"begin":9741,"end":9745},"obj":"14574645"},{"id":"16252244-15361935-2049632","span":{"begin":11719,"end":11723},"obj":"15361935"},{"id":"16252244-10975799-2049633","span":{"begin":13595,"end":13599},"obj":"10975799"},{"id":"16252244-11257134-2049634","span":{"begin":13739,"end":13743},"obj":"11257134"},{"id":"16252244-7800710-2049635","span":{"begin":20809,"end":20813},"obj":"7800710"},{"id":"16252244-14527305-2049636","span":{"begin":21995,"end":21999},"obj":"14527305"},{"id":"16252244-12433581-2049637","span":{"begin":22148,"end":22152},"obj":"12433581"}],"text":"Results\nWe determined sequence diversity in the CD209 and CD209L genes (fig. 1) as well as length variation of the neck region in 254 chromosomes originating from three major ethnic groups: sub-Saharan Africans, Europeans, and East Asians. In addition, the orthologous sequences were obtained in four chimpanzees, to infer the ancestral state at each site, to estimate the divergence between humans and chimpanzees, and to perform a number of interspecies neutrality tests.\n\nPatterns of Nucleotide and Haplotype Diversity in the CD209/CD209L Region\nFor CD209, we identified a total of 79 SNPs and 2 indels, including 5 nonsynonymous, 5 synonymous, and 71 noncoding variants. The five nonsynonymous SNPs were all located in the neck region (exon 4): SNPs 1839 (Arg→Gln), 1888 (Glu→Asp), and 1908 (Arg→Gln) achieved a frequency of ∼15%, and SNP 1970 (Leu→Val), a frequency of 6%. These mutations were restricted to the African sample. SNP 1472 (Ala→Thr) was observed as a singleton in an East-Asian individual. For CD209L, we identified 64 SNPs and 2 indels, including 4 nonsynonymous and 62 noncoding variants. The four nonsynonymous variants were located in different exons: SNP 141 (Thr→Ala) in exon 2, SNP 3476 (Asp→Asn) in exon 5, SNP 4268 (Thr→Ala) in exon 6, and SNP 5580 (Arg→Gln) in exon 7. 
All these mutations were singletons except SNP 3476, which presented high frequencies for its derived allele in all geographic regions: 97.6% in Africans, 57% in Europeans, and 77% in East Asians. All variable sites were in Hardy-Weinberg equilibrium for both CD209 and CD209L, after Bonferroni correction for multiple testing.\nThe allelic composition of CD209 and CD209L haplotypes and their frequency distribution in the three major ethnic groups is illustrated in figure 2 , along with the haplotype composed of the ancestral allelic state of each SNP inferred from chimpanzee data. For CD209, we identified 42 different haplotypes, with an overall heterozygosity of 84% (table 2 ). Three major haplotypes (H2, H29, and H40) accounted for ∼50% of the African variability, whereas they were at very low frequency (H2 at ∼5%) or absent (H29 and H40) in Europeans and East Asians (fig. 2A). In turn, the two haplotypes (H1 and H3) that accounted for 58% and 83% of the European and East Asian variability, respectively, were observed at very low frequency (H1 at 6%) or even absent (H3) in Africa. However, H3, which had a frequency of 36% and 20% in Europe and East Asia, respectively, is just a one-step mutation (SNP 871) from H2, the most frequent haplotype in the African sample. The most interesting observation of the CD209 haplotype variability was the presence of a highly divergent haplotype cluster. This cluster, which contains haplotypes 40–42 (referred to here as “cluster A”), differs from all other haplotypes (referred to here as “cluster B”) by 35 fixed positions (fig. 2A). Cluster A is Africa specific and is present at a frequency of ∼15%, whereas cluster B is present in the remaining African and all non-African samples. It is worth noting that three (SNPs 1839, 1888, and 1908) of the five nonsynonymous mutations identified for this gene are unique to cluster A. 
In all cases, these three mutations were segregating together, with the exception of one haplotype, H41, which does not contain the SNP 1839. Samples from cluster A are geographically widespread over the entire African continent (i.e., two San from Namibia, three Bantus from Gabon and two from South Africa, three Yorubans from Nigeria, and two Mandenka from Senegal). For CD209L, 74 different haplotypes were observed (fig. 2B), with an overall heterozygosity of 94% (table 2). Only one haplotype (H38) at a frequency of ∼15% was shared in the three continental regions.\nFigure 2 Inferred haplotypes for CD209 (A) and CD209L (B). The chimpanzee sequence was used to deduce the ancestral state at each position, except for the CD209L positions 1232, 1236, and 1240. For those polymorphisms, the ancestral state was considered to be the most frequent allele. Dark boxes correspond to the derived state at each position. The numbers on the right of the figure indicate the absolute frequency of each haplotype in the different populations studied. Repeat-number variation in the neck region of each gene is reported in the gray columns with the column heads “NR.” Indel polymorphisms are referred to as “1” for insertion and “0” for deletion.\nTable 2 Summary of Diversity Indexes and Sequence-Based Neutrality Tests in the Study Populations\nGene and Population No. of Chromosomes No. of Segregating Sites No. 
of Haplotypes HDa±SD πb ±SD θwc Tajima's D Fay and Wu's H\nCD209:\n African 82 70 26 91.8 ± 1.6 26 ± 3.8 25.3 −.05 −19.45d\n European 86 18 14 79.6 ± 3.0 6.4 ± .6 6.5 −.04 −.26\n East Asian 86 12 11 56.7 ± 5.5 3.3 ± .5 4.3 −.65 −3.82d\n Total 254 79 42 84.5 ± 1.6 13 ± 1.7 23.3\nCD209L:\n African 82 51 40 94.9 ± 1.2 16.1 ± .9 18.7 −.49 −1.52\n European 86 29 23 88.8 ± 1.9 17.7 ± 1.0 10.5 2.01e −.61\n East Asian 86 27 19 86.4 ± 1.8 16.0 ± .5 9.8 1.85d −.43\n Total 254 63 74 93.6 ± .7 17.7 ± .5 18.8\nNote.—\nThe values shown in bold italics correspond to significant values for both the coalescence simulation and the empirical distribution (see the “Material and Methods” section). The analyses considered a total of 5,500 and 5,391 nucleotides for CD209 and CD209L, respectively.\na HD = haplotype diversity (%).\nb Nucleotide diversity per base pair (×10−4).\nc Watterson's estimator per base pair (×10−4).\nd .02\u003cP≤.05.\ne P≤.02.\nTo assess the degree of population differentiation, if any, we computed Wright's F ST (Wright 1931), using haplotype frequencies. F ST estimates were significant (P\u003c.0001) for all population comparisons, indicating continental differentiation for both CD209 and CD209L. However, substantial differences were observed between the two genes: the overall F ST for CD209 among Africans, Europeans, and East-Asians was 0.15, whereas CD209L presented a threefold lower F ST value of 0.05. For both genes, the larger F ST values were observed between African and East Asian populations, with F ST values of 0.22 for CD209 and 0.07 for CD209L.\n\nLevels of Polymorphism and Divergence between Humans and Chimpanzees\nThe average nucleotide diversity (π) was strikingly different, both between the two genes and among populations (table 2). 
Globally, π values were three- to fivefold lower for CD209 (3–7×10−4) than for CD209L (∼16 × 10−4), except for African populations, for whom the CD209 π value was unusually high (26×10−4) because of the presence of the highly divergent cluster A. Indeed, when cluster A was excluded from the analysis, the African π value dropped to 8×10−4. To estimate the substitution rate of each region and evince possible mutational differences that could explain the strong contrast observed in nucleotide-diversity patterns, we determined the human-chimpanzee divergence for both genes. The average net number of differences between the two species was 77.3 substitutions (or 0.0157 substitutions per nucleotide) for CD209 and 90.6 substitutions (or 0.0171 substitutions per nucleotide) for CD209L. Since the human-chimpanzee speciation occurred 5 MYA, we obtained similar nucleotide-substitution rates per site per year (CD209, 1.57×10−9; CD209L, 1.70×10−9).\n\nLD\nTo assess the patterns of LD in the CD209/CD209L region, haplotypes for the entire genomic region were reconstructed using markers with an MAF of 10%. D′ measures among these markers were estimated for African and non-African populations independently; the graphical representation of LD levels is illustrated in figure 3 . Two distinct regions, which correspond to either CD209 or CD209L, showed strong LD and are separated by a boundary that corresponds to the intergenic region. For CD209, a block of intragenic LD was observed in both African and non-African populations. For the African sample, 89% of all pairwise comparisons indicated significant levels of LD, whereas, for non-Africans, all D′ pairwise comparisons were significant. The magnitude of intragenic recombination (and/or gene conversion) of CD209L was slightly higher than for CD209. 
Nevertheless, considerable and significant levels of LD were observed between sites: 83% of all LD pairwise comparisons were significant in the African group, and 99% were in the non-African sample. Overall, CD209 exhibited a blocklike structure in both groups, whereas CD209L presented lower—although mostly significant—LD levels, in particular among the non-African sample.\nFigure 3 Pairwise D′ LD plots in non-African and African populations. European and East Asian samples were plotted together as “non-Africans” because they showed similar levels of LD (data not shown). Red tags indicate the physical position of each SNP across the genomic region studied. Blue and green lines label the SNPs (MAF\u003e10%) used for CD209 and CD209L, respectively, in the LD plot. For CD209, 47 SNPs presented an MAF\u003e10% in the African sample and 5 in the non-African, whereas, for CD209L, 18 SNPs showed an MAF\u003e10% in Africans and 20 in non-Africans. The high prevalence of SNPs with MAF\u003e10% for CD209 in Africa is due to the presence of the highly divergent cluster A, which presents 35 diagnostic variants with a frequency of 15%.\nThe strong decay in LD observed in the intergenic region (fig. 3), which spans only ∼14 kb, suggests the occurrence of a number of recombination events. To test the hypothesis of a possible recombination hotspot situated within this region, recombination parameters across the entire CD209/CD209L region (∼26 kb) were computed for the three populations, by use of the recombination model implemented in Phase (v.2.1.1) (fig. 4 ). This model (Stephens and Donnelly 2003) estimates the position and relative intensity of the hotspot (λ) as compared with the background population recombination rate (ρ) (see the “Material and Methods” section). A λ value of 1 corresponds to absence of recombination-rate variation, whereas λ values \u003e1 indicate the presence of a hotspot. 
The model detected the occurrence of a hotspot in the intergenic region, with Africans presenting a λ of 18, whereas Europeans and East Asians exhibited λ values of 63 and 53, respectively (fig. 4). We estimated the posterior probabilities of a hotspot of any kind, Pr(λ\u003e1), and of at least 10 times the background recombination rate, Pr(λ\u003e10). Pr(λ\u003e1) was 100% for all population groups, and Pr(λ\u003e10) was 64% for Africans, 97% for Europeans, and 92% for East Asians. Thus, our data clearly indicate a relative increase of the recombination levels between the two genes, which suggests the occurrence of a hotspot of recombination, the magnitude of which varies among the major ethnic groups. However, our data do not include intergenic SNPs; therefore, the exact location and width of the recombination hotspot within the intergenic region remains unclear, since this observation would be consistent with either an intense narrow hotspot or a weaker but wider hotspot.\nFigure 4 Estimates of the hotspot intensity (λ) for Africans, Europeans, and East Asians. Estimates of the population recombination rate (ρ) for each population as well as the posterior probabilities of λ\u003e1 and λ\u003e10 are also reported in the key.\n\nNeutrality Tests\nThe identification of a strong decay in LD between CD209 and CD209L facilitated the interpretation of neutrality tests, because the noise introduced by hitchhiking effects between the genes is reduced. We applied Tajima's D and Fay and Wu's H tests to determine whether these statistics significantly deviated from expectations under neutrality, using both coalescent simulations and the empirical distribution obtained from Akey et al. (2004). Globally, Tajima's D test indicated different tendencies for the two genes (table 2). 
CD209 always yielded negative values for Tajima's D but never achieved significance to reject the hypothesis of neutrality, whereas CD209L yielded significantly positive values for non-African populations, with use of both coalescent simulations and the empirical distribution. For Fay and Wu's H test, the hypothesis of neutrality was rejected for CD209 in the African and East Asian samples (table 2).\nTo evaluate the selective pressures at the protein level, we performed two interspecies tests: K A/K S, which gives the ratio of nonsynonymous and synonymous changes between species, and the McDonald-Kreitman test, which tests the null hypothesis that the ratio of the number of fixed differences to polymorphisms is the same for both nonsynonymous and synonymous mutations. For the K A/K S test, CD209 and CD209L showed similar values, 0.34 and 0.37, respectively. For the McDonald-Kreitman test, the hypothesis of neutrality was rejected for only CD209, because of a clear lack of nonsynonymous polymorphic sites (table 3 ).\nTable 3 McDonald-Kreitman Test Results\nNo. of Substitutions andPValue for\nExonic Region Only Entire Sequencea\nGene and Type of Site Synonymous Nonsynonymous P Synonymous Nonsynonymous P\nCD209: .04 .009\n Fixed 4 5 51 5\n Polymorphic 6 0 86 0\nCD209L: .23 1\n Fixed 5 6 78 6\n Polymorphic 0 4 65 4\nNote.—\nThe highly variable exon 4 has been excluded from this analysis, because no ancestral state could be inferred. Significant P values are shown in bold italics.\na Mutations in introns are considered synonymous.\n\nNeck-Region Length Variation in Worldwide Populations\nThe identical genomic organization of CD209 and CD209L is extended to the neck region, which, in both genes, encodes a track of seven coding repeats of 23 aa each (fig. 1) (Soilleux et al. 2000). A previous study has shown that the length of the neck region of CD209L varied between individuals of European descent (Bashirova et al. 2001). 
To investigate the degree of polymorphism of the neck region in both CD209 and CD209L, we genotyped it in the entire HGDP-CEPH panel (1,064 individuals from 52 worldwide populations). Striking differences were observed between the two genes (see fig. 5 and table 4 for detailed allele frequencies in each population). For CD209, virtually no variation was observed, and the 7-repeat allele accounted for 99% of the total variability. Despite this limited variation, eight different alleles were observed, with an allele size range of 2–10 repeats, not including a 9-repeat allele. The geographic region that presented the highest variability was the Middle East, with five of the eight different alleles observed (fig. 5A and table 4). For CD209L, a completely different pattern emerged, with strong variation in allelic frequencies of different repeat numbers. Of the seven alleles observed (from 4–10-repeat allele size classes), the three most common overall were the 7- (57.42%), the 5- (23.92%), and the 6- (11.37%) repeat alleles. European, Asian, and Pacific populations presented a mosaic composition of different allelic classes, whereas 7- and 6-repeat alleles accounted for most (96%) of the African diversity (fig. 5B). The strong difference in the neck-region lengths between the two genes was consequently visible in the heterozygosity values: CD209 exhibited an overall heterozygosity of only 2%, whereas CD209L presented a value of 54% (table 5 table 5). Our results showed that the levels of heterozygosity observed at CD209 were considerably lower than expected, regardless of the mutation model considered (i.e., Infinite Site or Stepwise Mutation Models) (table 5). In strong contrast, although not statistically significant for individual populations, CD209L exhibited a pattern of an excess of heterozygosity in all populations.\nFigure 5 Geographical distribution of the neck-region repeat variation in CD209 (A) and CD209L (B). 
Population codes are (1) Algerians; (2) Mandenka; (3) Yoruba; (4) Biaka Pygmies; (5) Northeastern Bantu from Kenya; (6) Mbuti Pygmies; (7) San; (8) South African Bantu southeastern/southwestern; (9) French and Basque from France; (10) Italian composite from Bergamo, Tuscany, and Sardinia; (11) Orcadian; (12) Russians; (13) Adygei; (14) Middle Eastern composite sample of Druze, Palestinian, and Bedouin; (15) Yakut; (16) Pakistani composite sample; (17) Chinese composite sample; (18) Japanese; (19) Cambodian; (20) Papuan; (21) Melanesian; (22) Pima; (23) Maya; (24) Piapoco and Curripaco; (25) Surui; and (26) Karitiana. For populations 16 and 17, we have pooled the different Pakistani and Chinese individual populations, respectively. For population details of these two composite groups, see the HGDP-CEPH Web site.\nTable 4 Allele Relative Frequencies of Neck-Region Repeat Variation in CD209 and CD209L in Individual Populations\nCD209 CD209L\nRelative Frequency (%) by No. of Repeats Relative Frequency (%) by No. of Repeats\nLocation and Population Geographic Origin No. 
of Chromosomes 10 8 7 6 5 4 3 2 HZa 10 9 8 7 6 5 4 HZb\nAfrica: 254 .39 99.21 .39 .02 .39 62.20 33.86 3.54 .50\n Biaka Pygmies Central African Republic 72 100 65.28 30.56 4.17 .47\n Mbuti Pygmies Democratic Republic of Congo 30 100 43.33 56.67 .47\n Bantu, northeastern Kenya 24 100 50.00 37.50 12.50 .83\n San Namibia 14 100 35.71 64.29 .71\n Yoruban Nigeria 50 2.00 98.00 .04 2.00 78.00 20.00 .32\n Mandenkan Senegal 48 97.92 2.08 .04 66.67 29.17 4.17 .54\n Bantu, southeastern/southwestern South Africa 16 100 62.50 31.25 6.25 .50\nEurope: 322 99.69 .31 .01 1.86 43.17 14.91 33.54 6.52 .62\n French France 58 100 48.28 12.07 36.21 3.45 .55\n French (Basque) France 48 100 39.58 8.33 39.58 12.50 .50\n Sardinian Italy 72 100 1.39 31.94 22.22 34.72 9.72 .61\n North Italian Italy (Bergamo) 28 100 .00 46.43 21.43 28.57 3.57 .79\n Orcadian Orkney Islands 32 100 9.38 46.88 9.38 28.13 6.25 .69\n Russian Russia 50 100 2.00 48.00 12.00 34.00 4.00 .84\n Adygei Russian Caucasus 34 97.06 2.94 .06 2.94 50.00 17.65 26.47 2.94 .35\nMiddle East: 356 .28 97.19 1.97 .28 .28 .06 .84 .28 56.46 17.13 24.72 .56 .61\n Druze Israel (Carmel) 96 96.88 3.13 .06 1.04 1.04 53.13 21.88 22.92 .67\n Palestinian Israel (Central) 102 .98 99.02 .02 .98 56.86 14.71 27.45 .65\n Bedouin Israel (Negev) 98 96.94 3.06 .06 1.02 58.16 14.29 24.49 2.04 .51\n Mozabite Algeria (Mzab) 60 95.00 1.67 1.67 1.67 .1 58.33 18.33 23.33 .60\nCentral/South Asia: 420 .24 99.29 .24 .24 .01 3.81 .95 63.57 4.29 27.38 .52\n Pakistanib Pakistan 400 .25 99.25 .25 .25 .02 3.50 1.00 63.50 4.25 27.75 .52\n Uygur China 20 100 10.00 65.00 5.00 20.00 .50\nEast Asia: 482 .21 99.38 .21 .21 .01 11.83 .21 70.12 2.49 15.35 .47\n Cambodian Cambodia 22 100 18.18 68.18 4.55 9.09 .36\n Chinesec China 348 99.43 .29 .29 .01 12.07 .29 71.26 2.30 14.08 .45\n Japanese Japan 62 1.61 98.39 .03 6.45 62.90 3.23 27.42 .58\n Yakut Siberia 50 100 14.00 72.00 2.00 12.00 .48\nOceania: 78 100 3.85 26.92 30.77 21.79 16.67 .72\n Papuan New Guinea 34 100 41.18 29.41 
11.76 17.65 .65\n NAN Melanesian Bougainville 44 100 6.82 15.91 31.82 29.55 15.91 .77\nAmericas: 216 98.61 1.39 .03 8.80 43.98 47.22 .45\n Karitiana Brazil 48 100 4.17 56.25 39.58 .54\n Surui Brazil 42 92.86 7.14 .14 16.67 83.33 .33\n Piapoco and Curripaco Colombia 26 100 19.23 26.92 53.85 .46\n Pima Mexico 50 100 8.00 64.00 28.00 .36\n Mayan Mexico 50 100 16.00 44.00 40.00 .56\n Total 2,128 .05 .14 98.97 .47 .09 .09 .14 .05 .02 .14 5.73 .33 57.42 11.37 23.92 1.08 .54\na Heterozygosity values.\nb Pakistani populations include Balochi, Brahui, Makrani, Sindhi, Pathan, Burusho, Hazara, and Kalash.\nc Chinese populations include Han, Dai, Daur, Hezhen, Lahu, Miao, Orogen, She, Tujia, Tu, Xibo, Yi, Mongola, and Naxi.\nTable 5 Observed and Expected Heterozygosities for the Number of Repeats in the Neck Regions of CD209 and CD209L\nFindings for Neck Regions of\nCD209 CD209L\nHeterozygosity P Heterozygosity P\nPopulation Observed Expecteda ISMb SMMc Observed Expecteda ISMb SMMc\nAfrican 1.6 27.9 .030 .000 50 37 .328 .229\nEuropean .6 15.3 .158 .094 62 44 .179 .304\nMiddle Eastern 5.6 43.1 .018 .000 61 49 .299 .095\nCentral/South Asian 1.4 35.1 .003 .000 52 43 .387 .098\nEast Asian 1.2 34.5 .003 .000 47 42 .472 .054\nOceanian .0 … … … 72 53 .071 .337\nAmerican 2.8 16.3 .323 .205 45 29 .273 .440\nTotal sample 2.0 49.7 .002 .000 54 47 .405 .013\nNote.—\nWe presented only the expected heterozygosity under the infinite-site model, because no evidence for recurrent mutations were observed in our data, as suggested by the composite CD209L haplotypes that included the repeat variation (fig. 2), as well as by the median-joining networks (results not shown). 
Significant P values are shown in bold italics.\na Under the infinite-site model.\nb Probability of the observed heterozygosity under the infinite-site model.\nc Probability of the observed heterozygosity under the stepwise mutational model.\n\nTime of the Most Recent Common Ancestor for CD209\nThe low levels of intragenic recombination observed in CD209 allowed maximum-likelihood coalescent analysis (Griffiths and Tavare 1994) for estimation of the time scale of the origin and evolution of this gene. Since this method assumes an infinite-site model without recombination, the same analysis for CD209L was not conducted because of the substantial amount of recombinant haplotypes observed. For CD209, only 29 of the 254 chromosomes analyzed had to be excluded, as did a single segregating site (SNP 939). The resulting CD209 gene tree estimate, rooted with the chimpanzee sequence (i.e., the chimpanzee sequence was used to define ancestral/derived status of human mutations), is shown in figure 6 . The tree is partitioned into two deep branches that correspond to haplotype clusters A and B. African samples were observed in both sides of the deepest node of the tree (i.e., in both clusters A and B), whereas non-African samples are restricted to one branch of the tree (i.e., cluster B). The maximum-likelihood estimate of θ (θML) for CD209 was 8.4. On the basis of this θML value and the estimated mutation rate (1.54×10−4 per gene per generation), the effective population size (N e) was 13,636, a value comparable to most figures reported in the literature (for a review, see Tishkoff and Verrelli [2003]). The T MRCA of the CD209 tree was then estimated at 2.8±0.22 MYA, one of the oldest T MRCA values estimated so far in the human genome (Excoffier 2002).\nFigure 6 CD209 estimated gene tree. Time scale is in MYA. Mutations are represented as black dots and are named for their physical position along CD209. For branches with multiple mutations, order in time is arbitrary. 
Lineage absolute frequencies in Africa, Europe, and East Asia are reported."}