Results

The Physical Structure of the DUP4 Structural Variant

We initially decided to confirm this structural model by physical mapping of the DUP4 variant using fiber-FISH. We grew lymphoblastoid cells from HG02554, the known DUP4 heterozygote in the 1000 Genomes Project sample collection, derived from a man with African ancestry from Barbados. We selected clones from the WIBR-2 human fosmid library spanning the region to act as fiber-FISH probes, based on fosmid end sequences previously mapped to the GRCh37 reference genome. Fiber-FISH showed that the reference haplotype generates signals consistent with the genome reference sequence (Figure 1A). Because of the high sequence identity between the tandemly repeated glycophorin regions, there is extensive cross-hybridization of probes that map to the GYPB repeat with the GYPE and GYPA repeats. The GYPE repeat can be distinguished by hybridization of a small GYPE-repeat-specific PCR product, and the GYPA repeat can be identified by a gap in the green fosmid probe signal, caused by 16 kb of unique sequence in the GYPA repeat. In addition, because the distal end of the blue fosmid probe overlaps the proximal end of the GYPE repeat, a small amount of blue signal is detected at the distal end of the GYPB and GYPE repeats, confirming repeat length and orientation. We identified DNA fibers showing an arrangement completely consistent with the DUP4 model proposed previously (Figure 1B).

Figure 1 Fiber-FISH Analysis of the DUP4 Heterozygote Sample HG02554
(A) An example DNA fiber from the reference haplotype. The position and label color of the fosmid probes are indicated above the fiber on a representation of the human reference genome, and the interpretation of the FISH signals is shown below the fiber.
(B) An example DNA fiber from the DUP4a haplotype. The Leffler model of the DUP4 haplotype is indicated above the fiber. The interpretation of the FISH signals is shown below the fiber.
(C) An example DNA fiber from the DUP4b haplotype. The interpretation of the FISH signals is shown below the fiber.

Identification of a Somatic DUP4 Variant

However, we also visualized fibers with an extra partial repeat unit, which we called DUP4b (Figure 1C). This novel variant carries an extra copy of the partial A-B repeat, which harbors the GYPB/GYPA fusion gene. We selected a fosmid probe that spanned the 16 kb insertion specific to the GYPA repeat and showed that the extra copy was at least partly derived from the A repeat, consistent with it being an additional copy of the partial A-B repeat (Figure S1).

To rule out large-scale karyotype changes being responsible for our observations of the additional novel variant (DUP4b), we analyzed the HG02554 lymphoblastoid cell line using metaphase-FISH, interphase-FISH, and multiplex-FISH karyotyping (Figure S2). DUP4 and reference chromosomes could be distinguished by interphase-FISH on the basis of hybridization intensity of a fosmid probe mapping to GYPB (Figure S2B). No evidence of large-scale inter- or intrachromosomal rearrangements or aneuploidy was found in any of our experiments.

We hypothesized that DUP4b is a somatic variant that arose through rearrangement of the original DUP4 variant (which we call DUP4a), but not of the reference variant. If this is true, we would expect to observe equal numbers of reference and DUP4 fibers, one type from each of the parental chromosomes, confirming the heterozygous DUP4 genotype of the source cells, with the DUP4 fibers subdivided into DUP4a and DUP4b variants. Of 24 fibers examined from HG02554, 12 were reference and 12 were DUP4, and, of the 12 DUP4 fibers, 7 were DUP4a and 5 were DUP4b. This strongly supports the model in which DUP4b is a somatic rearrangement of DUP4a and in which there are two sub-clones (populations) of cells, one with reference and DUP4a haplotypes, and the other with reference and DUP4b haplotypes.
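The fiber counts above can be checked with a short calculation. The sketch below is only an illustration and not the analysis performed in the study: it assumes an exact two-sided binomial test is a reasonable way to compare the reference and DUP4 fiber counts with the 1:1 ratio expected for a heterozygote, and the function and variable names are hypothetical.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Small relative tolerance so that ties with pmf[k] are included.
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-9)))

# Fiber counts reported for HG02554.
reference_fibers, dup4a_fibers, dup4b_fibers = 12, 7, 5
dup4_fibers = dup4a_fibers + dup4b_fibers
total_fibers = reference_fibers + dup4_fibers

# Heterozygote expectation (assumption of this sketch): reference and
# DUP4 fibers are sampled in a 1:1 ratio.
p_heterozygote = binom_two_sided_p(dup4_fibers, total_fibers)

# Share of DUP4 fibers carrying the extra partial repeat (DUP4b).
dup4b_fraction = dup4b_fibers / dup4_fibers

print(f"p value for 1:1 reference:DUP4 split = {p_heterozygote:.3f}")
print(f"fraction of DUP4 fibers that are DUP4b = {dup4b_fraction:.2f}")
```

With only 24 fibers the estimate of the DUP4b fraction is imprecise, so it is best read as a rough indication of sub-clone size rather than a measurement.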
We also analyzed the HG02554 cell line from the Oxford laboratory used in their study⁵ and confirmed the existence of DUP4b by fiber-FISH. The high frequency of DUP4b variant chromosomes within the cell lines, together with the observation of DUP4b in two cell line cultures, suggests that DUP4b is a somatic variant of DUP4a that arose before the passage received by the Oxford laboratory or the Wellcome Sanger Institute, either in the donor individual or early in the cell-culturing process, perhaps increasing in frequency because of the cell bottleneck associated with transformation.³⁷

To further characterize the somatic variation observed in HG02554, we used Illumina sequencing at high depth (50×) on HG02554 DNA purchased directly from Coriell Cell Repositories (extracted from their HG02554 lymphoblastoid cell line rather than from our cell lines), together with peripheral-blood-derived genomic DNA from two Tanzanian DUP4 homozygotes and two Tanzanian DUP4 heterozygotes. Analysis of sequence read depth across the glycophorin repeat region showed the same pattern as that observed previously,⁵ which led to a model that is confirmed by our fiber-FISH data (Figure 2A). DUP4 homozygotes show the expected increase to four copies and six copies in duplicated and triplicated regions, respectively.

Figure 2 Sequence Read Depth Analysis of DUP4 Homozygotes and Heterozygotes
(A) Normalized sequence read depth of 5 kb windows spanning the reference sequence glycophorin region for five samples. The lines show the Loess regression line (f = 0.1) for homozygotes (blue) and heterozygotes (green). Gene positions and repeats, with respect to the reference sequence, are shown above the plot.
(B) The difference between the HG02554 sequence read depth and the average sequence read depth of the two other heterozygotes, C05E and C05P, shown in 5 kb windows across the glycophorin region. Points highlighted in red are significantly different (p < 0.01).
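The copy numbers expected under the DUP4 model can be tabulated per genotype. The following is a minimal sketch, assuming that a reference haplotype contributes one copy of every region and that a DUP4 haplotype contributes two copies of duplicated regions and three copies of triplicated regions, which is what the four and six copies quoted for homozygotes imply. The region labels and function name are hypothetical.

```python
# Copies contributed by a single haplotype in each class of region.
# These values are an assumption derived from the homozygote copy
# numbers quoted in the text (four and six copies).
HAPLOTYPE_COPIES = {
    "reference": {"unchanged": 1, "duplicated": 1, "triplicated": 1},
    "DUP4": {"unchanged": 1, "duplicated": 2, "triplicated": 3},
}

def expected_copies(haplotype_1, haplotype_2):
    """Expected diploid copy number per region class for a genotype."""
    first = HAPLOTYPE_COPIES[haplotype_1]
    second = HAPLOTYPE_COPIES[haplotype_2]
    return {region: first[region] + second[region] for region in first}

for genotype in [("reference", "reference"),
                 ("reference", "DUP4"),
                 ("DUP4", "DUP4")]:
    print(genotype, expected_copies(*genotype))
```

Under these assumptions a heterozygote is expected to carry three and four copies of the duplicated and triplicated regions, respectively, which sits between the reference and homozygote profiles.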
We then compared the sequence read depth of HG02554 to that of the other two DUP4 heterozygotes to search for evidence of an increased copy number of the BAB partial repeat carrying the GYPA/GYPB fusion gene (Figure 1) suggested by our fiber-FISH data, which would reflect somatic mosaicism. HG02554 indeed shows a significant increase in DNA dosage of about 0.5 in regions matching the BAB partial repeat, reflecting an extra copy of the region in ∼50% of cells (Figure 2B).

Development of a Simple Robust DUP4 Genotyping Assay

Having characterized the structure of DUP4 variants, we designed a simple, robust junction fragment PCR assay that would allow detection of the DUP4 variant (both DUP4a and DUP4b) in nanograms of genomic DNA, at a large scale. This involved designing allele-specific and paralog-specific PCR primers across a known breakpoint, a process made more challenging by the high sequence identity between paralogs. DUP4-specific primers had a modified locked nucleic acid base incorporated at the terminal 3′ nucleotide to enhance specificity for the correct paralog.³⁸ We initially targeted the GYPA-GYPB breakpoint that created the fusion gene but found that a similar breakpoint was present in a frequent gene conversion allele. We therefore designed primers to target the breakpoint between the GYPE repeat and the GYPB repeat, which was predicted to be unique to DUP4. The DUP4 genotype was determined using a duplex PCR approach, with one pair of primers specific for the DUP4 variant and a second pair amplifying across the SNP rs186873296, outside the structurally variable region, acting as a control for PCR amplification. The assay was validated against control samples showing different structural variants⁵ and samples showing no structural variation, to ensure DUP4 specificity (Table S1, Figure S3).
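The logic of reading a duplex PCR with an amplification control can be written down explicitly. This is a hedged sketch of how such a result might be interpreted, not the scoring procedure used in the study: the names `Call` and `call_duplex_pcr` are hypothetical, and treating any reaction without a control product as a failure is an assumption made here. The sketch only covers presence or absence of the DUP4 junction fragment and does not address how DUP4 copy number is resolved.

```python
from enum import Enum

class Call(Enum):
    DUP4_PRESENT = "DUP4 junction fragment detected"
    DUP4_ABSENT = "DUP4 junction fragment not detected"
    FAILED = "no call: control product did not amplify"

def call_duplex_pcr(control_band: bool, dup4_band: bool) -> Call:
    """Interpret one duplex PCR reaction.

    control_band: product from the primer pair spanning the control SNP.
    dup4_band: product from the DUP4-specific junction primer pair.
    """
    # Without the control product the reaction cannot be trusted, so a
    # missing DUP4 band would be uninformative (assumption of this sketch).
    if not control_band:
        return Call.FAILED
    return Call.DUP4_PRESENT if dup4_band else Call.DUP4_ABSENT

print(call_duplex_pcr(control_band=True, dup4_band=True).value)
print(call_duplex_pcr(control_band=True, dup4_band=False).value)
print(call_duplex_pcr(control_band=False, dup4_band=False).value)
```

The control product is what allows the absence of a DUP4 band to be read as a true negative rather than a failed reaction.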
Association of DUP4 with Malaria-Related Phenotypes

The DUP4 genotyping assay allowed us to investigate the association of DUP4 with three quantitative traits related to malaria: hemoglobin levels in peripheral blood, parasite load, and mean number of clinical episodes of malaria. Hemoglobin levels showed the highest heritability of the three phenotypes in this cohort (Table 1). The DUP4 structural variant has previously been associated with both severe cerebral malaria and severe malarial anemia,⁵,³⁹ and both are diagnoses related to our quantitative traits. For example, although the causes of hemoglobin level variation between individuals from a malaria-endemic region will be multifactorial, they will be strongly affected by the malaria infection status of the individual, with infected individuals showing lower levels of hemoglobin.⁴⁰ At the extreme low end of the distribution of hemoglobin levels is anemia, a sign of malaria that is one important feature in the pathology of the disease.⁴¹

Table 1 Association of DUP4 Allele with Malarial Phenotypes in the Nyamisati Cohort

| Phenotype | Heritability of Phenotype (95% CI)ᵃ | Individuals | Association p Value |
| --- | --- | --- | --- |
| Hemoglobin | 0.302 (0.136–0.469) | 800 | 0.0054 |
| Parasite load | 0.104 (0.002–0.206) | 864 | 0.39 |
| Clinical episodes | 0.221 (0.131–0.311) | 939 | 0.72 |

ᵃ Calculated on this cohort using SOLAR.³⁰,³¹,⁴⁶

We analyzed data from a longitudinal study of a population from the village of Nyamisati, in the Rufiji river delta, 150 km south of Dar-es-Salaam, Tanzania, described previously.²⁹,³⁰,³¹ This region was holoendemic for malaria, predominantly P. falciparum, which causes 99.5% of all recorded clinical episodes of malaria. Parasite prevalence was recorded as 75% at the start of the study in 1993, falling to 48% in 1998, as measured by microscopy in 2- to 9-year-old children. A total of 962 individuals with pedigree information were genotyped; of these, 278 were DUP4 heterozygotes and 4 were DUP4 homozygotes.
Previous work has suggested that the DUP4 variant is at a frequency of about 3% in the Wasambaa of north-eastern Tanzania,⁵ whereas our analysis found an allele frequency of 13.4% (95% confidence interval 11.0%–16.1%) in the 348 unrelated individuals from Nyamisati village. Of these unrelated individuals, 87 were DUP4 heterozygotes and 3 were DUP4 homozygotes, with genotype frequencies in Hardy-Weinberg equilibrium.

We used the pedigree and genotype information from our full cohort to test for association of three malaria-related phenotypes with the DUP4 variant. Using a family-based association method modeled in QTDT,³⁶ we found a statistically significant association of the DUP4 variant with hemoglobin levels (p = 0.0054, Table 1). We estimated the direction of effect by comparing the mean corrected hemoglobin levels of unrelated individuals with and without the DUP4 variant. Individuals with the DUP4 allele showed a higher hemoglobin level than those without it, indicating that the DUP4 variant is associated with higher hemoglobin levels.
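The allele frequency and Hardy-Weinberg statements for the 348 unrelated individuals can be reproduced from the genotype counts. The sketch below is illustrative only: the text does not state how the confidence interval or the equilibrium test was computed, so the Wilson score interval and the one-degree-of-freedom chi-square test used here are assumptions, and all function names are hypothetical.

```python
from math import erfc, sqrt

def allele_frequency(n_het, n_hom, n_total):
    """Frequency of the DUP4 allele among 2 * n_total chromosomes."""
    return (n_het + 2 * n_hom) / (2 * n_total)

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

def hwe_chi_square(n_ref, n_het, n_hom):
    """Chi-square test (1 df) of Hardy-Weinberg proportions."""
    n = n_ref + n_het + n_hom
    q = (n_het + 2 * n_hom) / (2 * n)  # DUP4 allele frequency
    p = 1 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref, n_het, n_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    return chi2, erfc(sqrt(chi2 / 2))

n_total, n_het, n_hom = 348, 87, 3
n_ref = n_total - n_het - n_hom

freq = allele_frequency(n_het, n_hom, n_total)
low, high = wilson_interval(n_het + 2 * n_hom, 2 * n_total)
chi2, p_value = hwe_chi_square(n_ref, n_het, n_hom)

print(f"allele frequency = {freq:.3f} (95% CI {low:.3f} to {high:.3f})")
print(f"HWE chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

For these counts the Wilson bounds come out close to the interval reported above, and the chi-square p value is above 0.05, in line with the statement that genotype frequencies are in Hardy-Weinberg equilibrium. With only three observed homozygotes, an exact test would be a reasonable alternative to the chi-square approximation.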