DNA Pooling as a Tool for Case-Control Association Studies of Complex Traits
Case-control studies are widely used for disease gene mapping using individual genotyping data. However, analyses of large samples are often impractical because of the expense of individual genotyping. The use of DNA pooling can significantly reduce the number of genotyping reactions required and hence the cost of large-scale case-control association studies. Here, we discuss the design and analysis of genetic association studies based on DNA pooling.
Two types of statistical methods have been widely used to identify genetic determinants of diseases: linkage analysis and the association study. Linkage analysis attempts to find the approximate location of a disease gene relative to another DNA sequence, called a genetic marker, whose position on the chromosome is already known. Linkage analysis has been extremely useful in the identification of genes responsible for diseases with simple Mendelian inheritance, such as cystic fibrosis (Rommens et al., 1989). Because complex diseases are likely to be influenced by factors such as genetic heterogeneity, phenocopies, incomplete penetrance, genotype-by-environment interactions, and multilocus effects, the application of linkage analysis to complex diseases has been much less successful. Complex diseases are likely to be influenced by multiple genes of small effect (Elston, 1995). Association studies provide the most powerful approach to identify such genes underlying complex traits (Risch and Merikangas, 1996). Genetic association studies investigate whether there is a relationship between a “genetic marker” and the frequency or severity of a particular trait. Risch and Merikangas (1996) have shown that association studies can be a very powerful approach for finding genetic determinants of a complex disorder. Genetic association studies have been applied to a variety of complex human diseases such as cancer, alcoholism, pulmonary disease, heart disease, diabetes and Alzheimer's disease (Silverman and Palmer, 2000; Risch and Merikangas, 1996; Chakravarti, 1999).
Different experimental designs can be used to conduct genetic association studies. Cohort studies prospectively assemble a group of individuals, and then follow them to determine the frequency of developing disease. Cohort studies investigate the frequency of the DNA variant in the entire cohort for the estimation of risk ratio and predictive values. Cohort studies are labor intensive and costly. However, phenotype may be more clearly defined for genetic analysis through the use of longitudinal data due to the ability to observe the natural history of the disease and potentially significant comorbidities. This is an important issue in complex human disease since research into such diseases can be complicated by phenotype heterogeneity. Cohort studies are potentially less prone to ascertainment bias than are case-control studies, although geographic and population factors must be examined explicitly.
Case-control designs for genetic association studies are not different in conception from the case-control designs that have been well developed for use in epidemiological methods (Romero et al., 2002). Case-control studies examine the frequency of a DNA variant in individuals affected by a disease (cases) and in those not affected by the disorder (controls). Genetic association studies have been mostly performed in a case-control setting with unrelated affected subjects compared with unrelated unaffected subjects. Significant differences in allele frequencies or genotype frequencies between cases and controls are taken as evidence for involvement of an allele or genotype in disease susceptibility.
Case-control studies are potentially susceptible to bias if cases and controls are not in fact comparable. One source of this bias arises from population stratification. Population stratification occurs when differences in gene frequency are wholly or partially attributable to underlying population structure differing between cases and controls rather than to an association between disease and gene. Case-control studies are also susceptible to spurious association, that is, a false-positive result due to chance. The problem is particularly acute for studies involving a large number of markers, such as regional or genome-wide association studies. Spurious associations become less likely if a very stringent threshold for significance is required. Risch and Merikangas (1996) suggest a Bonferroni correction with a p-value of 5×10⁻⁸ to declare a significant association if one million SNPs (single nucleotide polymorphisms) were typed across the genome. While Bonferroni corrections (Bonferroni, 1936) are often used when multiple associations are tested in a study, some features of genetic association studies make them unattractive. The Bonferroni correction divides the type I error rate (α) by the number of tests performed; a test is then significant only if its p-value is less than this adjusted type I error rate. However, with the number of tests performed in a typical genetic association study, the Bonferroni correction can substantially lower the power to detect true associations. The correction also does not account for the fact that the test statistics of markers located in close proximity are not independent and are, in fact, highly correlated. For instance, two SNPs that are 50 kb apart are likely to be highly correlated due to linkage disequilibrium. To address this issue, Kaplan et al. (1997) and McIntyre et al. (2000) have proposed a Monte Carlo approach for evaluating associations within the transmission/disequilibrium test (TDT) framework for single-locus and multilocus tests, respectively.
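To make the correction concrete, the following minimal Python sketch computes the Bonferroni-adjusted per-test threshold for the one-million-SNP scenario discussed above; the example p-value is hypothetical.

```python
# Bonferroni-adjusted significance threshold for a genome-wide scan.
alpha = 0.05                # nominal type I error rate
n_tests = 1_000_000         # e.g., one million SNPs typed across the genome

threshold = alpha / n_tests
print(f"per-test threshold: {threshold:.1e}")   # 5.0e-08, as cited above

# A marker is declared significant only if its p-value beats this threshold.
p_value = 3.2e-9            # hypothetical p-value for one SNP
print("significant" if p_value < threshold else "not significant")
```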
The case-control study remains a popular approach in genetic epidemiology since case-control studies are economically and statistically efficient. Even though ascertainment of disease, selection of controls, and measurement of exposure present substantial difficulties in most case-control studies, a large body of epidemiologic theory provides guidance to meet these challenges. Bias arising from population stratification should be mitigated by proper design and analysis of case-control studies and by new statistical methods such as genomic control. This method, proposed by Devlin and Roeder (1999), does not require information about the genealogy of the population and corrects for population heterogeneity, poor choice of controls, and cryptic relatedness of cases. The validity of genetic association studies depends on the selection of appropriate controls since misclassification
of controls leads to false-negative results. Controls may be pre-symptomatic or asymptomatic rather than disease-free for diseases such as stroke, which tend to have a late age of onset. For example, small cerebral infarctions may be visible only on CT scans, so subjects classified as controls may in fact have asymptomatic disease. Therefore, as with standard epidemiologic studies, careful phenotypic characterization of cases and controls is crucial. Care and attention in the design, such as using older controls (i.e., outside of the risk profile), can mitigate this bias. Unlike risk factors in traditional case-control studies, genes are not subject to recall bias.
Population stratification is a form of confounding (Wacholder et al., 2000) that may cause spurious associations in a case-control study when allelic frequencies vary across subpopulations within the study sample.
In association studies, the standard methods for dealing with potential confounding effects are matching
and statistical adjustment. In association studies using DNA pools from many individuals, truly causal disease associations may not be distinguishable from associations due to differences in confounding factors between cases and controls. Thus, matching on confounding factors prior to DNA pooling is essential. Population stratification may be addressed by matched case-control designs, which match controls to cases on potential confounding factors such as age, sex, and ethnicity. That is, DNA pools need to be matched so that they have similar socio-demographic composition, to minimize the risk of spurious associations due to confounding. Allele frequency estimates from matched DNA pools then give a more reliable indication of causal disease association. Statistical methods have also been proposed to control for population stratification (Devlin and Roeder, 1999; Pritchard and Rosenberg, 1999; Pritchard et al., 2000). When a disorder has known risk factors, it may be desirable to construct multiple pools that differ in the level of exposure to the risk factors. Use of multiple pools incorporating risk factors might increase the power to detect an allelic association with disease. However, it can be argued that the use of multiple pools can be avoided by using the risk factors as covariates in the statistical analysis at the second stage, since DNA pooling is used as a screen to be followed by individual genotyping.
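As a minimal sketch of matched pool construction, the following Python fragment groups subjects into case and control pools within strata defined by confounders (here sex and age group); the subject records and stratum definitions are illustrative assumptions, not part of any published protocol.

```python
from collections import defaultdict

# Hypothetical subject records: (subject_id, status, sex, age_group).
subjects = [
    ("s01", "case", "F", "40-59"), ("s02", "control", "F", "40-59"),
    ("s03", "case", "M", "60+"),   ("s04", "control", "M", "60+"),
    ("s05", "case", "F", "60+"),   ("s06", "control", "F", "60+"),
]

# One case pool and one control pool per confounder stratum, so that matched
# pools have a similar socio-demographic composition.
pools = defaultdict(list)
for subject_id, status, sex, age_group in subjects:
    pools[(sex, age_group, status)].append(subject_id)

for (sex, age_group, status), members in sorted(pools.items()):
    print(f"{sex}, {age_group}, {status} pool: {members}")
```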
Family-based case-control designs can be used to address the stratification problem; that is, the stratification problem can be avoided by using parents or unaffected siblings as controls. However, family-based case-control designs are more expensive than case-control designs using controls that are unrelated to the cases, and family-based designs generally have less power than a well-designed study involving unrelated controls.
Genetic case-control studies still pose great challenges. One of the difficulties is obtaining the large number of genotypes needed. Additionally, sample sizes of hundreds or even thousands of subjects may be needed to achieve adequate statistical power to detect loci with modest effect sizes.
Genetic association studies using a set of SNPs that covers the human genome densely would be very expensive and beyond the reach of most laboratories, even though the cost of large-scale single-nucleotide polymorphism (SNP) determination has dropped dramatically in recent years. As a result, the development of innovative study designs that reduce the cost is warranted.
Haplotype-tagging SNPs and DNA pooling have the potential to reduce the cost barriers of genetic association studies. Botstein and Risch (2003) provide an excellent review comparing genome-wide haplotype map-based and sequence-based strategies. The use of haplotype-tagging SNPs is a map-based approach, while DNA pooling is a sequence-based approach. A comprehensive map-based approach may require genotyping 10-fold more SNPs than a sequence-based approach (500,000-1,000,000 for map-based versus 50,000-100,000 for sequence-based). Botstein and Risch (2003) suggest a sequence-based approach for the initial stage of a major program aimed at genome-wide association studies. Bansal et al. (2002) demonstrate the potential of sequence-based DNA pooling techniques and their associated technologies as an initial screen in the search for genetic associations.
The use of DNA pooling has been proposed as a means of reducing the number of genotyping reactions required and hence the cost of large-scale case-control association studies. DNA pooling is a powerful and efficient tool for high-throughput association analysis: it allows allele frequencies to be measured in groups (pools) of individuals, thereby dramatically reducing the number of PCR reactions and genotyping assays and offering an approach to this economic impasse. Furthermore, DNA pooling can be an extremely effective method for conserving precious DNA.
The most powerful methods for detecting the association between a marker and a phenotype require individual genotyping. Experimental savings can be achieved by testing allele frequency differences between DNA pools chosen by phenotypic value (Darvasi and Soller, 1994; Barcellos et al., 1997). In DNA pooling, equal amounts of DNA from each subject are combined to create a pool from which allele frequencies are estimated. That is, the DNA of individuals is mixed together to generate a pool before allele frequencies are estimated. In general, one pool is created from all cases and a second from all controls in a case-control study. Allele frequencies are then estimated in each pool and compared directly between the two pools. Various methods have been used for allele frequency estimation, including mass spectrometry, denaturing high-performance liquid chromatography (HPLC), and photo-lithography. Whatever method of allele frequency estimation is used, allele frequencies are estimated in each DNA pool by quantifying the relative amounts of the DNA products representing each allele.
In DNA pooling, allele frequency is measured in a large number of SNP markers in each of the pools as an efficient screen to enrich for SNPs with significant allele frequency differences (Bansal et al., 2002). The SNPs with the large allele-frequency differences in the pooled data are then selected for individual genotyping. The pooled genotyping step reduces the number of SNPs that must be individually genotyped to confirm allele-frequency differences between cases and controls. The larger the sample, the greater the saving, so that the design with minimal genotyping would involve comparing just two pools, each containing DNA from numerous individuals. These two pools could be constituted from cases and controls for a disease trait, or from individuals with trait values at the two extremes of a quantitative trait.
For quantitative traits such as blood pressure and cholesterol levels, Bader et al. (2001) provide power estimates for two pooled DNA designs that classify individuals as affected or unaffected, analogous to a case-control design. The optimal design for quantitative phenotypes is to pool the top and bottom 27% of individuals. This optimal design requires a sample size only 1.24 times larger than that required for individual genotyping when experimental measurement error is ignored. When measurement error is included, the pooled DNA association test still serves well as a pre-screen to identify candidate markers that then proceed to individual genotyping. This DNA pooling strategy can still provide substantial savings over individual genotyping. There are disparate recommendations on pool sizes in DNA pooling. While Barratt et al. (2002) suggest pool sizes of 50 or fewer, Mohlke et al. (2002) and Le Hellard et al. (2002) suggest larger pool sizes of up to 500 cases and 500 controls. Feng et al. (2004) argue that larger pool sizes are preferable since the quality of frequency estimates does not seem to degrade with larger pool size. The effect of pool size on DNA pooling needs further investigation.
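As a small illustration of the extreme-pooling design described above, the sketch below selects the top and bottom 27% of a simulated quantitative phenotype for pooling; the phenotype values and sample size are invented for the example.

```python
import random

# Simulated quantitative phenotype (e.g., cholesterol level) for 1,000 subjects.
random.seed(0)
phenotype = {f"s{i:04d}": random.gauss(200, 25) for i in range(1000)}

# Pool the top and bottom 27% of the trait distribution, analogous to
# "cases" and "controls" for a quantitative trait (Bader et al., 2001).
fraction = 0.27
ranked = sorted(phenotype, key=phenotype.get)
n_tail = round(fraction * len(ranked))

low_pool = ranked[:n_tail]       # bottom 27%: DNA combined into the "low" pool
high_pool = ranked[-n_tail:]     # top 27%: DNA combined into the "high" pool
print(len(low_pool), len(high_pool))   # 270 270
```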
There are some potential error sources in the stages of DNA quantification and pool formation (Barratt et al., 2002). One potential error source is that alleles with different sequences may not be amplified equally in the competitive reaction. Many high-throughput genotyping platforms measure the amount of each allele from a competitive amplification reaction and determine the genotype based on the relative abundance of the alleles. A standard procedure is to genotype some heterozygotes individually using the competitive reaction and then to estimate the average relative amplification rate of the two alleles experimentally. If each allele of an SNP site is equally represented in the assay, the ratio (k) of the amplifications of the competing alleles would be one. The frequency of allele A in a pool can then be estimated more accurately as A/(A + k · B), where A and B are the measured abundances corresponding to the two polymorphic alleles at the SNP locus. Other potential error sources are unequal amounts of DNA per individual and experimental measurement error.
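The correction can be written as a small Python function; the heterozygote signal ratios and pool signal intensities below are hypothetical values chosen only to illustrate the calculation.

```python
def corrected_allele_frequency(signal_a, signal_b, k):
    """Estimate the frequency of allele A in a pool as A / (A + k * B), where k
    is the mean A:B signal ratio observed in individually genotyped
    heterozygotes (k = 1 means both alleles amplify equally)."""
    return signal_a / (signal_a + k * signal_b)

# k is estimated from heterozygous individuals, whose true allele ratio is 1:1.
heterozygote_ratios = [1.22, 1.18, 1.25, 1.20]   # hypothetical A/B signal ratios
k = sum(heterozygote_ratios) / len(heterozygote_ratios)

# Hypothetical signal intensities measured in a DNA pool.
signal_a, signal_b = 640.0, 360.0
naive = signal_a / (signal_a + signal_b)          # ignores unequal amplification
corrected = corrected_allele_frequency(signal_a, signal_b, k)
print(f"naive = {naive:.3f}, corrected = {corrected:.3f}")
```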
The foundation of a successful DNA pooling association test is precise and accurate estimation of allele frequencies. Tang et al. (2004) show that SNP allele frequency estimates from pooled analysis are comparable to those from individual genotyping: the coefficient of determination (R-square) between the frequency estimates from DNA pooling and from individual genotyping is 0.975 using individual heterozygous samples. The quantitative abilities of various platforms used for estimating allele frequencies have been confirmed by multiple studies (Sham et al., 2002; Le Hellard et al., 2002; Shifman et al., 2002; Moskvina et al., in press). Mohlke et al. (2002) provide an example of the comparison of SNP allele frequency estimates from pooled analysis and from individual genotyping.
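One common way to quantify such agreement is the squared Pearson correlation between the two sets of frequency estimates; the following sketch uses made-up values purely to show the computation.

```python
import math

# Hypothetical allele frequency estimates for the same SNPs, obtained from
# DNA pooling and from individual genotyping.
pooled     = [0.12, 0.31, 0.45, 0.08, 0.27, 0.50]
individual = [0.11, 0.33, 0.44, 0.09, 0.25, 0.52]

n = len(pooled)
mean_p = sum(pooled) / n
mean_i = sum(individual) / n
cov  = sum((p - mean_p) * (q - mean_i) for p, q in zip(pooled, individual))
sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pooled))
sd_i = math.sqrt(sum((q - mean_i) ** 2 for q in individual))

r_square = (cov / (sd_p * sd_i)) ** 2     # squared Pearson correlation
print(f"R-square = {r_square:.3f}")
```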
The cost of large-scale SNP determination has dropped dramatically and is now approximately 1 cent per SNP genotype (Feng et al., 2004). Even so, epidemiologic genetic association studies based on individual genotyping would still be prohibitively expensive. Feng et al. (2004) note that genotyping 2 million SNPs for 500 cases and 500 matched controls would cost about $20 million. When a single case pool and a single control pool are used for pooled analysis, minor allele frequency estimates for 2 million SNPs can be obtained at a genotyping cost of about $40,000. If quadruplicate frequency estimates are obtained from the single case and control pools, the genotyping cost will be approximately $160,000. The DNA pooling technique therefore has the potential to reduce the cost barrier of large-scale genetic association studies.
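These figures follow directly from the quoted per-SNP cost, as the short calculation below shows; the 1 cent per measurement figure is the one cited above, and the rest is simple arithmetic.

```python
# Back-of-the-envelope genotyping costs, assuming about 1 cent per SNP measurement.
cost_per_snp = 0.01          # dollars per SNP measurement
n_snps = 2_000_000

# Individual genotyping: every SNP measured in every subject.
n_subjects = 500 + 500       # 500 cases + 500 matched controls
print(f"individual genotyping: ${cost_per_snp * n_snps * n_subjects:,.0f}")   # $20,000,000

# Pooled genotyping: every SNP measured once per pool.
n_pools = 2                  # one case pool, one control pool
print(f"single pools:         ${cost_per_snp * n_snps * n_pools:,.0f}")        # $40,000

# Quadruplicate frequency estimation from the same two pools.
n_replicates = 4
print(f"quadruplicate pools:  ${cost_per_snp * n_snps * n_pools * n_replicates:,.0f}")   # $160,000
```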
In the analysis of complex disease, the differences in allele frequencies between cases and controls are expected to be small. Therefore, the success of DNA pooling crucially depends on reproducible and highly accurate estimation of allele frequencies in cases and controls. Replicates are needed for each reaction in order to achieve high accuracy.
The most effective use of DNA pooling might be a two-stage design in which markers that show evidence of association in the pooled analysis are followed up by individual genotyping. Allele frequencies are estimated separately from the case and control pools at the first stage, and individual genotyping is then performed at the second stage for the selected markers with large allele-frequency differences. Thus, pooling can be used as an efficient and sensitive method of screening numerous markers to identify a subset of markers for more detailed study.
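A minimal sketch of such a first-stage screen is given below: SNPs are ranked by the absolute difference between their pooled case and control frequency estimates, and the top-ranked markers are forwarded to individual genotyping. The SNP identifiers, frequency estimates, and follow-up budget are hypothetical.

```python
# First-stage screen: rank SNPs by the pooled allele-frequency difference and
# select the most promising markers for second-stage individual genotyping.
pooled_estimates = {
    # snp_id: (estimated frequency in case pool, estimated frequency in control pool)
    "rs0001": (0.42, 0.31),
    "rs0002": (0.18, 0.17),
    "rs0003": (0.55, 0.40),
    "rs0004": (0.09, 0.11),
}

n_follow_up = 2   # hypothetical budget for stage two (individual genotyping)
ranked = sorted(pooled_estimates,
                key=lambda snp: abs(pooled_estimates[snp][0] - pooled_estimates[snp][1]),
                reverse=True)
stage_two_markers = ranked[:n_follow_up]
print("selected for individual genotyping:", stage_two_markers)   # rs0003, rs0001
```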
A common observation with the use of DNA pools is that the two alleles at a polymorphic SNP locus are not amplified in equal amounts in heterozygous individuals, depending on the design of the assay. In addition, there are pool-specific errors, so that estimates of allele frequencies vary between different pools constructed from the same individuals. As a result of these additional sources of variation, the outcome of an experiment is an estimated count of alleles rather than the usual observed count.
In the first stage, a simple method for analyzing SNP-based case-control association studies using DNA pooling is to multiply the estimated allele frequencies by the number of chromosomes in the pool and perform a chi-square test. This simple chi-square test has been used in several published studies (Shifman et al., 2002; Williams et al., 2002; Norton et al., 2003, 2004). Visscher and Le Hellard (2003) and Le Hellard et al. (2002) show that simply substituting estimated counts for observed counts can lead to an inflation of type I error rates due to unequal amplification in DNA pooling. Various researchers have made adjustments to account for unequal amplification (Mohlke et al., 2002; Yang et al., 2005; Visscher and Le Hellard, 2003). Visscher and Le Hellard (2003) modify the standard chi-square test by incorporating a variance inflation factor to control the type I error rate in the presence of experimental variation.
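The sketch below shows the naive allelic chi-square test on pseudo-counts formed from pooled frequency estimates, together with a crude division of the statistic by an inflation factor to allow for experimental error. It is only an illustration of the idea; the exact correction proposed by Visscher and Le Hellard (2003) differs in detail, and the pool sizes, frequencies, and inflation value here are assumptions.

```python
import math

def allelic_chi_square(freq_case, freq_control, n_case, n_control, inflation=1.0):
    """Naive allelic chi-square test on DNA-pool estimates: estimated allele
    frequencies are multiplied by the number of chromosomes (2N) to obtain
    pseudo-counts, and a standard 2x2 chi-square statistic is computed.
    `inflation` (>= 1) is a crude variance inflation factor that down-weights
    the statistic to allow for experimental error in the pooled estimates."""
    counts = []
    for freq, n in ((freq_case, n_case), (freq_control, n_control)):
        chroms = 2 * n
        counts.append((freq * chroms, (1.0 - freq) * chroms))  # allele A, allele B

    total = sum(sum(row) for row in counts)
    col_totals = [sum(row[j] for row in counts) for j in range(2)]
    chi2 = 0.0
    for row in counts:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / total
            chi2 += (row[j] - expected) ** 2 / expected

    chi2 /= inflation                            # allow for extra experimental variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square, 1 df
    return chi2, p_value

# Hypothetical pooled frequency estimates for 400 cases and 400 controls.
print(allelic_chi_square(0.36, 0.30, 400, 400))                  # uncorrected
print(allelic_chi_square(0.36, 0.30, 400, 400, inflation=1.5))   # with inflation factor
```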
As an alternative approach for the analysis of pooled data, Visscher and Le Hellard (2003) suggest an over-dispersed model, which is widely used for the analysis of clustered binary data in the statistical literature (Ahn et al., 2003; Jung et al., 2001; Kang et al., 2003). The parameters of the over-dispersed model can be estimated from a nested design of population samples and replicated pools within samples. Kang et al. (2004) investigate the allelic chi-square test used in genetic association studies in terms of empirical type I error rates and empirical power.
If large allele-frequency differences are observed between the case and control pools, genotyping is performed individually. One of the important issues in a DNA pooling study is the selection of the markers to be genotyped individually, and statistical methods are needed for deciding which markers proceed to the second stage. Konig and Ziegler (2004) propose decision-theoretic models to determine individual genotyping based on the results from pooled DNA. A Bonferroni correction, the false discovery rate (FDR), or related concepts can also be used to select the markers that move to the second stage. In the second stage, genotyping is performed individually and the markers are analyzed conventionally. More comprehensive control of confounding factors can be done at the second stage, which also enables the study of gene-gene and gene-environment interactions.
The importance of correcting for multiple comparisons in genomic screens is well known (Lander and Kruglyak, 1995). Benjamini and Hochberg (1995) introduce the false discovery rate (FDR) for multiple testing situations. Sabatti et al. (2003) show through simulation that the simple step-down procedure introduced by Benjamini and Hochberg (1995) controls the FDR for the dependent tests on which association genome screens are based.
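For reference, the Benjamini-Hochberg procedure itself is short enough to state as code; the p-values below are hypothetical first-stage results used only to show the selection rule.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg procedure: sort the m p-values, find the largest rank
    i with p_(i) <= (i / m) * fdr, and reject all hypotheses up to that rank.
    Returns the indices of the rejected (i.e., followed-up) markers."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from the pooled screening stage.
p_values = [0.0002, 0.009, 0.013, 0.041, 0.12, 0.34, 0.62, 0.87]
print(benjamini_hochberg(p_values, fdr=0.05))   # [0, 1, 2]: markers kept for follow-up
```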
Even though complex diseases are generally caused by multiple genetic variations, most available association methods are based on the assumption that a single genetic variation is primarily responsible for the disease under study. Only a few approaches consider interactions of multiple genes and environmental factors in identifying susceptibility loci for complex diseases (Hoh et al., 2001; Ritchie et al., 2001; Nelson et al., 2001; Kim et al., 2003; Hao et al., 2004). Hoh et al. (2001) develop a novel test procedure, called the set association approach, to identify genetic variation responsible for complex diseases when multiple genes are involved. This approach is appropriate for many study designs, such as case-control, trio and extended-family designs. The method uses a score statistic that is weighted by the allele contribution to a Hardy-Weinberg equilibrium measurement. All alleles are jointly evaluated, and the minimal p-value identifies the combination of alleles across genes that appear to act in concert to alter the risk of disease. They applied the set association approach to a real restenosis data set and identified several SNPs of interest that were in linkage disequilibrium with a susceptibility locus for restenosis, the re-blockage of a coronary artery after treatment. Zee et al. (2002) use this approach to define a panel of contributory genes in in-stent restenosis. Hoh et al. (2001) did not evaluate the empirical type I error rates and powers of the set association approach. Hao et al. (2004) systematically evaluate the performance of multiple-SNP association tests in terms of power and accuracy in capturing the real disease SNPs; they show through simulation that the inclusion of Hardy-Weinberg disequilibrium (HWD) reduces power, and they demonstrate that the test procedure can capture the SNPs associated with disease fairly successfully.
Romero et al. (2002) propose guidelines for the evaluation of reports of genetic association studies, covering the selection of SNPs, study design, assay characteristics, sample size determination, multiple testing, and statistical analysis. Such guidelines for the reporting of genetic association studies facilitate the peer review process, publication, and the availability of data for future studies and systematic reviews.
The DNA pooling technique is ideal for screening a large number of markers for association, although positive results will require confirmation through individual genotyping. Considerable savings in DNA, cost and labor can be achieved through the use of DNA pooling. All markers identified in the initial discovery process are subjected to a follow-up program, and additional work is necessary to eliminate false-positive associations that are likely due to sampling error and potential population substructure. The most powerful method is replication of the observations in one or more independent samples. That is, replication studies using pooled DNA from two or more samples are expected to play an efficient role in eliminating false-positive findings and in efficiently identifying meaningful associations.
Considerable cost reduction can be achieved through the use of DNA pooling, whereby DNA samples from multiple individuals are pooled before genotyping. It is suggested that DNA pooling should be considered especially in the initial stages of a major program aimed at genome-wide association studies, since large-scale association studies can be accelerated by its use.