Detecting disease-associated genotype patterns Abstract Background In addition to single-locus (main) effects of disease variants, there is a growing consensus that gene-gene and gene-environment interactions may play important roles in disease etiology. However, for the very large numbers of genetic markers currently in use, it has proven difficult to develop suitable and efficient approaches for detecting effects other than main effects due to single variants. Results We developed a method for jointly detecting disease-causing single-locus effects and gene-gene interactions. Our method is based on finding differences of genotype pattern frequencies between case and control individuals. Those single-nucleotide polymorphism markers with largest single-locus association test statistics are included in a pattern. For a logistic regression model comprising three disease variants exerting main and epistatic interaction effects, we demonstrate that our method is vastly superior to the traditional approach of looking for single-locus effects. In addition, our method is suitable for estimating the number of disease variants in a dataset. We successfully apply our approach to data on Parkinson Disease and heroin addiction. Conclusion Our approach is suitable and powerful for detecting disease susceptibility variants with potentially small main effects and strong interaction effects. It can be applied to large numbers of genetic markers. Background It is now generally accepted that complex genetic traits such as diabetes and schizophrenia are under the influence of multiple interacting loci and environmental triggers, each with a possibly small effect. Thus, to overcome the limitations of traditional single-locus association analysis (looking for main effects of single marker loci), various methods have been developed to investigate the joint disease association of multiple markers. As an approximation to multivariate analysis [1], sums of single-locus association statistics [2] have proven to be efficient, powerful [3], and applicable to large numbers of markers. A more direct approach is to investigate whether sets of specific alleles or genotypes at different loci occur more frequently in case than control individuals [4,5]. Haplotypes are sets of alleles, one each at different loci, on a chromosome. Each individual has two haplotypes but, because of unknown phase, it is generally not possible to identify the two specific haplotypes in a given individual. For this reason, we prefer to work with sets of genotypes at different loci (diplotypes). To allow for possible interactions between any genes, we consider genotypes at different loci, wherever these occur in the genome, and refer to such sets of genotypes as genotype patterns. The simplest situation to consider is that of two genetic markers. Throughout this paper we will focus on single nucleotide polymorphisms (SNPs). One pair of SNPs, each with 3 genotypes, comprises a maximum of 9 genotype patterns. A possible strategy is to investigate all pairs of SNPs to see for each pair whether the genotype pattern frequencies are different in case and control individuals [6]. Several diseases have been described requiring a mutation at two different loci while occurrence of a mutation at only one locus does not lead to disease [7]. At least in experimental organisms, researchers have developed methods to specifically search for epistatic interactions between two loci, with the aim to detect networks of interacting loci [8]. In human genetics, several approaches to detecting the joint effects of multiple susceptibility variants have been proposed [9-14]. However, many of these methods are applicable only to small numbers of SNPs and are not suitable for genome-wide analysis of 100,000s of markers. For this reason, two-step approaches are more promising [2,15,16]. Specifically, for two-locus disease models, a two-step approach has been proposed that initially selects SNPs based on significant single-locus tests, with logistic regression analysis of all possible main and interaction effects at the second stage [16]. Here, we proceed in a similar manner but with important differences: initial selection of SNPs is also carried out based on their single-locus test results but irrespective of whether these are significant, and the number of these test SNPs to be carried forward for second-stage analysis may be varied from 1 up to any desired maximum number, m, limited only by computer resources. Instead of logistic regression analysis, we propose to test whether for a given number m of test SNPs the frequencies of genotype patterns is different in case and control individuals. It should be noted that genotype pattern frequencies comprise main effects (at one or multiple loci) and epistatic interaction effects. Here our focus is on genotype patterns with an interest in finding pattern frequencies reflecting a high degree of epistatic interaction so that disease-associated SNPs may be found even though they contribute to disease more through interaction effects than direct main effects. Such genes have been found, for example, in barley [17], and so-called digenic traits are known in humans [7]. Methods Pattern recognition approach Consider a dataset with a certain number of case and control individuals, each genotyped at a number M of SNPs, where M might be equal to 1 million, say. We want to find disease-associated patterns of m SNPs, m < 0 (i > 0), the parameter a0 is chosen in such a way that the model predicts a disease prevalence of K = 0.05. For example, this prevalence is predicted by parameter values a1 = a2 = 0, a3 = 3, and a0 = -5.05. Note that such a model containing only a second-order interaction effect does induce some direct genotype effects at each of the three disease loci. Depending on the genotype pattern, these parameter settings lead to odds ratios (ORs) for disease ranging from 0.0034 through 3.12 and, at each single locus, lead to ORs of 0.58 and 1.71 for the respective genotype codes xi = -1 and xi = +1. Initially, we only consider md = 3 SNPs in complete association with disease susceptibility variants. This way we directly compare power between the pattern and single-locus tests. Subsequently we also consider the effects of unassociated SNPs and how they degrade power. Results Simulation with disease SNPs Power calculations in this section were carried out for sample sizes of 50 case and control individuals each. For a pure main-effects model (a1 > 0, a2 = a3 = 0), Figure 1 shows that pattern and single-locus approaches have virtually the same power. This is to be expected: Disease SNPs are assumed uncorrelated and each acts independently. On the other hand, in the presence of second-order interaction effects (a1 = a2 = 0, a3 > 0), with direct single-locus effects only present inasmuch as they are induced by interaction effects (see above), Figure 2 demonstrates the tremendous power advantage of the pattern approach over the single-locus approach. Figure 1 Power (Y-axis) as a function of main effects (x-axis). For a pure main-effects model (x-axis = a1 > 0; interaction effects a2 = a3 = 0 are all zero), Figure 1 shows that pattern (solid red line) and single-locus (broken blue line) approaches have virtually the same power. Figure 2 Power (y-axis) as a function of interaction effects (x-axis). With strong second order interaction effects (x-axis = a3; a1 = a2 = 0), the pattern approach (solid red line) is much more powerful than the single-locus test (broken blue line). Simulation with disease and unassociated SNPs With md susceptibility SNPs and M unassociated SNPs (total of md + M SNPs), we need to define "success" in our simulations. In a given randomization sample, it is no longer sufficient that p'm = pm. In addition, we require that the particular subset of m = 3 ascertained SNPs with highest chi-square values comprise at least one disease SNP. An analogous rule applies to the single-locus test. For 100 case and controls each, with a1 = 1, a2 = 0, and a3 = 6, Figure 3 shows the deleterious effects of "noise" SNPs. In the presence of M = 1500 unassociated SNPs, power drops from 100% to 20%, where power loss is somewhat smaller for the pattern than the single-locus approach although both suffer greatly. This power loss may be offset by using additional observations. With the same parameter values, to maintain a power of 80%, Figure 4 shows that for the pattern method an increase in sample size from 100 to 180 is sufficient to offset the harmful effects of "noise" SNPs. As the number of unassociated SNPs increases, required sample sizes initially increase strongly but later less so. The required sample size increase is virtually the same for the two association tests considered here so that only one curve is shown in Figure 4. Figure 3 Power (y-axis) as a function of the total number, M, of SNPs. Figure 3 shows that power is strongly reduced by the presence of random (unassociated) markers. For 100 cases and controls each, under a model with main and interaction effects (a1 = 1, a2 = 0, a3 = 6), pattern approach (solid red line) and single-locus approach (broken blue line) suffer approximately the same power losses. Figure 4 Sample sizes (numbers of cases and controls, y-axis) for power of 80%. The power loss due to increasing numbers M of unassociated SNPs may be offset by increased sample sizes. Figure 4 shows the numbers of cases and controls each (y-axis) that are necessary to maintain power of 80% in the presence of M of unassociated SNPs (x-axis). Parameter values are as in Figure 3. Pattern approach (solid red line) and single-locus approach (broken blue line) furnish virtually identical results. Choice of number Of SNPs for analysis So far, we analyzed data looking for patterns of length m = 3 while the data were generated with md = 3. In reality, of course, the number md of disease variants is unknown. Thus, for our epistatic disease model (a1 = a2 = 0, a3 > 0, md = 3, no other SNPs, 100 cases and controls each), we estimated power for different numbers m of best SNPs considered. Not unexpectedly, power is highest when m = md (Figure 5). Also, power is better when m > md then when m 0) with three disease SNPs, md = 3 (no other SNPs, 100 cases and controls each), power (y-axis) is displayed as a function of the interaction coefficient, a3 (x-axis), for different numbers m of test SNPs. Power is highest when m = md. Analysis of Parkinson's Disease dataset A dataset with approximately 540 case and control individuals and approximately 408,000 SNPs genome-wide is available for re-analysis [20]. For our purpose, we chose to work with the 19,494 SNPs on chromosome 11 that passed our quality control measures; SNP rs10501570 on that chromosome was previously associated with Parkinson's Disease (PD) [20]. We decided to work with a subset size of m = 3 SNPs. In chi-square tests for 2 × 3 tables (case/control phenotype versus three SNP genotypes), the three largest values occurred for SNPs rs12364577, rs1377470, and rs10501570. In 10,000 randomization samples, the largest chi-square had an associated significance level of 5/10,000 = 0.0005 (corrected for multiple testing) while the other two chi-square values had associated p-values of 0.0023 and 0.0030. Forming genotype patterns for these three SNPs, we found five patterns with frequencies of at least 0.05 in either case or control individuals; the remaining patterns were combined into a "rare" class. The chi-square for the resulting 2 × 6 table had an associated randomized significance level of 0.0064. Evidently, for this dataset the single-locus test was more effective than our pattern test (0.0005 < 0.0064), presumably because of strong single-locus effects. Analysis of a dataset for heroin addiction In 104 former severe heroin addicts and 101 control individuals (all Caucasians), a case-control association analysis was carried with 10,000 SNPs [21]. Single-locus analysis furnished nonsignificant results: The largest chi-square in 2 × 3 tables of genotypes had an associated p-value of 0.15 (obtained in 10,000 randomization samples). On the other hand, evaluating genotype patterns for the three SNPs with individually smallest p-values revealed five common patterns, and the resulting 2 × 6 table of pattern frequencies had an associated experiment-wise significance level of 0.026. As reported in [21], a specific genotype pattern was associated with heroin addiction with an odds ratio of 6.25 while none of the component SNPs by themselves were significantly disease-associated. Interestingly, when we investigated genotype patterns comprising two, four, or five SNPs, the significance levels associated with the corresponding pattern tables were all in excess of 0.30. Thus, the specific set of three genetic variants identified seems to predispose to disease [21]. Discussion We report here a new multi-locus method for genetic case-control association analysis that jointly evaluates main and interaction effects of multiple genetic variants. In contrast to our Set Association method developed previously [2], it is not based on main effects of individuals SNPs only. Our method is able to find epistatic interaction effects although the test SNPs chosen for analysis are determined by SNP-specific association statistics. Clearly, if SNPs have weak individual effects then they are unlikely to be picked up by our approach. In an application to a study of former heroin addicts, our genotype pattern method clearly showed its advantages over standard single-locus methods. We implemented our approach in a computer program, RandomPat, which is available on request and will soon be available for downloading from our website [22]. In its current implementation, our method has a shortcoming in that it cannot handle missing observations. Thus, for m SNPs it will ignore patterns that contain one or more missing genotypes. A major advantage of our approach is that the number, m, of test SNPs can be chosen by the investigator. Varying m from 1 up to a suitable upper limit will show, which number of test SNPs is most significant in a given dataset. As demonstrated in the previous section, such an approach is expected to estimate the number of disease variants if they are all of equal strengths. Conclusion As we demonstrated, our method is clearly superior to the traditional approach of looking for single-locus effects. In addition, it is suitable for estimating the number of disease variants in a dataset. On the other hand, purely epistatic loci (no main effects by themselves) cannot be detected by our approach. Competing interests The authors declare that they have no competing interests. Authors' contributions Quan Long contributed to the concept of our approach and wrote the computer program. Qingrun Zhang participated in the development of the method and contributed to programming. Jurg Ott participated in methods development and wrote the manuscript.