Results Additional mQTL Associations Identified with the Illumina EPIC Array An overview of our study design is presented in Figure S1. We tested 5,210,475 imputed genetic variants against the 766,714 DNAm sites that were on the Illumina EPIC array and that passed our stringent QC criteria (see Subjects and Methods). We identified 12,689,548 significant mQTL associations (we used a conservative Bonferroni-corrected threshold of p < 6.52 × 10−14) between 2,907,234 genetic variants and 93,268 DNAm sites (Table S1; Figure 1A); there was a mean percentage point change in DNAm per additional reference allele of 3.46% (SD = 3.01%) across all mQTL-associated sites. Existing mQTL databases have been almost exclusively generated with the Illumina 450K array; more than half of the DNAm sites (n = 48,099, 51.6%; Table S2) that we identify as being associated with genetic variation with the Illumina EPIC array involve additional content not previously interrogated (Figure S2). Importantly, these additional mQTL associations are annotated to 5,172 genes not included in mQTL databases generated with the Illumina 450K array (Figure S3). DNAm sites associated with genetic variation are associated with a median of 65 (interquartile range = 22–162) mQTLs, probably reflecting linkage disequilibrium (LD) relationships between proximal variants. In contrast, each mQTL variant is associated with a median of two (interquartile range = 1–5) DNAm sites, and the majority of mQTL SNPs (n = 1,003,238, 34.5%) are associated with DNAm at only a single site (Figure S4). We performed LD clumping of the results for each DNAm site to identify the number of independent associations for each DNAm site (see Subjects and Methods); this process reduced the number of mQTL associations (p < 6.52 × 10−14) to 161,761 (1.27% of the total number of unclumped significant mQTL associations); a median of 1 (interquartile range = 1–2) mQTL variant associated with each DNAm site (Figure S5). At a more relaxed “discovery” threshold (p < 1 × 10−10), we identified a total of 17,051,673 mQTL associations between 3,281,391 genetic variants and 114,595 DNAm sites; these results are available in a searchable database (see Web Resources). Figure 1 DNA-Methylation Quantitative-Trait Loci Are Predominantly cis-Acting and Enriched in Sites at Which DNAm Is Highly Heritable (A) The genomic distribution of Bonferroni-significant (p = 6.52 × 10−14) mQTLs in whole blood; the position on the x axis indicates the location of Illumina EPIC array probes, and the position on the y axis indicates the location of genetic variants. The color of the point corresponds to the difference in DNA methylation per allele compared to the reference allele; the largest effects are plotted in dark red. A clear positive diagonal can be observed, demonstrating that the majority of mQTLs are associated with genotype in cis. (B) A bar plot of the percentage of DNA-methylation sites associated with common genetic variation and grouped by previous reported estimates of heritability (percent variation in DNAm is explained by additive genetic factors taken from van Dongen et al.13). Each bar plot demonstrates the percentage of DNA-methylation sites with Bonferroni significant genetic effects in cis only (blue), trans only (green), and both cis and trans (red) and with no significant genetic effects (white). mQTL Associations Predominantly Occur in cis and Influence DNAm at Sites Known to Be Influenced by Heritable Factors Consistent with the results of previous studies, we found that the majority of mQTL associations (n = 11,679,376; 92%) occur in cis, defined as situations where the distance between mQTL SNP and DNAm site is ≤500 kb17, 18, 20, 22 (Figure 1A). Cis mQTL variants are typically associated with larger effects on DNAm than those acting in trans (average cis effect = 3.48% change in DNAm per allele, average trans effect = 3.26% change in DNAm per allele; Mann-Whitney p < 2.23 × 10−308) (Figure S6). Furthermore, among cis mQTL associations, both significance and effect size increase as the distance between the genetic variant and DNAm site decreases (Figure S7). Compared to other tested DNAm sites, those associated with at least one mQTL variant (after correction for the number of tests performed [see Subjects and Methods], p < 6.52 × 10−14) are significantly enriched in intergenic regions and less likely to be located within both gene bodies (Chi square test: p < 2.23 × 10−308; Figure S8; Table S3A) and CpG islands (Chi square test p < 2.23 × 10−308; Figure S9; Table S3B). We used quantitative genetic data from a study of DNAm in monozygotic and dizygotic twins13 to show that DNAm at sites associated with at least one mQTL variant is more strongly influenced by heritable (additive genetic) factors than are other tested DNAm sites (mQTL sites: median heritability, h2, = 55% [interquartile range = 38%–71%]; all DNAm sites: median h2 = 12% [interquartile range = 5%–31%]; Mann-Whitney p < 2.23 × 10−308; Figure S10). Overall, the proportion of sites at which DNAm is associated with an mQTL variant increases as a function of the estimated additive genetic influence derived from twin analyses (Figure 1B). Interestingly, there is no significant difference in the contribution of additive genetic effects to variance in DNAm at sites associated with cis (median h2 = 56%; interquartile range = 39%–72%) and trans (median h2 = 57%; interquartile range = 32%–76%) mQTL variants (Mann-Whitney p = 0.910). Proximal DNA-Methylation Sites Share Genetic Associations Similar to the LD relationships that exist between proximal genetic variants, DNAm levels are often correlated between proximally located DNAm sites.14, 46 To further characterize the genetic architecture of DNA methylation, we investigated whether shared genetic effects on multiple DNAm sites underlies this regional correlation structure. Although genetic variants are often associated with variation at multiple DNAm sites (Figure S4), this does not establish a shared genetic effect; shared genetic signals influencing a pair of DNAm sites might result from two distinct causal genetic variants that are in strong LD. To formally test whether neighboring DNAm sites are influenced by the same causal variant, we used a Bayesian co-localization approach44 to interrogate all pairs of DNAm sites characterized as being located within 250 kb of each other and associated with at least one significant mQTL variant at our “discovery” significance threshold (p < 1 × 10−10). Our analyses assessed 3,535,812 pairs of DNAm sites with a median distance between DNAm sites of 110,493 bp (interquartile range = 47,914–178,085) and compared the pattern of mQTL associations for both DNAm sites to test whether they index an association with either the same causal variant or two distinct causal variants. We found that the posterior probabilities for virtually all of these (n = 3,520,781 [99.6%], median distance of 110,319 bp [interquartile range = 47,803–177,948]) supported a co-localized association within the same genomic region (PP3 + PP4 > 0.99). Of these, 281,898 pairs (8%) had sufficient support for the association of both DNAm sites with the same causal mQTL variant (PP3 + PP4 > 0.99 and PP4/PP3 > 1; Table S4); 234,460 pairs (6.6%) had “convincing” evidence (PP3 + PP4 > 0.99 and PP4/PP3 > 5) for co-localization of the same mQTL association according to the criteria of Guo and colleagues.47 DNAm sites that shared genetic effects with at least one other DNAm site co-localize with a median of three other DNAm sites, indicating a complex relationship between genetic variation and DNAm in cis. Figure 2, for example, demonstrates that chromosome 9 contains a broad genomic region (>400 kb) where 38 DNAm sites—spanning seven genes—have a common underlying genetic signal. Of note, these DNAm sites are not contiguous; a small number of genetically mediated DNAm sites located within this region do not share the same mQTL signal. Pairs of DNAm sites with a shared causal mQTL variant are enriched for concordant directions of effect (71.2% pairs with positive correlations versus 28.8% pairs with negative correlations, binomial test p = 1.48 × 10−323; Figure S11). Furthermore, these pairs are located relatively close together (median distance between convincing co-localized pairs = 12,394 bp [interquartile range = 1,004–49,110]), with evidence that the shared genetic architecture is structured around annotated genomic features. Co-localized pairs of DNAm sites are significantly more likely to be annotated to the same gene (OR = 6.08, Fisher’s test p < 2.23 × 10−308) or CpG island (OR = 1.54, Fisher’s test p < 2.23 × 10−308) than non-co-localized pairs. Where pairs of DNAm sites with a shared genetic signal are annotated to the same gene, they are nominally less likely to be annotated to the same feature than are pairs of DNAm sites annotated to different genes (OR = 0.956, Fisher’s test p = 2.52 × 10−7), suggesting that where genetic variation influences DNAm at multiple sites across a gene these sites do not necessarily cluster by genic feature and can be located anywhere from the transcription start site to the end of the last exon. DNAm is more likely to be positively correlated between pairs of co-localized sites annotated to the same gene than between pairs of sites annotated to different genes (OR = 1.85, Fisher’s p < 2.23 × 10−308), a result driven predominantly by pairs of DNAm sites annotated to the same feature within that gene (OR = 1.57, Fisher’s test p = 3.41 × 10−135) rather than those annotated to different features within a gene. Finally, pairs of DNAm sites with shared genetic effects annotated to the same genic feature, although not necessarily the same gene, are more likely to be positively correlated than pairs annotated to different genic features (OR = 1.73, Fisher’s p < 2.23 × 10−308; Figure S12). Figure 2 Shared Genetic Architecture between Neighboring DNA-Methylation Sites Heatmap of Bayesian co-localization results for all pairs of DNA-methylation sites with at least one significant mQTL (p < 1 × 10−10) in a genomic region on chromosome 9 (chr9:124783559–125216341). Columns and rows represent individual DNA-methylation sites (ordered by genomic location). The color of each square indicates the strength of the evidence for a shared genetic signal (from yellow [weak] to red [strong]); this strength is calculated as the ratio of the posterior probabilities that they share the same causal variant (PP4) compared to two distinct causal variants (PP3). The ratio was bounded to a maximum value of 10; gray indicates pairs of DNA-methylation sites that were not tested for co-localization. DNAm QTL Have Utility for Refining GWAS Signals for Complex Traits Genetic variants identified in GWAS analyses rarely index protein-coding changes. Instead, they are hypothesized to influence gene regulation because they are enriched in regulatory motifs, including enhancers and regions of open chromatin.24, 26 There is considerable interest in using regulatory QTLs to refine genetic association signals and prioritize potentially causal genes within the extended genomic regions identified in GWASs.17, 27, 48, 49 We next extended our previous application of the SMR approach28 to test 126,457 DNAm sites identified at our “discovery” mQTL threshold (p < 1 × 10−10) against 63 complex phenotypes with GWAS data (Table S5). The first stage of the SMR approach uses the most significantly associated mQTL SNP—that has also been tested in the GWAS dataset—as an instrumental variable and implements a two-step least-squares (2SLS) approach to compare the estimated associations. Using this approach, we identified 5,848 associations (p < 3.95 × 10−7 corrected for 126,457 DNAm sites) between 40 complex traits and 5,849 unique DNAm sites (Figure S13). Because the associations identified in this way potentially reflect two highly correlated but different causal variants for the GWAS trait and DNAm, the second stage of the SMR method repeats the analysis with alternative mQTL SNPs as the instrument. If there is a single causal variant associated with both the phenotype and DNAm, the association statistics will be identical regardless of the selected instrument. In contrast, if there are two separate causal variants, each correlated with the instrument, there will be variation in the results. To distinguish between these scenarios, we applied the heterogeneity in dependent instruments (HEIDI) test to select associations with non-significant heterogeneity (HEIDI p > 0.05) and identified a refined set of 1,662 associations between 36 complex traits and 1,246 DNAm sites (Table S6). Because the power of the SMR approach to detect pleiotropic associations reflects, in part, the power of the initial complex-trait GWAS, it is unsurprising that the highest number of SMR associations was found for traits characterized by the largest number of GWAS signals, such as height (423 significant GWAS loci, 506 SMR pleiotropic associations)50 and inflammatory bowel disease [MIM: 266600 ] (168 significant GWAS variants, 127 SMR pleiotropic associations).51 In contrast, no SMR associations were found for traits with few or no genome-wide-significant SNPs; such traits included parental age at death (0–1 significant GWAS variants),52 insulin secretion rate (no significant GWAS variants),53 and whether a person has ever smoked (no significant GWAS variants).54 We compared our SMR results to those obtained with our previous mQTL dataset—generated from a smaller number of samples—and observed high rates of replication for loci that were tested in both analyses. Because our previous SMR analysis was based on a subset of 43 traits and the reduced content of the Illumina 450K array, 842 pleiotropic associations reported in the current analysis were taken forward for replication; DNAm at 519 (33.0%) of these was associated with an mQTL variant, and therefore these associations had been tested in our previous SMR study; 268 (51.6%) were characterized by significant pleiotropic association in both studies. Furthermore, the vast majority of associations tested in both datasets (516; 99.4%) were in the same direction; this was significantly more than would be expected by chance (sign test p = 2.72 × 10−149; Figure S14), suggesting that there are many additional true signals in those that did not meet the stringent criteria for significance used in both studies. In order to prioritize genes for each complex trait, we characterized the genic location of associated DNAm sites. 1,269 (76.3%) of the identified pleiotropic associations involve DNAm sites located either within a gene or less than 1500 bp from the transcription start site; this rate is significantly higher rate than that for all DNAm sites tested in our SMR analysis (OR = 1.64, Fisher’s test p = 1.12 × 10−18). To further explore these 786 pleiotropic associations—occurring between 577 genes and 32 complex traits—we extended our SMR analyses to incorporate a publicly available whole-blood gene eQTL (n = 5,311 individuals) dataset.45 Expression of 232 (40.2%) of our identified genes was significantly associated with an eQTL variant, and we used these to test for pleiotropic associations between gene expression level and the GWAS trait. These analyses provided additional support for 138 of the pleiotropic associations identified with mQTL data, supporting a relationship between 33 genes and 17 complex traits (Table S7). Figure 3, for example, highlights an association between the regulation and expression of LIME1 and Crohn disease [MIM: 266600]; this association is supported by SMR analyses incorporating both mQTL and eQTL data. Figure 3 Summary-Data-Based Mendelian Randomization (SMR) Analysis Using Quantitative Trait Loci Associated with DNA Methylation (mQTL) and Gene Expression (eQTL) Implicates a Role for LIME1 in Crohn Disease Shown is a genomic region on chromosome 20 (chr20: 62335000–62371000) identified in a recent Crohn disease GWAS performed by Liu et al.51 Genes located in this region are shown at the top, exons are indicated by thicker bars, and the red arrows indicate the direction of transcription. The scatterplot depicts the –log10 p value (y axis) against genomic location (x axis) from the SMR analysis (where circles represent Illumina EPIC array DNA-methylation sites, squares represent gene expression probes, and solid green and red highlight those with a non-significant HEIDI test for DNA methylation and gene expression, respectively). The green and red horizontal lines represent the multiple-testing corrected threshold for the SMR test using mQTL and eQTL, respectively. Pleiotropic Associations between DNAm and Gene Expression Although it is widely hypothesized that DNAm influences gene expression, its relationship with transcriptional activity is not fully understood. DNAm across CpG-rich promoter regions, for example, is often assumed to repress gene expression via the blockage of transcription-factor binding and the attraction of methyl-binding proteins.55 DNAm in the gene body, in contrast, is hypothesized to be a marker of active gene transcription5, 56 and to potentially play a role in regulating alternative splicing and isoform diversity. To identify associations between DNAm and gene expression, we applied the SMR approach to DNAm sites identified as being associated with an mQTL at our “discovery” significance threshold, located within a megabase of a gene expression probe included in the eQTL dataset generated by Westra and colleagues.45 In total, we tested 488,342 pairs and explored relationships between 96,694 DNAm sites and 4,721 gene expression probes annotated to 4,049 genes (Figure S15). On average, each DNAm site was tested against a median of four expression probes (interquartile range = 2–7) mapping to a median of three genes (interquartile range = 2–6). In contrast, each expression probe was tested against a median of 85 DNAm sites (interquartile range = 56–130). Of these, 40,404 pairs (8.27%)—comprising 22,007 (22.8%) DNAm sites and 4,201 (89.0%) expression probes mapping to 3,628 (89.6%) genes—were characterized by a significant SMR result (significance threshold corrected for the number of DNAm sites and gene expression probe pairs tested = p < 1.02 × 10−7). 6,798 of these significant SMR pairs—comprising 5,420 (5.61%) DNAm sites and 1,913 (40.5%) expression probes mapping to 1,702 (42.0%) genes—also had a HEIDI p > 0.05 (Table S8; Figure S15). These results suggest that although expression of a large proportion of genes is associated with DNAm sites, not all DNAm sites are associated with gene expression in cis. The majority of significant gene expression probes (n = 1,192; 62.3%) are associated with a median of two DNAm sites (interquartile range = 1–4) spanning a median distance of 66,846 bp (interquartile range = 19,062–155,737) at a median density of 19,959 bp (interquartile range = 6,387–54,445) between sites. Interestingly, DNAm sites pleiotropically associated with gene expression are enriched in the gene body and transcription start sites of genes and depleted intergenically (Chi square test p = 7.08 × 10−133; Figure S16; Table S9). We identified a small but significant enrichment of scenarios where DNAm is negatively associated with gene expression at sites located in the 5′ UTR (mean effect = −0.0211; p = 0.00108), TSS200 (mean effect = −0.0479; p = 6.38 × 10−7), TSS1500 (mean effect = −0.0350; p = 5.82 × 10−11) and 1st exon (mean effect = −0.0506; p = 6.19 × 10−5), consistent with the hypothesis that promoter DNAm often represses gene expression (Figure S17). Using QTL Data to Refine the Genic Annotation Associated with DNAm Sites A key challenge in epigenetic epidemiology relates to the genic annotation of DNAm sites; such annotation is critical for the biological interpretation of significant EWAS associations. DNAm sites are usually annotated to specific genes on the basis of proximity, although the extent to which this approach is valid for inferring downstream transcriptional effects is not known. Among the identified pleiotropic associations between DNAm and gene expression, we selected instances where the DNAm site is not intergenic—i.e., <1500 bp from the transcription start site of a gene (n = 5,593 [82.3%)]—and found that these were annotated to the same gene whose expression level they were associated with at a much higher rate than were DNAm sites significantly associated with expression levels at another gene (OR = 9.67; Fisher’s test p < 2.23 × 10−308). Of the 5,460 DNAm sites significantly associated with expression of at least one gene, 1,790 (32.8%) were associated with the gene they were annotated to, although 276 (5.05%) of these were also associated to an additional gene and 2,686 (50.0%) were associated with a different gene. Of note, not all CpGs were tested against the gene they were annotated to because the gene lacked a significant eQTL; this was the case for the majority of DNAm sites (n = 2,701; 80.4%) identified as being associated with a gene other than the one they were annotated to. Of particular interest are the 944 (18.3%) intergenic sites that are associated with gene expression; these potentially enable additional gene annotations for interpreting the results of EWAS analyses. Overall, although the proximity-based annotation of DNAm sites appears to be appropriate in many instances, we identified notable exceptions. For example, Figure 4A shows that the DNAm site cg00331210, located within the body of NARFL on chromosome 16, is not associated with expression of that gene but with the FAM173A gene, which is located 7.9 kilobases away. Likewise, Figure 4B shows that the DNAm site cg00072720, located within the gene body of CLDN7, is not associated with expression of that gene but with that of two other genes (ACADVL and ELP5/C17ORF81) on chromosome 17. Figure 4 Regional Plots Demonstrating the Complex Relationship between Gene Expression and DNA Methylation as Identified by SMR Shown is an example of (A) a DNA-methylation site (cg00331210) that is associated with expression of a gene (FAM173A) that is not the most proximal to it and (B) a DNA-methylation site (cg00072720) associated with the expression of multiple genes (CLDN7 and ELP5). Each plot contains a gene track, where red arrows indicate the direction of transcription and a red diamond indicates the position of the pleiotropically significant DNA-methylation site. Circles and squares indicate the location of the gene expression probes that DNA-methylation sites were tested against. Color indicates the significance level of the SMR test (black to gray), and green indicates significant associations (p < 1.02 × 10−7). For significant associations, squares indicate tests that have non-significant heterogeneity (p > 0.05) and are indicative of pleiotropic associations.