PMC:6323551

Human-Disease Phenotype Map Derived from PheWAS across 38,682 Individuals Abstract Phenome-wide association studies (PheWASs) have been a useful tool for testing associations between genetic variations and multiple complex traits or diagnoses. Linking PheWAS-based associations between phenotypes and a variant or a genomic region into a network provides a new way to investigate cross-phenotype associations, and it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy. We created a network of associations from one of the largest PheWASs on electronic health record (EHR)-derived phenotypes across 38,682 unrelated samples from the Geisinger’s biobank; the samples were genotyped through the DiscovEHR project. We computed associations between 632,574 common variants and 541 diagnosis codes. Using these associations, we constructed a “disease-disease” network (DDN) wherein pairs of diseases were connected on the basis of shared associations with a given genetic variant. The DDN provides a landscape of intra-connections within the same disease classes, as well as inter-connections across disease classes. We identified clusters of diseases with known biological connections, such as autoimmune disorders (type 1 diabetes, rheumatoid arthritis, and multiple sclerosis) and cardiovascular disorders. Previously unreported relationships between multiple diseases were identified on the basis of genetic associations as well. The network approach applied in this study can be used to uncover interactions between diseases as a result of their shared, potentially pleiotropic SNPs. Additionally, this approach might advance clinical research and even clinical practice by accelerating our understanding of disease mechanisms on the basis of similar underlying genetic associations. Introduction Pleiotropy occurs when a given locus (e.g., a SNP or gene) influences two or more different phenotypes or traits. 
The phenome-wide association study (PheWAS) is an important tool that has the strength to identify associations between genetic variants and clinical phenotypes and also the potential to reveal pleiotropic associations among diseases.1, 2, 3, 4 Although pleiotropy often refers to a common molecular mechanism, PheWASs can identify statistical associations between a single variant and multiple phenotypes. They can also provide the basis for a statistical approach to identifying cross-phenotype associations, which can then be verified as true pleiotropic effects.1 Over the past decade, associations from hundreds of genome-wide association studies (GWASs) have accumulated in the EBI GWAS Catalog.5 Although a GWAS typically investigates a single phenotype at a time, the accumulated associations from many studies (such as those in the EBI GWAS Catalog) provide the opportunity to investigate cross-phenotype associations.6, 7 More recently, PheWASs have shown success in identifying cross-phenotype associations within the same study populations.8, 9 Electronic health records (EHRs) are a powerful resource for studying individual outcomes via multiple longitudinal data elements, such as disease diagnoses, laboratory measures, medications, and other health-related information. EHR data have been useful in population health research; more importantly, linking EHR data with genomics data enables us to examine the genetic architecture of various disease outcomes and traits. PheWASs have been an effective tool to mine genetic associations for candidate SNPs or genome-wide variants;10 hence, PheWASs provide the ability to identify cross-phenotype associations in which one SNP is associated with multiple diseases or traits. 
While investigating such cross-phenotype associations at a genome-wide scale, researchers might uncover potential hidden connections between diseases, especially when two diseases share associations with two or more SNPs that are located in different regions of the genome (Figure 1). One way to examine these connections is by creating a network of diseases in which pairs of diseases are connected on the basis of their shared associations with one or more SNPs. The strength of the network approach is that it condenses the complex links between SNPs and diseases and reveals links between diseases that would be hard to identify by just looking at disease associations at a single locus, such as when one only considers cross-phenotype association with a SNP. Figure 1 Overview of Network Construction The cross-phenotype associations from a PheWAS analysis were used to construct the network of diseases. In the construction of the bipartite network, diseases (represented by yellow circles) and SNPs (represented by blue triangles) formed an edge if there was an association identified between them. Then, the bipartite network projection for the diseases was used for constructing a disease-disease network (DDN). Previous networks based on gene-disease associations, such as the Human Disease Network, used gene-disease associations cataloged in the Online Mendelian Inheritance in Man (OMIM) database.11 Other studies have used summary statistics from the GWAS catalog and/or the Genetic Associations Database (GAD) to investigate SNP-phenotype associations by using network-based analyses.6 However, because these networks are based on summary statistics from disparate studies they have several critical limitations. First, differences in disease phenotype definitions across different studies can impact the interpretation of the association results,12 leading to a high false positive rate. 
Second, in most cases, networks constructed from summary statistics are limited in their ability to provide individual-level genotype and phenotype information. This limitation can restrict the design of follow-up studies conducted on the new hypotheses derived from the network analyses. In this study, we circumvented these limitations by drawing on association results from a single-source EHR that used consistent phenotype definitions and by using a single genotyping platform. We employed genetic associations between 625,325 SNPs and 541 ICD-9 (International Classification of Diseases, Ninth Revision) diagnosis codes from a PheWAS of 38,668 unrelated individuals in Geisinger’s biobank. A disease-disease network (DDN) was constructed from the 31,017 PheWAS association results (p value < 1 × 10⁻⁴).13 The DDN revealed thousands of connections between hundreds of diseases, and it also provided a high-level view of disease connections, including known and previously unreported disease links; therefore, to identify relevant disease connections from this dense network, we focused on three broad research goals. One of the key goals was to gain a bird’s-eye view of disease connections characterized by underlying genetic associations. More specifically, when we grouped the diseases into disease classes, we asked which diseases share strong links within a disease class, as well as across different disease classes. The second goal was to integrate functional knowledge of the genome with genetic associations to ascertain biologically relevant findings. We integrated epigenomic knowledge into the DDN and examined the changes on the basis of tissue specificity. A number of recent studies have used EHR data alone to identify disease correlations and comorbidities,14, 15 so our last goal was to explain some of these disease correlations and comorbidities in terms of shared genetics. 
We compared the PheWAS-derived DDN to a separate network of diseases identified via an orthogonal EHR-only approach without genetics. Additionally, we used network statistics to mine the DDN for clusters of diseases with known links to one another in order to generate new hypotheses. These disease connections can serve as the basis for new hypotheses to test for comorbidities and pleiotropy. With regard to testing new hypotheses, one of the most significant advantages of our approach is the single-source EHR linked to genomic data; it provides an opportunity to revisit individual-level genotype and phenotype data to design more targeted studies and ask more specific questions. Material and Methods Cross-Phenotype Associations To construct the DDN, we used the genetic associations, identified through the PheWAS approach, that were reported in our previous study, which comprehensively tested for associations between 625,325 SNPs and 541 EHR-based phenotypes.13 As part of the MyCode initiative, individuals agreed to provide blood and DNA samples for broad, future research, including genomic analyses as part of the Regeneron-Geisinger DiscovEHR collaboration and linking to data in the Geisinger EHR under a protocol approved by the Geisinger Institutional Review Board. The association testing was performed on genotype and phenotype data from 38,668 unrelated individuals. We used 31,017 associations with a p value < 1 × 10⁻⁴ to generate a network between disease diagnoses derived from ICD-9 phenotypes.13 Construction of the Network Disease-Disease Network In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two node groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D, S, E), where E is the set of edges between the two node groups. 
We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (D or S), such that two nodes in the projection for one group will form an edge if they share at least one neighbor in the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNPs or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used the Gephi software to construct and visualize the DDN (see Web Resources). To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is a generalization of Fisher’s exact test for the one-tailed case and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method overcomes some of the limitations of GWAS- or literature-based networks. 
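The projection and the chance-overlap test described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (which relied on Gephi and SciPy); the function names `project_diseases` and `overlap_p_value`, the toy associations, and the toy population size are all assumptions made for illustration, and only the standard library is used. A SNP identifier in this sketch could equally stand for an LD haplotype block.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def project_diseases(associations):
    """Project (disease, SNP) pairs onto diseases.

    Returns a dict mapping an alphabetically ordered disease pair to the
    number of SNPs (or LD blocks) that both diseases are associated with.
    """
    snp_to_diseases = defaultdict(set)
    for disease, snp in associations:
        snp_to_diseases[snp].add(disease)

    weights = defaultdict(int)
    for diseases in snp_to_diseases.values():
        # Every pair of diseases sharing this SNP gains one unit of edge weight.
        for a, b in combinations(sorted(diseases), 2):
            weights[(a, b)] += 1
    return dict(weights)


def overlap_p_value(N, K, n, k):
    """Hypergeometric upper tail P(X >= k).

    N: SNPs in the population, K: SNPs associated with disease 1,
    n: SNPs associated with disease 2, k: SNPs shared by both.
    Equivalent to 1 - CDF evaluated at k - 1.
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical toy associations; these are not results from the study.
associations = [
    ("T1D", "rs1"), ("RA", "rs1"),
    ("T1D", "rs2"), ("RA", "rs2"), ("MS", "rs2"),
    ("Obesity", "rs3"),
]
edges = project_diseases(associations)
p = overlap_p_value(N=10, K=3, n=3, k=2)
print(edges)
print(round(p, 4))
```

Summing the tail from k upward makes the "k or more" convention explicit, which avoids the off-by-one that can arise when a complementary CDF is evaluated at k rather than at k − 1.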
Given a population of N SNPs, wherein K are associated with a given ICD-9 code 1 and n are associated with a given ICD-9 code 2, the probability that exactly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows:

$$p = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

The cumulative sum of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 − CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower if the number of common SNP associations (k) is high relative to the number of SNPs associated with each disease. Network Statistics Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics. Hub Diseases Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node (n_i) is calculated on the basis of the number of shortest paths between two other nodes (n_j, n_k) in the network and the number of times these paths pass through the node (n_i). We computed the betweenness centrality of each node over all pairs of nodes across the whole network. 
The mathematical notation of betweenness centrality is as follows:

$$C_B(n_i) = \sum_{j,k} \frac{g_{j,k}(n_i)}{g_{j,k}}$$

where $g_{j,k}$ is the number of shortest paths linking nodes j and k, and $g_{j,k}(n_i)$ is the number of those paths passing through node i. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling the node size based on its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics. Community Detection Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used the Louvain method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN. Tissue-Specific Functional Annotation To investigate the tissue-specific disease connections in the network, we used annotations from the 15-state chromatin models available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 For each of 127 different tissues, we assigned the most probable chromatin state, defined via posterior probability, to every 200 base-pair window across the genome. 
We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides annotations ranging from the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start sites (TSSs); as a proof-of-concept, we only analyzed enhancer-state annotations. The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto any of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed. Results Disease-Disease Network Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10⁻⁴) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. 
Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200) (Figure 2). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 This could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. This could also be a reflection of a high correlation between disease occurrences. Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block. Diseases Connected to the Most Other Diseases Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (K = 7, Figure 3). 
With links to 32 diseases, hypothyroidism had the highest degree property (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Some comorbidities observed in the DDN were morbid obesity,32, 33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases. Figure 3 Disease Neighbors In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN. Hub Diseases in the DDN To further characterize the DDN, we applied different network statistics to identify disease nodes necessary for the cohesiveness of the network. Such nodes are also commonly referred to as hub nodes (see Material and Methods). We used a betweenness centrality measure to identify hub nodes, which are represented in the DDN by larger nodes (Figure 2). We identified many hub nodes in different disease classes across the DDN; the highest number were in endocrine disorders and included hypothyroidism, type 1 diabetes, and type 2 diabetes (Figure 2). Other main hub nodes that we observed in the DDN were psoriasis, morbid obesity, multiple sclerosis, rheumatoid arthritis, coronary atherosclerosis, and chronic kidney disease. 
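The two node-level statistics used in this section, degree and betweenness centrality, can be illustrated with a small Python sketch. This is an assumption-laden illustration rather than the Gephi plug-in used in the study: it applies Brandes' algorithm to an unweighted, undirected toy graph, counts each unordered pair of nodes once, and uses hypothetical edges between a few disease names taken from the text.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm.

    adj maps each node to the set of its neighbours (undirected, unweighted).
    Each unordered pair (j, k) contributes once to the score.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate path dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair was visited from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical toy edges; they are not taken from the DDN itself.
adj = {
    "hypothyroidism": {"morbid obesity", "type 2 diabetes", "vitamin D deficiency"},
    "morbid obesity": {"hypothyroidism", "type 2 diabetes"},
    "type 2 diabetes": {"hypothyroidism", "morbid obesity"},
    "vitamin D deficiency": {"hypothyroidism"},
}
degree = {v: len(neighbours) for v, neighbours in adj.items()}
bc = betweenness(adj)
print(degree)
print(bc)
```

In this toy graph the node with the highest degree is also the only node lying on shortest paths between other nodes, but the two measures need not agree in general, which is one reason to report both.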
Identifying Biologically Relevant Subnetworks via Epigenomics These results demonstrate that community detection is a good approach to visualizing the global and local structures of disease interaction. To further test whether the disease nodes and the connections between them are relevant to molecular mechanisms of disease, we incorporated chromatin-state annotations from the Roadmap Epigenomics Consortium and used them to extract biologically relevant subnetworks by using a similar approach. We only considered SNPs within enhancer regions for specific tissues for the current analysis. Seven tissue-specific DDNs were constructed from the shared variants in enhancer regions. The largest observed subnetwork where SNPs were in active enhancer regions was in the liver. The associated diseases for this tissue included 19 diseases, such as cirrhosis of the liver, chronic non-alcoholic liver disease, hyperlipidemia, morbid obesity, essential hypertension, and cardiovascular diseases, among others (Table S2). For adipose tissue, there were eight diseases in the subnetwork, including links between cardiovascular, nutritional, endocrine, and autoimmune diseases (Figure 4). Only two of the nodes in this subnetwork were connected to each other. Within the adipose subnetwork, we observed connections between cardiovascular diseases such as peripheral vascular disease, myocardial infarction, coronary artery disease, and abdominal aneurysm. Supporting these connections, previous studies have reported known links between increased gene expression in adipose tissue and cardiovascular diseases.24, 25 The second node was for type 1 diabetes, which had connections to psoriasis and Raynaud syndrome. Psoriasis and type 1 diabetes are both autoimmune diseases, and they share associations with the variation in the human leukocyte antigen (HLA) region. 
Numerous studies have identified strong connections between the pathogenesis of these autoimmune diseases and variations in HLA.38, 39 Figure 4 Diseases with Shared Enhancers in Adipose Tissue The highlighting of disease nodes in the network indicates that the shared SNPs between these diseases are located in the enhancer region of the nearby gene. Community Detection EHR data provide a vast amount of information pertaining to diseases. Machine-learning approaches are being applied to longitudinal EHR data so that predictive models of disease correlations, risk predictions, and comorbidities can be developed.40, 41, 42 EHR-based predictive models can be used for combining disease connections into a network similar to the DDN. To compare the DDN with networks from longitudinal EHR data, we applied a probabilistic relationship model to ICD-9 diagnoses derived from the same Geisinger longitudinal EHR data (unpublished data). These prediction models were developed under an Ising model framework,43 and all the predictions were based on EHR data alone. The Ising model is a type of Markov random field (MRF) graphical model for binary data.44 It provides an approximation of the full joint-probability distribution across hundreds of ICD-9 codes. Thus, it can help to uncover patterns of dependencies between ICD-9 codes that result from either shared genetic or environmental architecture. This predictive algorithm generated a graphical model of disease states for 500 ICD-9 codes; this model is a representation of similarities between ICD-9 codes. Then we evaluated whether we observed the same links that we identified in the PheWAS-derived DDN. Rather than comparing all the disease connections, which would be computationally intensive, we applied the community-detection method in Gephi to the DDN in order to find subnetworks algorithmically. The method found nine communities; as shown in Figure 5, the number of diseases in each community varied between clusters of 2 and 102. 
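The Louvain method used for community detection searches for a partition that maximizes a quantity called modularity. The Python sketch below does not implement the Louvain search or reproduce Gephi's feature; it only computes the modularity score of a given partition on an unweighted graph, so that two candidate partitions can be compared. The function name `modularity`, the node labels, and the toy edges are assumptions for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    Q = sum over communities of [L_c / m - (d_c / (2 m))^2], where L_c is
    the number of edges inside community c, d_c is the total degree of its
    nodes, and m is the total number of edges.
    """
    m = len(edges)
    intra = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical toy graph: two triangles joined by a single bridging edge.
edges = [
    ("d1", "d2"), ("d2", "d3"), ("d1", "d3"),
    ("d4", "d5"), ("d5", "d6"), ("d4", "d6"),
    ("d3", "d4"),
]
split = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "B", "d6": "B"}
merged = {node: "A" for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

Separating the two triangles scores higher than placing all nodes in one community, which is the kind of improvement a modularity-maximizing method would be expected to favor.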
Figure 5 Disease Communities The plot shows the distribution of community disease connections, which were identified by community detection. The x axis shows the total number of communities identified, and the y axis shows the number of disease nodes in each community. Next, we selected one community that encompassed 20 diseases and showed connections between different disease classes, such as nutritional, neurological, cardiovascular, skin, and digestive-system disorders (Figure 6A). We compared this subnetwork of the DDN with the network derived from a probabilistic graphical model of disease state, wherein disease state is defined as the status of all ICD-9 code diagnoses in an individual’s EHR. We used the Ising model framework to develop the probabilistic graphical model of disease state. We checked to see whether we could observe some of the links we identified in our DDN subnetwork (identified via community detection) in the Ising model of disease state (Figure 6B). Through this independent investigation, we identified direct and indirect connections between ICD-9 codes in the Ising model network; these connections were similar to those found in the DDN. Thus, we demonstrated a probabilistic dependence between these diagnosis codes in line with what we see in our network. When we compared the diseases directly neighboring morbid obesity in the DDN with those in the Ising model (Figure 6), we found many similarities. Specifically, the comorbidities that showed direct links to morbid obesity in both networks were sleep apnea,45 lumbago,46 and edema.47 These results suggest that the probabilistic dependencies observed between these diseases in the Ising model network can probably be explained by the shared genetic architecture that was identified through the DDN. 
In the DDN, we also found links between morbid obesity and cardiovascular diseases (coronary atherosclerosis and intermediate coronary syndrome), which are known comorbidities.45 Other interesting links with morbid obesity were bariatric-surgery-associated conditions, such as post-gastric absorption and post-surgical non-absorption. It is possible that these connections might be due to a diagnosis correlation that arose in the EHR when an individual underwent bariatric surgery because of their pre-existing condition of morbid obesity. Gout was also a comorbidity of morbid obesity.45 However, these diseases were connected indirectly through another comorbidity: sleep apnea. With this example, we highlight the core strength of EHR-based studies, which allow us to answer similar questions about disease relationships with different methods and thereby provide more robustness to the findings. Figure 6 Comparison of Disease-Disease Network Construction through Two Orthogonal Approaches The figure illustrates the similarities between the disease network that was constructed on the basis of genetic associations (the DDN) (A) and the probabilistic model created from longitudinal EHR data (the Ising model) (B). Discussion In this study, we generated and evaluated a network of cross-phenotype associations derived from an EHR-based PheWAS. In contrast to previous disease networks, which were built of summary statistics from disparate studies, the DDN benefits from utilizing a single source of EHR data. The network analyses performed on the DDN have illuminated deeper structures within and across disease classes. For example, autoimmune diseases are caused by dysfunctional immune systems that attack the healthy cells in a variety of organs. Type 1 diabetes, rheumatoid arthritis, and multiple sclerosis were some of the common autoimmune conditions within the DDN. 
Although these conditions have distinct symptoms, previous findings have shown strong evidence that complex interactions occur between these diseases as a result of shared genetic architecture.48, 49 The identification of these previously known findings regarding these autoimmune diseases provides support for the network approach of investigating cross-phenotype associations derived from PheWASs. In this study, the SNPs linking these autoimmune diseases mapped to 19 genes, variations in all of which were associated with increased risk of autoimmune disease (Table S2). Two genes, C6orf10 (chromosome 6 open reading frame 10 [MIM: 618151]) and TAP2 (transporter 2, ATP-binding cassette, subfamily B [MIM: 170261]), were the only two genes linked to three autoimmune diseases: type 1 diabetes, rheumatoid arthritis, and multiple sclerosis. Of the 19 genes, C2 (complement component 2 [MIM: 613927]), HCG26 (HLA complex group 26 [HGNC: 29671]), and PSMB8 (proteasome subunit beta 8 [MIM: 177046]) had no previously known associations with autoimmune diseases. However, we replicated the findings of a genetic study of one of the largest European American cohorts (UK Biobank), which revealed associations between rheumatoid arthritis and multiple sclerosis.54 Additionally, we performed a gene ontology (GO) enrichment analysis with genes shared between type 1 diabetes, multiple sclerosis, and rheumatoid arthritis. Notably, many immune-system-process-related GO terms were identified (Table S3). Using epigenomics, we found that a variant in HCG26, one of the 19 genes, is located in the enhancer region targeting LTA (lymphotoxin alpha [MIM: 153440]); the variant was identified in multiple tissues by the fine mapping approach described in Verma et al..13 (dbSNP: rs2523663). LTA is a protein-coding gene that encodes cytokines produced by lymphocytes in the immune system (see NCBI in Web Resources). 
Cytokines play an important role in the pathogenesis of various autoimmune disorders, and cytokine-inhibiting agents are key drug targets for type 1 diabetes and multiple sclerosis.50, 51, 52, 53 Because of the many key genes shared between connected diseases, along with the epigenetic regulation, cytokine-inhibiting agents may offer intervention strategies to satisfy the unmet medical needs that still exist in those connected diseases. Additionally, we identified previously unreported disease connections by using the DDN approach. For example, we found that links between morbid obesity and its known comorbidities can be explained by shared genetic associations. These comorbidities were not present in the Human Disease Network (Figure S1). This inconsistency might be explained by differences in the phenotypes used to construct the network. We also demonstrated similarities between networks formed from two distinct predictive algorithms from the same EHR system. Taken together, these results suggest that the probabilistic dependencies observed between certain diseases (e.g., morbid obesity, sleep apnea, lumbago, gout, venous insufficiency, and edema) in the Ising model can be explained by shared genetic architecture identified via our disease-disease network. With this example, we highlight the core strength of EHR-based studies: the ability to apply different approaches, such as using genetic and/or phenotypic information, in order to arrive at a stronger conclusion. The potential strength of the DDN is to identify disease connections that were not expected. From the DDN generated in this study, we found that hyperlipidemia was linked to not only atherosclerosis, but also many immune-related diseases, such as type I diabetes, psoriasis, hypothyroidism, and multiple sclerosis, as well as other immune-mediated diseases, such as allergic rhinitis, blepharitis, acute bronchitis, and herpes. 
These unexpected observations indicate the non-canonical role of the immune system in lipid-metabolizing disorders and/or the pathogenic role of hyperlipidemia in immune responses. Indeed, lymphotoxin (LT) and LIGHT, two tumor necrosis factor cytokine family members that are primarily expressed on lymphocytes, are critical regulators of key enzymes that control lipid metabolism in mouse models.54 Although further studies are warranted to infer the causality of these associations, our DDN confirmed the shared genetic risk of hyperlipidemia and immune diseases. In conclusion, community detection is a powerful method to identify and visualize cross-phenotype associations from analysis of PheWASs. It uncovered previously unreported shared links in known interactions between diseases, as well as other unreported connections between diseases. It also provided a way to generate new hypotheses to guide further targeted investigation into comorbidities, pleiotropy, and epistasis. Although we explored interconnections between multiple diseases in an EHR-based population, this approach can also be applied to many publicly available resources with summary-level and individual-level data on multiple phenotypes. Networks similar to the ones generated here could be adapted from NHANES, UK BioBank,55 GERA,56 eMERGE,57 and the Million Veteran Program, among other populations. Furthermore, we plan to extend the network analysis by including associations between genetic variants and clinical laboratory measures in the EHR. This work provides new avenues by which network-based methods can be applied to large, gene-trait-based studies to uncover the genetic underpinnings of disease. Lastly, an interactive visualization tool of the disease-disease network is available (see Web Resources). Declaration of Interests The authors declare no competing interests. 
Web Resources Disease-Disease Network Visualization Tool, https://www.biomedinfolab.com/software eMERGE, https://emerge.mc.vanderbilt.edu Gephi, https://gephi.org Million Veteran Program, https://www.research.va.gov/mvp/ NCBI, https://www.ncbi.nlm.nih.gov/gene/4049 NHANES, https://www.cdc.gov/nchs/nhanes/index.htm OMIM, http://www.omim.org/ Roadmap Epigenomics Project, http://www.roadmapepigenomics.org/data/ UK BioBank, https://www.ukbiobank.ac.uk Supplemental Data Document S1. Figures S1 and Tables S2 and S3 Table S1. Summary Information of Disease Pairs and the Number of Shared SNP Associations Used for Creating the Disease-Disease Network Document S2. Article plus Supplemental Data Acknowledgments This work was supported by the National Library of Medicine (NLM) R01 NL012535. This project is also funded, in part, by a grant provided by the Pennsylvania Department of Health (#SAP 4100070267). The Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions. Supplemental Data include one figure and three tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.11.006.

Document structure

article-title	Human-Disease Phenotype Map Derived from PheWAS across 38,682 Individuals
abstract	Phenome-wide association studies (PheWASs) have been a useful tool for testing associations between genetic variations and multiple complex traits or diagnoses. Linking PheWAS-based associations between phenotypes and a variant or a genomic region into a network provides a new way to investigate cross-phenotype associations, and it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy. We created a network of associations from one of the largest PheWASs on electronic health record (EHR)-derived phenotypes across 38,682 unrelated samples from the Geisinger’s biobank; the samples were genotyped through the DiscovEHR project. We computed associations between 632,574 common variants and 541 diagnosis codes. Using these associations, we constructed a “disease-disease” network (DDN) wherein pairs of diseases were connected on the basis of shared associations with a given genetic variant. The DDN provides a landscape of intra-connections within the same disease classes, as well as inter-connections across disease classes. We identified clusters of diseases with known biological connections, such as autoimmune disorders (type 1 diabetes, rheumatoid arthritis, and multiple sclerosis) and cardiovascular disorders. Previously unreported relationships between multiple diseases were identified on the basis of genetic associations as well. The network approach applied in this study can be used to uncover interactions between diseases as a result of their shared, potentially pleiotropic SNPs. Additionally, this approach might advance clinical research and even clinical practice by accelerating our understanding of disease mechanisms on the basis of similar underlying genetic associations.
body	Introduction Pleiotropy occurs when a given locus (e.g., a SNP or gene) influences two or more different phenotypes or traits. The phenome-wide association study (PheWAS) is an important tool that has the strength to identify associations between genetic variants and clinical phenotypes and also the potential to reveal pleiotropic associations among diseases.1, 2, 3, 4 Although pleiotropy often refers to a common molecular mechanism, PheWASs can identify statistical associations between a single variant and multiple phenotypes. They can also provide the basis for a statistical approach to identifying cross-phenotype associations, which can then be verified as true pleiotropic effects.1 Over the past decade, associations from hundreds of genome-wide association studies (GWASs) have accumulated in the EBI GWAS Catalog.5 Although a GWAS typically investigates a single phenotype at a time, the accumulated associations from many studies (such as those in the EBI GWAS Catalog) provide the opportunity to investigate cross-phenotype associations.6, 7 More recently, PheWASs have shown success in identifying cross-phenotype associations within the same study populations.8, 9 Electronic health records (EHRs) are a powerful resource for studying individual outcomes via multiple longitudinal data elements, such as disease diagnoses, laboratory measures, medications, and other health-related information. EHR data have been useful in population health research; more importantly, linking EHR data with genomics data enables us to examine the genetic architecture of various disease outcomes and traits. PheWASs have been an effective tool to mine genetic associations for candidate SNPs or genome-wide variants;10 hence, PheWASs provide the ability to identify cross-phenotype associations in which one SNP is associated with multiple diseases or traits. 
While investigating such cross-phenotype associations at a genome-wide scale, researchers might uncover potential hidden connections between diseases, especially when two diseases share associations with two or more SNPs that are located in different regions of the genome (Figure 1). One way to examine these connections is by creating a network of diseases in which pairs of diseases are connected on the basis of their shared associations with one or more SNPs. The strength of the network approach is that it condenses the complex links between SNPs and diseases and reveals links between diseases that would be hard to identify by just looking at disease associations at a single locus, such as when one only considers cross-phenotype association with a SNP. Figure 1 Overview of Network Construction The cross-phenotype associations from a PheWAS analysis were used to construct the network of diseases. In the construction of the bipartite network, diseases (represented by yellow circles) and SNPs (represented by blue triangles) formed an edge if there was an association identified between them. Then, the bipartite network projection for the diseases was used for constructing a disease-disease network (DDN). Previous networks based on gene-disease associations, such as the Human Disease Network, used gene-disease associations cataloged in the Online Mendelian Inheritance in Man (OMIM) database.11 Other studies have used summary statistics from the GWAS catalog and/or the Genetic Associations Database (GAD) to investigate SNP-phenotype associations by using network-based analyses.6 However, because these networks are based on summary statistics from disparate studies they have several critical limitations. First, differences in disease phenotype definitions across different studies can impact the interpretation of the association results,12 leading to a high false positive rate. 
Second, in most cases, networks constructed from summary statistics are limited in their ability to provide individual-level genotype and phenotype information. This limitation can restrict the design of follow-up studies conducted on the new hypotheses derived from the network analyses. In this study, we circumvented these limitations by drawing on association results from a single-source EHR that used consistent phenotype definitions and by using a single genotyping platform. We employed genetic associations between 625,325 SNPs and 541 ICD-9 (International Classification of Diseases, Ninth Revision) diagnosis codes from a PheWAS of 38,668 unrelated individuals in Geisinger’s biobank. A disease-disease network (DDN) was constructed from the 31,017 PheWAS association results (p value < 1 × 10−4).13 The DDN revealed thousands of connections between hundreds of diseases, and it also provided a high-level view of disease connections, including known and previously unreported disease links; therefore, to identify relevant disease connections from this dense network, we focused on three broad research goals. One of the key goals was to gain a bird’s-eye view of disease connections characterized by underlying genetic associations. More specifically, when we grouped the diseases into disease classes, we asked which diseases share strong links within a disease class, as well as across different disease classes. The second goal was to integrate functional knowledge of the genome with genetic associations to ascertain biologically relevant findings. We integrated epigenomic knowledge into the DDN and examined the changes on the basis of tissue specificity. A number of recent studies have used EHR data alone to identify disease correlations and comorbidities,14, 15 so our last goal was to explain some of the disease correlations and comorbidities due to shared genetics. 
We compared the PheWAS-derived DDN to a separate network of diseases identified via an orthogonal EHR-only approach without genetics. Additionally, we used network statistics to mine the DDN for clusters of diseases with known links to one another in order to generate new hypotheses. These disease connections can serve as the basis for new hypotheses to test for comorbidities and pleiotropy. With regard to testing new hypotheses, one of the most significant advantages of our approach is the single-source EHR linked to genomic data; it provides an opportunity to revisit individual-level genotype and phenotype data to design more targeted studies and ask more specific questions. Material and Methods Cross-Phenotype Associations To construct the DDN, we used the genetic associations, identified through the PheWAS approach, that were reported in our previous study to comprehensively test for associations between 625,325 SNPs and 541 EHR-based phenotypes.13 As part of the MyCode initiative, individuals agreed to provide blood and DNA samples for broad, future research, including genomic analyses as part of the Regeneron-Geisinger DiscovEHR collaboration and linking to data in the Geisinger EHR under a protocol approved by the Geisinger Institutional Review Board. The association testing was performed on genotype and phenotype data from 38,668 unrelated individuals. We used 31,017 associations with a p value < 1 × 10−4 to generate a network between disease diagnoses derived from ICD-9 phenotypes.13 Construction of the Network Disease-Disease Network In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two node groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D,S,E), where E is an edge between two nodes. 
We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (D or S), such that the nodes in the projection for one group will form an edge if they share at least one node with the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNPs or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used the Gephi software to construct and visualize the DDN (see Web Resources). To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is a generalization of Fisher’s exact test for the one-tailed case and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method surpasses some of the limitations of GWASs or literature-based networks. 
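As an illustration of the bipartite projection and the shared-SNP test described in this section, the following is a minimal Python sketch, not the authors' pipeline (the study used Gephi and the SciPy implementation of the test). The function names `project_diseases` and `shared_snp_pvalue`, the toy associations, and the toy counts are hypothetical. In the sketch, N is the total number of SNPs, K and n are the numbers of SNPs associated with the first and second ICD-9 code, and k is the number associated with both. The tail probability is computed with exact integer arithmetic from the standard library; assuming SciPy's parameter order (population size, number of successes, number of draws), `scipy.stats.hypergeom.sf(k - 1, N, K, n)` should give the same value. The text does not state whether the CDF was evaluated at k or at k − 1; the sketch sums from k upward so that the tail includes exactly k shared SNPs.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def project_diseases(associations):
    """Project a disease-SNP bipartite edge list onto the disease nodes.

    associations: iterable of (disease, snp_or_ld_block) pairs.
    Returns {frozenset({d1, d2}): set of shared SNPs or LD blocks}.
    Two diseases get an edge if they share at least one SNP or LD block.
    """
    diseases_by_snp = defaultdict(set)
    for disease, snp in associations:
        diseases_by_snp[snp].add(disease)

    shared = defaultdict(set)
    for snp, diseases in diseases_by_snp.items():
        for d1, d2 in combinations(sorted(diseases), 2):
            shared[frozenset((d1, d2))].add(snp)
    return dict(shared)


def shared_snp_pvalue(N, K, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: total SNPs; K: SNPs associated with disease 1;
    n: SNPs associated with disease 2; k: SNPs shared by both.
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy input (hypothetical SNP identifiers, not study data).
toy_associations = [
    ("type 1 diabetes", "rs_a"),
    ("rheumatoid arthritis", "rs_a"),
    ("type 1 diabetes", "rs_b"),
    ("rheumatoid arthritis", "rs_b"),
    ("psoriasis", "rs_b"),
    ("gout", "rs_c"),
]

for pair, snps in project_diseases(toy_associations).items():
    # Edge weight in the DDN corresponds to the number of shared SNPs or LD blocks.
    print(sorted(pair), len(snps))

print(shared_snp_pvalue(N=10, K=3, n=3, k=2))
```

A disease that shares no SNP with any other disease (here, the toy "gout" entry) simply produces no edge in the projection.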
Given a population of N SNPs, wherein K are associated with a given ICD-9 code 1 and n are associated with a given ICD-9 code 2, the probability that exactly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows: $$p = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$ The cumulative sum of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 – CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower if the number of common SNP associations (k) is high relative to the number of SNPs associated with each disease. Network Statistics Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics. Hub Diseases Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node (ni) is calculated on the basis of the number of shortest paths between two other nodes (nj,nk) in the network and the number of times these paths pass through the node (ni). We computed the betweenness centrality for all pairs of nodes across the whole network. 
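To make the betweenness centrality measure described above concrete, here is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph, written with the Python standard library only. It is an illustration under stated assumptions, not Gephi's implementation: the function name `betweenness_centrality`, the adjacency-dictionary input format, and the toy graph are hypothetical, and the values are left unnormalized with each unordered pair of nodes counted once.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    adjacency: {node: set of neighbouring nodes} for an undirected graph.
    Returns {node: sum over node pairs of the fraction of shortest
    paths between the pair that pass through the node}.
    """
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search: count shortest paths (sigma) from source
        # and record each node's predecessors on those paths.
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair was visited from both endpoints, so halve the totals.
    return {v: c / 2.0 for v, c in centrality.items()}


# Toy star-shaped graph (hypothetical edges, not taken from the DDN).
toy_graph = {
    "hypothyroidism": {"morbid obesity", "type 2 diabetes", "goiter"},
    "morbid obesity": {"hypothyroidism"},
    "type 2 diabetes": {"hypothyroidism"},
    "goiter": {"hypothyroidism"},
}
scores = betweenness_centrality(toy_graph)
for disease in sorted(scores, key=scores.get, reverse=True):
    print(disease, scores[disease])
```

In the toy graph every shortest path between two peripheral nodes runs through the central node, so the central node receives the highest score, which is the property used in the text to size hub nodes.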
The mathematical notation of betweenness centrality is as follows: $$C_B(n_i) = \sum_{j,k} \frac{g_{j,k}(n_i)}{g_{j,k}}$$ where $g_{j,k}$ is the number of shortest paths linking nodes j and k, and $g_{j,k}(n_i)$ is the number of those paths passing through node i. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling the node size based on its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics. Community Detection Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used Louvain’s method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN. Tissue-Specific Functional Annotation To investigate the tissue-specific disease connections in the network, we used annotations from the 15 chromatin state models available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 We assigned the most probable chromatin state, defined via posterior probability, for each of 127 different tissues to every 200 base-pair window across the genome. 
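The per-window state assignment described above can be sketched as follows. This is a minimal illustration with hypothetical posterior values, not the Roadmap Epigenomics data format or the authors' code. The helper names `most_probable_states` and `state_at` are assumptions, as is the coordinate convention (1-based positions, with the first window covering bases 1 to 200). The three enhancer labels are the ones named in this section (Enh, EnhG, EnhBiv).

```python
WINDOW = 200  # window size in base pairs, as stated in the text
ENHANCER_STATES = {"Enh", "EnhG", "EnhBiv"}


def most_probable_states(posteriors):
    """Pick the state with the highest posterior probability per window.

    posteriors: list with one {state: posterior probability} dict per
    consecutive 200-bp window of a chromosome in one tissue.
    """
    return [max(window, key=window.get) for window in posteriors]


def state_at(position, window_states, window=WINDOW):
    """Return the state of the window containing a 1-based position."""
    return window_states[(position - 1) // window]


# Hypothetical posteriors for three consecutive windows in one tissue.
toy_posteriors = [
    {"Quies": 0.7, "Enh": 0.2, "TssA": 0.1},
    {"Enh": 0.6, "Quies": 0.3, "EnhG": 0.1},
    {"TssA": 0.5, "EnhBiv": 0.45, "Quies": 0.05},
]
states = most_probable_states(toy_posteriors)

# A SNP is treated as lying in an enhancer if its window carries an enhancer state.
snp_position = 250  # hypothetical coordinate
print(state_at(snp_position, states) in ENHANCER_STATES)
```

Taking the single most probable state discards the uncertainty in the posterior; the third toy window shows a case where an enhancer state is a close second and would be missed by this rule.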
We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides the annotations for the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start sites (TSSs); as a proof-of-concept, we only analyzed enhancer-state annotations. The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto any of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed. Results Disease-Disease Network Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10−4) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. 
Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200) (Figure 2). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 This could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. This could also be a reflection of a high correlation between disease occurrences. Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block. Diseases Connected to the Most Other Diseases Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (K = 7, Figure 3). 
With links to 32 diseases, hypothyroidism had the highest degree property (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Some comorbidities observed in the DDN were morbid obesity,32, 33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases. Figure 3 Disease Neighbors In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN. Hub Diseases in the DDN To further characterize the DDN, we applied different network statistics to identify disease nodes necessary for the cohesiveness of the network. Such nodes are also commonly referred to as hub nodes (see Material and Methods). We used a betweenness centrality measure to identify hub nodes, which are represented in the DDN by larger nodes (Figure 2). We identified many hub nodes in different disease classes across the DDN; the highest number were in endocrine disorders and included hypothyroidism, type 1 diabetes, and type 2 diabetes (Figure 2). Other main hub nodes that we observed in the DDN were psoriasis, morbid obesity, multiple sclerosis, rheumatoid arthritis, coronary atherosclerosis, and chronic kidney disease. 
Identifying Biologically Relevant Subnetworks via Epigenomics These results demonstrate that community detection is a good approach to visualizing the global and local structures of disease interaction. To further test whether the disease nodes and the connections between them are relevant to molecular mechanisms of disease, we incorporated chromatin-state annotations from the Roadmap Epigenomics Consortium and used them to extract biologically relevant subnetworks by using a similar approach. We only considered SNPs within enhancer regions for specific tissues for the current analysis. Seven tissue-specific DDNs were constructed from the shared variants in enhancer regions. The largest observed subnetwork where SNPs were in active enhancer regions was in the liver. The associated diseases for this tissue included 19 diseases, such as cirrhosis of the liver, chronic non-alcoholic liver disease, hyperlipidemia, morbid obesity, essential hypertension, and cardiovascular diseases, among others (Table S2). For adipose tissue, there were eight diseases in the subnetwork, including links between cardiovascular, nutritional, endocrine, and autoimmune diseases (Figure 4). Only two of the nodes in this subnetwork were connected to each other. Within the adipose subnetwork, we observed connections between cardiovascular diseases such as peripheral vascular disease, myocardial infarction, coronary artery disease, and abdominal aneurysm. Supporting these connections, previous studies have reported known links between increased gene expression in adipose tissue and cardiovascular diseases.24, 25 The second node was for type 1 diabetes, which had connections to psoriasis and Raynaud syndrome. Psoriasis and type 1 diabetes are both autoimmune diseases, and they share associations with the variation in the human leukocyte antigen (HLA) region. 
Numerous studies have identified strong connections between the pathogenesis of these autoimmune diseases and variations in HLA.38, 39 Figure 4 Diseases with Shared Enhancers in Adipose Tissue The highlighting of disease nodes in the network indicates that the shared SNPs between these diseases are located in the enhancer region of the nearby gene. Community Detection EHR data provide a vast amount of information pertaining to diseases. Machine-learning approaches are being applied to longitudinal EHR data so that predictive models of disease correlations, risk predictions, and comorbidities can be developed.40, 41, 42 EHR-based predictive models can be used for combining disease connections into a network similar to the DDN. To compare the DDN with networks from longitudinal EHR data, we applied a probabilistic relationship model to ICD-9 diagnoses derived from the same Geisinger longitudinal EHR data (unpublished data). These prediction models were developed under an Ising model framework,43 and all the predictions were based on EHR data alone. The Ising model is a type of Markov random field (MRF) graphical model for binary data.44 It provides an approximation of the full joint-probability distribution across hundreds of ICD-9 codes. Thus, it can help to uncover patterns of dependencies between ICD-9 codes that result from either shared genetic or environmental architecture. This predictive algorithm generated a graphical model of disease states for 500 ICD-9 codes; this model is a representation of similarities between ICD-9 codes. Then we evaluated whether we observed the same links that we identified in the PheWAS-derived DDN. Rather than comparing all the disease connections, which would be computationally intensive, we applied the community-detection method in Gephi to the DDN in order to find subnetworks algorithmically. The method found nine communities; as shown in Figure 5, the number of diseases in each community varied between clusters of 2 and 102. 
Figure 5 Disease Communities The plot shows the distribution of community disease connections, which were identified by community detection. The x axis shows the total number of communities identified, and the y axis shows the number of disease nodes in each community. Next, we selected one community that encompassed 20 diseases and showed connections between different disease classes, such as nutritional, neurological, cardiovascular, skin, and digestive-system disorders (Figure 6A). We compared this subnetwork of the DDN with the network derived from the probabilistic graphical model of disease state, wherein disease state is defined as the status of all ICD-9 code diagnoses in an individual’s EHR. We used the Ising model framework to develop the probabilistic graphical model of disease state. We checked to see whether we could observe some of the links we identified in our DDN subnetwork (identified via community detection) in the Ising model of disease state (Figure 6B). Through this independent investigation, we identified direct and indirect connections between ICD-9 codes in the Ising model network; these connections were similar to those found in the DDN. Thus, we demonstrated a probabilistic dependence between these diagnosis codes in line with what we see in our network. When we compared the diseases directly neighboring morbid obesity in both the DDN and the Ising model (Figure 6), we found many similarities. Specifically, the comorbidities that showed direct links to morbid obesity in both networks were sleep apnea,45 lumbago,46 and edema.47 These results suggest that the probabilistic dependencies observed between these diseases in the Ising model network can probably be explained by the shared genetic architecture that was identified through the DDN. 
In the DDN, we also found links between morbid obesity and cardiovascular diseases (coronary atherosclerosis and intermediate coronary syndrome), which are known comorbidities.45 Other interesting links with morbid obesity were bariatric-surgery-associated conditions, such as post-gastric absorption and post-surgical non-absorption. It is possible that these connections might be due to a diagnosis correlation that arose in the EHR when an individual underwent bariatric surgery because of their pre-existing condition of morbid obesity. Gout was also a comorbidity of morbid obesity.45 However, these diseases were connected indirectly through another comorbidity: sleep apnea. With this example, we highlight the core strength of EHR-based studies, which allow us to answer similar questions about disease relationships with different methods and thereby provide more robustness to the findings. Figure 6 Comparison of Disease-Disease Network Construction through Two Orthogonal Approaches The figure illustrates the similarities between the disease network that was constructed on the basis of genetic associations (the DDN) (A) and the probabilistic model created from longitudinal EHR data (the Ising model) (B). Discussion In this study, we generated and evaluated a network of cross-phenotype associations derived from an EHR-based PheWAS. In contrast to previous disease networks, which were built of summary statistics from disparate studies, the DDN benefits from utilizing a single source of EHR data. The network analyses performed on the DDN have illuminated deeper structures within and across disease classes. For example, autoimmune diseases are caused by dysfunctional immune systems that attack the healthy cells in a variety of organs. Type 1 diabetes, rheumatoid arthritis, and multiple sclerosis were some of the common autoimmune conditions within the DDN. 
Although these conditions have distinct symptoms, previous findings have shown strong evidence that complex interactions occur between these diseases as a result of shared genetic architecture.48, 49 The identification of these previously known findings regarding these autoimmune diseases provides support for the network approach of investigating cross-phenotype associations derived from PheWASs. In this study, the SNPs linking these autoimmune diseases mapped to 19 genes, variations in all of which were associated with increased risk of autoimmune disease (Table S2). Only two genes, C6orf10 (chromosome 6 open reading frame 10 [MIM: 618151]) and TAP2 (transporter 2, ATP-binding cassette, subfamily B [MIM: 170261]), were linked to three autoimmune diseases: type 1 diabetes, rheumatoid arthritis, and multiple sclerosis. Of the 19 genes, C2 (complement component 2 [MIM: 613927]), HCG26 (HLA complex group 26 [HGNC: 29671]), and PSMB8 (proteasome subunit beta 8 [MIM: 177046]) had no previously known associations with autoimmune diseases. However, we replicated the findings of a genetic study of one of the largest European American cohorts (UK Biobank), which revealed associations between rheumatoid arthritis and multiple sclerosis.54 Additionally, we performed a gene ontology (GO) enrichment analysis with genes shared between type 1 diabetes, multiple sclerosis, and rheumatoid arthritis. Notably, many immune-system-process-related GO terms were identified (Table S3). Using epigenomics, we found that a variant in HCG26, one of the 19 genes, is located in the enhancer region targeting LTA (lymphotoxin alpha [MIM: 153440]); the variant was identified in multiple tissues by the fine mapping approach described in Verma et al.13 (dbSNP: rs2523663). LTA is a protein-coding gene that encodes cytokines produced by lymphocytes in the immune system (see NCBI in Web Resources). 
Cytokines play an important role in the pathogenesis of various autoimmune disorders, and cytokine-inhibiting agents are key drug targets for type 1 diabetes and multiple sclerosis.50, 51, 52, 53 Because of the many key genes shared between connected diseases, along with the epigenetic regulation, cytokine-inhibiting agents may offer intervention strategies to satisfy the unmet medical needs that still exist in those connected diseases. Additionally, we identified previously unreported disease connections by using the DDN approach. For example, we found that links between morbid obesity and its known comorbidities can be explained by shared genetic associations. These comorbidities were not present in the Human Disease Network (Figure S1). This inconsistency might be explained by differences in the phenotypes used to construct the network. We also demonstrated similarities between networks formed from two distinct predictive algorithms from the same EHR system. Taken together, these results suggest that the probabilistic dependencies observed between certain diseases (e.g., morbid obesity, sleep apnea, lumbago, gout, venous insufficiency, and edema) in the Ising model can be explained by shared genetic architecture identified via our disease-disease network. With this example, we highlight the core strength of EHR-based studies: the ability to apply different approaches, such as using genetic and/or phenotypic information, in order to arrive at a stronger conclusion. The potential strength of the DDN is to identify disease connections that were not expected. From the DDN generated in this study, we found that hyperlipidemia was linked to not only atherosclerosis, but also many immune-related diseases, such as type I diabetes, psoriasis, hypothyroidism, and multiple sclerosis, as well as other immune-mediated diseases, such as allergic rhinitis, blepharitis, acute bronchitis, and herpes. 
These unexpected observations indicate the non-canonical role of the immune system in lipid-metabolizing disorders and/or the pathogenic role of hyperlipidemia in immune responses. Indeed, lymphotoxin (LT) and LIGHT, two tumor necrosis factor cytokine family members that are primarily expressed on lymphocytes, are critical regulators of key enzymes that control lipid metabolism in mouse models.54 Although further studies are warranted to infer the causality of these associations, our DDN confirmed the shared genetic risk of hyperlipidemia and immune diseases. In conclusion, community detection is a powerful method to identify and visualize cross-phenotype associations from analysis of PheWASs. It uncovered previously unreported shared links in known interactions between diseases, as well as other unreported connections between diseases. It also provided a way to generate new hypotheses to guide further targeted investigation into comorbidities, pleiotropy, and epistasis. Although we explored interconnections between multiple diseases in an EHR-based population, this approach can also be applied to many publicly available resources with summary-level and individual-level data on multiple phenotypes. Networks similar to the ones generated here could be adapted from NHANES, UK Biobank,55 GERA,56 eMERGE,57 and the Million Veteran Program, among other populations. Furthermore, we plan to extend the network analysis by including associations between genetic variants and clinical laboratory measures in the EHR. This work provides new avenues by which network-based methods can be applied to large, gene-trait-based studies to uncover the genetic underpinnings of disease. Lastly, an interactive visualization tool of the disease-disease network is available (see Web Resources). Declaration of Interests The authors declare no competing interests.
title	Introduction
p	Pleiotropy occurs when a given locus (e.g., a SNP or gene) influences two or more different phenotypes or traits. The phenome-wide association study (PheWAS) is an important tool that has the strength to identify associations between genetic variants and clinical phenotypes and also the potential to reveal pleiotropic associations among diseases.1, 2, 3, 4 Although pleiotropy often refers to a common molecular mechanism, PheWASs can identify statistical associations between a single variant and multiple phenotypes. They can also provide the basis for a statistical approach to identifying cross-phenotype associations, which can then be verified as true pleiotropic effects.1 Over the past decade, associations from hundreds of genome-wide association studies (GWASs) have accumulated in the EBI GWAS Catalog.5 Although a GWAS typically investigates a single phenotype at a time, the accumulated associations from many studies (such as those in the EBI GWAS Catalog) provide the opportunity to investigate cross-phenotype associations.6, 7 More recently, PheWASs have shown success in identifying cross-phenotype associations within the same study populations.8, 9
p	Electronic health records (EHRs) are a powerful resource for studying individual outcomes via multiple longitudinal data elements, such as disease diagnoses, laboratory measures, medications, and other health-related information. EHR data have been useful in population health research; more importantly, linking EHR data with genomics data enables us to examine the genetic architecture of various disease outcomes and traits. PheWASs have been an effective tool to mine genetic associations for candidate SNPs or genome-wide variants;10 hence, PheWASs provide the ability to identify cross-phenotype associations in which one SNP is associated with multiple diseases or traits. While investigating such cross-phenotype associations at a genome-wide scale, researchers might uncover potential hidden connections between diseases, especially when two diseases share associations with two or more SNPs that are located in different regions of the genome (Figure 1). One way to examine these connections is by creating a network of diseases in which pairs of diseases are connected on the basis of their shared associations with one or more SNPs. The strength of the network approach is that it condenses the complex links between SNPs and diseases and reveals links between diseases that would be hard to identify by just looking at disease associations at a single locus, such as when one only considers cross-phenotype association with a SNP.
label	Figure 1
caption	Overview of Network Construction The cross-phenotype associations from a PheWAS analysis were used to construct the network of diseases. In the construction of the bipartite network, diseases (represented by yellow circles) and SNPs (represented by blue triangles) formed an edge if there was an association identified between them. Then, the bipartite network projection for the diseases was used for constructing a disease-disease network (DDN).
p	Previous networks based on gene-disease associations, such as the Human Disease Network, used gene-disease associations cataloged in the Online Mendelian Inheritance in Man (OMIM) database.11 Other studies have used summary statistics from the GWAS catalog and/or the Genetic Associations Database (GAD) to investigate SNP-phenotype associations by using network-based analyses.6 However, because these networks are based on summary statistics from disparate studies, they have several critical limitations. First, differences in disease phenotype definitions across different studies can impact the interpretation of the association results,12 leading to a high false positive rate. Second, in most cases, networks constructed from summary statistics are limited in their ability to provide individual-level genotype and phenotype information. This limitation can restrict the design of follow-up studies conducted on the new hypotheses derived from the network analyses.
p	In this study, we circumvented these limitations by drawing on association results from a single-source EHR that used consistent phenotype definitions and by using a single genotyping platform. We employed genetic associations between 625,325 SNPs and 541 ICD-9 (International Classification of Diseases, Ninth Revision) diagnosis codes from a PheWAS of 38,668 unrelated individuals in Geisinger’s biobank. A disease-disease network (DDN) was constructed from the 31,017 PheWAS association results (p value < 1 × 10⁻⁴).13
p	The DDN revealed thousands of connections between hundreds of diseases, and it also provided a high-level view of disease connections, including known and previously unreported disease links; therefore, to identify relevant disease connections from this dense network, we focused on three broad research goals. One of the key goals was to gain a bird’s-eye view of disease connections characterized by underlying genetic associations. More specifically, when we grouped the diseases into disease classes, we asked which diseases share strong links within a disease class, as well as across different disease classes. The second goal was to integrate functional knowledge of the genome with genetic associations to ascertain biologically relevant findings. We integrated epigenomic knowledge into the DDN and examined the changes on the basis of tissue specificity. A number of recent studies have used EHR data alone to identify disease correlations and comorbidities,14, 15 so our last goal was to explain some of the disease correlations and comorbidities due to shared genetics. We compared the PheWAS-derived DDN to a separate network of diseases identified via an orthogonal EHR-only approach without genetics. Additionally, we used network statistics to mine the DDN for clusters of diseases with known links to one another in order to generate new hypotheses.
p	These disease connections can serve as the basis for new hypotheses to test for comorbidities and pleiotropy. With regard to testing new hypotheses, one of the most significant advantages of our approach is the single-source EHR linked to genomic data; it provides an opportunity to revisit individual-level genotype and phenotype data to design more targeted studies and ask more specific questions.
sec	Material and Methods Cross-Phenotype Associations To construct the DDN, we used the genetic associations, identified through the PheWAS approach, that were reported in our previous study to comprehensively test for associations between 625,325 SNPs and 541 EHR-based phenotypes.13 As part of the MyCode initiative, individuals agreed to provide blood and DNA samples for broad, future research, including genomic analyses as part of the Regeneron-Geisinger DiscovEHR collaboration and linking to data in the Geisinger EHR under a protocol approved by the Geisinger Institutional Review Board. The association testing was performed on genotype and phenotype data from 38,668 unrelated individuals. We used 31,017 associations with a p value < 1 × 10⁻⁴ to generate a network between disease diagnoses derived from ICD-9 phenotypes.13 Construction of the Network Disease-Disease Network In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two node groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D,S,E), where E is an edge between two nodes. We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (D or S), such that the nodes in the projection for one group will form an edge if they share at least one node with the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. 
In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNPs or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used the software Gephi to construct and visualize the DDN (see Web Resources). To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is a generalization of Fisher’s exact test for the one-tailed case and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method surpasses some of the limitations of GWASs or literature-based networks. Given a population of N SNPs, wherein K are associated with ICD-9 code 1 and n are associated with ICD-9 code 2, the probability that exactly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows: $$p = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$ The cumulative sum of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 – CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower when the number of common SNP associations (k) is high relative to the number of SNPs associated with each disease. Network Statistics Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. 
In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics. Hub Diseases Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node ($n_i$) is calculated on the basis of the number of shortest paths between two other nodes ($n_j$, $n_k$) in the network and the number of times these paths pass through the node ($n_i$). We computed the betweenness centrality of every node by considering all pairs of other nodes across the whole network. The mathematical notation of betweenness centrality is as follows: $$C_B(n_i) = \sum_{j,k} \frac{g_{j,k}(n_i)}{g_{j,k}}$$ where $g_{j,k}$ is the number of shortest paths linking nodes $j$ and $k$, and $g_{j,k}(n_i)$ is the number of those paths that pass through node $i$. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling the node size based on its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics. Community Detection Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. 
These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used Louvain’s method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN. Tissue-Specific Functional Annotation To investigate the tissue-specific disease connections in the network, we used annotations from the 15 chromatin state models available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 Using posterior probability, we assigned the most probable chromatin state for each of 127 different tissues to every 200 base-pair window across the genome. We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides the annotations for the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start sites (TSSs); as a proof-of-concept, we only analyzed enhancer-state annotations. The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. 
We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto any of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed.
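The tissue-specific annotation steps described above (averaging posterior probabilities within a functional tissue group, taking the most probable state per 200 base-pair window, and flagging SNPs that fall in an enhancer state) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary-based data layout, the 0-based coordinate convention, and the toy posterior values are all assumptions made for illustration. Only the window size and the three enhancer state labels (Enh, EnhG, EnhBiv) come from the text.

```python
# Illustrative sketch only; names and data layout are hypothetical.
ENHANCER_STATES = {"Enh", "EnhG", "EnhBiv"}  # the three enhancer states named in the text
WINDOW_SIZE = 200  # chromatin states are assigned per 200 base-pair window


def consensus_state(posteriors_by_tissue):
    """Average per-state posteriors over the tissues of one functional group
    (for a single window) and return the most probable state."""
    states = posteriors_by_tissue[0].keys()
    n_tissues = len(posteriors_by_tissue)
    mean_posterior = {
        state: sum(p[state] for p in posteriors_by_tissue) / n_tissues
        for state in states
    }
    return max(mean_posterior, key=mean_posterior.get)


def snp_in_enhancer(position, window_states):
    """Return True if the window containing `position` carries an enhancer state.

    `window_states` maps a window index to its consensus state label.
    Assumption: positions are 0-based, so window index = position // 200.
    """
    return window_states.get(position // WINDOW_SIZE) in ENHANCER_STATES


# Toy example: two tissues from one functional group, one genomic window.
# "Quies" stands in for a non-enhancer state; the values are made up.
group_posteriors = [
    {"Enh": 0.8, "Quies": 0.2},
    {"Enh": 0.5, "Quies": 0.5},
]
window_states = {2: consensus_state(group_posteriors)}
print(snp_in_enhancer(450, window_states))  # position 450 falls in window 2
```

Only SNPs passing a check like `snp_in_enhancer` for a given tissue group would then contribute associations to that tissue's DDN.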
title	Material and Methods
title	Cross-Phenotype Associations
p	To construct the DDN, we used the genetic associations, identified through the PheWAS approach, that were reported in our previous study to comprehensively test for associations between 625,325 SNPs and 541 EHR-based phenotypes.13 As part of the MyCode initiative, individuals agreed to provide blood and DNA samples for broad, future research, including genomic analyses as part of the Regeneron-Geisinger DiscovEHR collaboration and linking to data in the Geisinger EHR under a protocol approved by the Geisinger Institutional Review Board. The association testing was performed on genotype and phenotype data from 38,668 unrelated individuals. We used 31,017 associations with a p value < 1 × 10⁻⁴ to generate a network between disease diagnoses derived from ICD-9 phenotypes.13
title	Construction of the Network
sec	Disease-Disease Network In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two node groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D,S,E), where E is an edge between two nodes. We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (D or S), such that the nodes in the projection for one group will form an edge if they share at least one node with the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNPs or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used the software Gephi to construct and visualize the DDN (see Web Resources). To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is a generalization of Fisher’s exact test for the one-tailed case and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. 
Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method surpasses some of the limitations of GWASs or literature-based networks. Given a population of N SNPs, wherein K are associated with a given ICD-9 code 1 and n are associated with a given ICD-9 code 2, the probability that exactly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows: $p = \binom{K}{k}\binom{N-K}{n-k} / \binom{N}{n}$. The cumulative sum of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 – CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower if the number of common SNP associations (k) is high relative to the number of SNPs associated with each disease.
title	Disease-Disease Network
p	In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two node groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D,S,E), where E is the set of edges, each linking a node in D to a node in S. We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (D or S), such that the nodes in the projection for one group will form an edge if they share at least one neighboring node in the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNPs or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used the software Gephi to construct and visualize the DDN (see Web Resources).
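The projection step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline (the network itself was built in Gephi): the function name `project_diseases`, the input layout, and the SNP identifiers are hypothetical, and the toy associations are made up.

```python
from collections import defaultdict
from itertools import combinations


def project_diseases(associations):
    """Project a disease-SNP bipartite edge list onto the disease nodes.

    associations: iterable of (disease, snp_or_ld_block) pairs.
    Returns {frozenset({d1, d2}): number of shared SNPs or LD blocks}.
    """
    # Group the diseases attached to each SNP (or LD block) node.
    diseases_by_snp = defaultdict(set)
    for disease, snp in associations:
        diseases_by_snp[snp].add(disease)

    # Two diseases get an edge if they share at least one SNP node;
    # the edge weight counts how many SNP nodes they share.
    edge_weights = defaultdict(int)
    for diseases in diseases_by_snp.values():
        for d1, d2 in combinations(sorted(diseases), 2):
            edge_weights[frozenset((d1, d2))] += 1
    return dict(edge_weights)


# Toy, made-up associations (not data from the study).
toy_associations = [
    ("type 1 diabetes", "rsA"), ("rheumatoid arthritis", "rsA"),
    ("type 1 diabetes", "rsB"), ("rheumatoid arthritis", "rsB"),
    ("psoriasis", "rsB"), ("goiter", "rsC"),
]
ddn_edges = project_diseases(toy_associations)
```

In this toy input, goiter shares no SNP with any other disease, so it contributes no edge to the projection; the edge weight plays the role of the edge width in the DDN figures.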
p	To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is equivalent to the one-tailed Fisher’s exact test and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method surpasses some of the limitations of GWASs or literature-based networks. Given a population of N SNPs, wherein K are associated with a given ICD-9 code 1 and n are associated with a given ICD-9 code 2, the probability that exactly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows: $p = \binom{K}{k}\binom{N-K}{n-k} / \binom{N}{n}$. The cumulative sum of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 – CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower if the number of common SNP associations (k) is high relative to the number of SNPs associated with each disease.
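As a worked illustration of the probability mass function and its upper tail, the following sketch evaluates both with the standard library only. It is not the SciPy code used in the study; the function names and the toy numbers are assumptions. Note that the probability of k or more shared SNPs is 1 minus the CDF evaluated at k − 1, which in SciPy's parameterization should correspond to `scipy.stats.hypergeom.sf(k - 1, N, K, n)`.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k SNPs are shared by the two ICD-9 codes."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k): k or more shared SNPs, i.e. 1 - CDF(k - 1)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


# Toy numbers (not from the study): 20 SNPs in total, 5 associated with
# code 1, 4 associated with code 2, and 2 associated with both.
p_exact = hypergeom_pmf(2, N=20, K=5, n=4)            # 1050 / 4845
p_at_least = hypergeom_upper_tail(2, N=20, K=5, n=4)  # 1205 / 4845
```

With these toy values the chance of seeing two or more shared SNPs is about 0.25, so an overlap of 2 would not be surprising; the p value drops quickly as k approaches min(K, n).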
sec	Network Statistics Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics. Hub Diseases Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node ($n_i$) is calculated on the basis of the number of shortest paths between two other nodes ($n_j$, $n_k$) in the network and the number of times these paths pass through the node ($n_i$). We computed the betweenness centrality of every node by considering all pairs of nodes across the whole network. The mathematical notation of betweenness centrality is as follows: $C_B(n_i) = \sum_{j,k} g_{j,k}(n_i) / g_{j,k}$, where $g_{j,k}$ is the number of shortest paths linking nodes j and k, and $g_{j,k}(n_i)$ is the number of those paths passing through node i. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling each node's size by its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics.
title	Network Statistics
p	Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics.
sec	Hub Diseases Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node ($n_i$) is calculated on the basis of the number of shortest paths between two other nodes ($n_j$, $n_k$) in the network and the number of times these paths pass through the node ($n_i$). We computed the betweenness centrality of every node by considering all pairs of nodes across the whole network. The mathematical notation of betweenness centrality is as follows: $C_B(n_i) = \sum_{j,k} g_{j,k}(n_i) / g_{j,k}$, where $g_{j,k}$ is the number of shortest paths linking nodes j and k, and $g_{j,k}(n_i)$ is the number of those paths passing through node i. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling each node's size by its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics.
title	Hub Diseases
p	Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node ($n_i$) is calculated on the basis of the number of shortest paths between two other nodes ($n_j$, $n_k$) in the network and the number of times these paths pass through the node ($n_i$). We computed the betweenness centrality of every node by considering all pairs of nodes across the whole network. The mathematical notation of betweenness centrality is as follows: $C_B(n_i) = \sum_{j,k} g_{j,k}(n_i) / g_{j,k}$, where $g_{j,k}$ is the number of shortest paths linking nodes j and k, and $g_{j,k}(n_i)$ is the number of those paths passing through node i. The nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling each node's size by its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics.
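To make the betweenness definition concrete, the sketch below computes unnormalized betweenness centrality for a small undirected, unweighted graph with Brandes' algorithm. It is an illustration only: the study used Gephi's built-in statistics, and the function name, the adjacency-dict input, and the toy graph are assumptions.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adjacency: {node: set of neighbouring nodes}.
    """
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v precedes w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate path fractions from the farthest nodes back to source.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair (j, k) was visited from both endpoints.
    return {v: c / 2.0 for v, c in centrality.items()}


# Toy graph (made up): one disease linked to three others.
toy_graph = {
    "hypothyroidism": {"obesity", "type 2 diabetes", "goiter"},
    "obesity": {"hypothyroidism", "type 2 diabetes"},
    "type 2 diabetes": {"hypothyroidism", "obesity"},
    "goiter": {"hypothyroidism"},
}
hub_scores = betweenness_centrality(toy_graph)
```

In the toy graph only the first node lies on shortest paths between other nodes (goiter to obesity, and goiter to type 2 diabetes), so it is the only node with a nonzero score and would be drawn largest.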
sec	Community Detection Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used Louvain’s method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN.
title	Community Detection
p	Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used Louvain’s method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN.
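Louvain's method searches heuristically for a partition with high modularity. The sketch below does not implement that search; it only scores a given partition with the standard modularity formula for an unweighted graph, Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the summed degree of its nodes. The function name, the inputs, and the toy graph are assumptions and do not reflect Gephi's implementation.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once.
    community_of: {node: community label}.
    """
    m = len(edges)
    internal_edges = {}  # L_c: edges with both ends in community c
    degree_sum = {}      # d_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph (made up): two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in "abcdef"}

q_split = modularity(toy_edges, two_communities)  # 5 / 14
q_merged = modularity(toy_edges, one_community)   # 0
```

Splitting the toy graph into its two triangles scores higher than leaving it whole, which is the kind of improvement a modularity-based method looks for when partitioning a network.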
sec	Tissue-Specific Functional Annotation To investigate the tissue-specific disease connections in the network, we used annotations from the 15-state chromatin model available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 Using posterior probability, we assigned the most probable chromatin state for each of 127 different tissues to every 200 base-pair window across the genome. We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides the annotations for the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start sites (TSSs); as a proof-of-concept, we only analyzed enhancer-state annotations. The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto any of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed.
title	Tissue-Specific Functional Annotation
p	To investigate the tissue-specific disease connections in the network, we used annotations from the 15-state chromatin model available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 Using posterior probability, we assigned the most probable chromatin state for each of 127 different tissues to every 200 base-pair window across the genome. We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides the annotations for the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start sites (TSSs); as a proof-of-concept, we only analyzed enhancer-state annotations.
p	The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto any of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed.
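The mapping of SNP positions onto 200 base-pair chromatin-state windows can be sketched as a simple lookup. Everything about the data layout here is an assumption made for illustration: the dictionary keyed by (chromosome, window index), the 0-based position convention, the function name, and the toy SNPs and states are hypothetical and are not the authors' file formats.

```python
# The three enhancer state labels named in the text.
ENHANCER_STATES = {"Enh", "EnhG", "EnhBiv"}
WINDOW_SIZE = 200  # base pairs per chromatin-state window


def enhancer_snps(snp_positions, states_by_tissue):
    """Return, per tissue group, the SNPs that fall in an enhancer window.

    snp_positions: {snp_id: (chromosome, position)}, positions assumed 0-based.
    states_by_tissue: {tissue: {(chromosome, window_index): state_label}}.
    """
    result = {}
    for tissue, states in states_by_tissue.items():
        result[tissue] = {
            snp
            for snp, (chrom, pos) in snp_positions.items()
            # Integer division picks the 200-bp window containing the SNP.
            if states.get((chrom, pos // WINDOW_SIZE)) in ENHANCER_STATES
        }
    return result


# Toy, made-up inputs (not data from the study).
toy_snps = {"rsA": ("chr1", 450), "rsB": ("chr1", 1010)}
toy_states = {
    "liver": {("chr1", 2): "Enh", ("chr1", 5): "Quies"},
    "adipose tissue": {("chr1", 2): "TssA", ("chr1", 5): "EnhG"},
}
snps_in_enhancers = enhancer_snps(toy_snps, toy_states)
```

A tissue-specific network would then be built by keeping only the disease associations whose SNP appears in that tissue's set.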
sec	Results Disease-Disease Network Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10⁻⁴) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200) (Figure 2). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 This could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. This could also be a reflection of a high correlation between disease occurrences. Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. 
The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block. Diseases Connected to the Most Other Diseases Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (K = 7, Figure 3). With links to 32 diseases, hypothyroidism had the highest degree property (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Some comorbidities observed in the DDN were morbid obesity,32, 33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases. Figure 3 Disease Neighbors In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN. Hub Diseases in the DDN To further characterize the DDN, we applied different network statistics to identify disease nodes necessary for the cohesiveness of the network. 
Such nodes are also commonly referred to as hub nodes (see Material and Methods). We used a betweenness centrality measure to identify hub nodes, which are represented in the DDN by larger nodes (Figure 2). We identified many hub nodes in different disease classes across the DDN; the highest number were in endocrine disorders and included hypothyroidism, type 1 diabetes, and type 2 diabetes (Figure 2). Other main hub nodes that we observed in the DDN were psoriasis, morbid obesity, multiple sclerosis, rheumatoid arthritis, coronary atherosclerosis, and chronic kidney disease. Identifying Biologically Relevant Subnetworks via Epigenomics These results demonstrate that community detection is a good approach to visualizing the global and local structures of disease interaction. To further test whether the disease nodes and the connections between them are relevant to molecular mechanisms of disease, we incorporated chromatin-state annotations from the Roadmap Epigenomics Consortium and used them to extract biologically relevant subnetworks by using a similar approach. We only considered SNPs within enhancer regions for specific tissues for the current analysis. Seven tissue-specific DDNs were constructed from the shared variants in enhancer regions. The largest observed subnetwork where SNPs were in active enhancer regions was in the liver. The associated diseases for this tissue included 19 diseases, such as cirrhosis of the liver, chronic non-alcoholic liver disease, hyperlipidemia, morbid obesity, essential hypertension, and cardiovascular diseases, among others (Table S2). For adipose tissue, there were eight diseases in the subnetwork, including links between cardiovascular, nutritional, endocrine, and autoimmune diseases (Figure 4). Only two of the nodes in this subnetwork were connected to each other. 
Within the adipose subnetwork, we observed connections between cardiovascular diseases such as peripheral vascular disease, myocardial infarction, coronary artery disease, and abdominal aneurysm. Supporting these connections, previous studies have reported known links between increased gene expression in adipose tissue and cardiovascular diseases.24, 25 The second node was for type 1 diabetes, which had connections to psoriasis and Raynaud syndrome. Psoriasis and type 1 diabetes are both autoimmune diseases, and they share associations with the variation in the human leukocyte antigen (HLA) region. Numerous studies have identified strong connections between the pathogenesis of these autoimmune diseases and variations in HLA.38, 39 Figure 4 Diseases with Shared Enhancers in Adipose Tissue The highlighting of disease nodes in the network indicates that the shared SNPs between these diseases are located in the enhancer region of the nearby gene. Community Detection EHR data provide a vast amount of information pertaining to diseases. Machine-learning approaches are being applied to longitudinal EHR data so that predictive models of disease correlations, risk predictions, and comorbidities can be developed.40, 41, 42 EHR-based predictive models can be used for combining disease connections into a network similar to the DDN. To compare the DDN with networks from longitudinal EHR data, we applied a probabilistic relationship model to ICD-9 diagnoses derived from the same Geisinger longitudinal EHR data (unpublished data). These prediction models were developed under an Ising model framework,43 and all the predictions were based on EHR data alone. The Ising model is a type of Markov random field (MRF) graphical model for binary data.44 It provides an approximation of the full joint-probability distribution across hundreds of ICD-9 codes. 
Thus, it can help to uncover patterns of dependencies between ICD-9 codes that result from either shared genetic or environmental architecture. This predictive algorithm generated a graphical model of disease states for 500 ICD-9 codes; this model is a representation of similarities between ICD-9 codes. Then we evaluated whether we observed the same links that we identified in the PheWAS-derived DDN. Rather than comparing all the disease connections, which would be computationally intensive, we applied the community-detection method in Gephi to the DDN in order to find subnetworks algorithmically. The method found nine communities; as shown in Figure 5, the number of diseases in each community ranged from 2 to 102. Figure 5 Disease Communities The plot shows the distribution of community disease connections, which were identified by community detection. The x axis shows the total number of communities identified, and the y axis shows the number of disease nodes in each community. Next, we selected one community that encompassed 20 diseases and showed connections between different disease classes, such as nutritional, neurological, cardiovascular, skin, and digestive-system disorders (Figure 6A). We compared this subnetwork of the DDN with the network derived from a probabilistic graphical model of disease state, wherein disease state is defined as the status of all ICD-9 code diagnoses in an individual’s EHR. We used the Ising model framework to develop the probabilistic graphical model of disease state. We checked to see whether we could observe some of the links we identified in our DDN subnetwork (identified via community detection) in the Ising model of disease state (Figure 6B). Through this independent investigation, we identified direct and indirect connections between ICD-9 codes in the Ising model network; these connections were similar to those found in the DDN. 
Thus, we demonstrated a probabilistic dependence between these diagnosis codes in line with what we see in our network. When we compared the diseases directly neighboring morbid obesity in both the DDN and the Ising model (Figure 6), we found many similarities. Specifically, the comorbidities that showed direct links to morbid obesity in both networks were sleep apnea,45 lumbago,46 and edema.47 These results suggest that the probabilistic dependencies observed between these diseases in the Ising model network can probably be explained by the shared genetic architecture that was identified through the DDN. In the DDN, we also found links between morbid obesity and cardiovascular diseases (coronary atherosclerosis and intermediate coronary syndrome), which are known comorbidities.45 Other interesting links with morbid obesity were bariatric-surgery-associated conditions, such as post-gastric absorption and post-surgical non-absorption. It is possible that these connections might be due to a diagnosis correlation that arose in the EHR when an individual underwent bariatric surgery because of their pre-existing condition of morbid obesity. Gout was also a comorbidity of morbid obesity.45 However, these diseases were connected indirectly through another comorbidity: sleep apnea. With this example, we highlight the core strength of EHR-based studies, which allow us to answer similar questions about disease relationships with different methods and thereby provide more robustness to the findings. Figure 6 Comparison of Disease-Disease Network Construction through Two Orthogonal Approaches The figure illustrates the similarities between the disease network that was constructed on the basis of genetic associations (the DDN) (A) and the probabilistic model created from longitudinal EHR data (the Ising model) (B).
title	Results
sec	Disease-Disease Network Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10⁻⁴) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200) (Figure 2). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 This could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. This could also be a reflection of a high correlation between disease occurrences. Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. 
The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.
title	Disease-Disease Network
p	Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10⁻⁴) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200) (Figure 2). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 This could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. This could also be a reflection of a high correlation between disease occurrences. Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. 
The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.
figure	Figure 2 Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.
label	Figure 2
caption	Disease-Disease Network Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.
p	Disease-Disease Network
p	Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs. The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.
sec	Diseases Connected to the Most Other Diseases Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (K = 7, Figure 3). With links to 32 diseases, hypothyroidism had the highest degree property (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Some comorbidities observed in the DDN were morbid obesity,32, 33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases. Figure 3 Disease Neighbors In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN.
title	Diseases Connected to the Most Other Diseases
p	Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (mean K = 7, Figure 3). With links to 32 diseases, hypothyroidism had the highest degree (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Some comorbidities of hypothyroidism observed in the DDN were morbid obesity,32,33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases.
figure	Figure 3. Disease Neighbors. In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN.
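p	The degree property and its distribution can be computed directly from an edge list. The sketch below is illustrative only: the helper name `degrees` and the toy edges are assumptions, not data from the DDN.

```python
from collections import Counter, defaultdict

def degrees(edges):
    """Degree K of every node in an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

# Hypothetical toy network with five nodes.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

k = degrees(toy_edges)
mean_k = sum(k.values()) / len(k)          # average number of neighbors
degree_distribution = Counter(k.values())  # how many nodes have each degree
```

p	Plotting `degree_distribution` as a histogram gives the kind of summary shown in Figure 3.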
title	Hub Diseases in the DDN
p	To further characterize the DDN, we applied different network statistics to identify disease nodes necessary for the cohesiveness of the network. Such nodes are also commonly referred to as hub nodes (see Material and Methods). We used a betweenness centrality measure to identify hub nodes, which are represented in the DDN by larger nodes (Figure 2). We identified many hub nodes in different disease classes across the DDN; the highest number were in endocrine disorders and included hypothyroidism, type 1 diabetes, and type 2 diabetes (Figure 2). Other main hub nodes that we observed in the DDN were psoriasis, morbid obesity, multiple sclerosis, rheumatoid arthritis, coronary atherosclerosis, and chronic kidney disease.
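p	Betweenness centrality counts, for each node, the shortest paths between other node pairs that pass through it. The sketch below uses Brandes' algorithm on an undirected, unweighted graph; it is one standard way to compute the measure and is not necessarily the implementation used for the DDN (tools such as Gephi may normalize or weight the result differently). The toy edges are hypothetical.

```python
from collections import defaultdict, deque

def betweenness_centrality(edges):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy network: C bridges D and E to the A-B chain.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
bc = betweenness_centrality(toy_edges)
hub = max(bc, key=bc.get)
```

p	Nodes with the largest scores would be drawn as the largest nodes, as in Figure 2.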
title	Identifying Biologically Relevant Subnetworks via Epigenomics
p	These results demonstrate that community detection is a good approach to visualizing the global and local structures of disease interaction. To further test whether the disease nodes and the connections between them are relevant to molecular mechanisms of disease, we incorporated chromatin-state annotations from the Roadmap Epigenomics Consortium and used them to extract biologically relevant subnetworks by using a similar approach. For the current analysis, we considered only SNPs within enhancer regions for specific tissues. Seven tissue-specific DDNs were constructed from the shared variants in enhancer regions. The largest observed subnetwork in which SNPs were in active enhancer regions was in the liver. This subnetwork comprised 19 diseases, including cirrhosis of the liver, chronic non-alcoholic liver disease, hyperlipidemia, morbid obesity, essential hypertension, and cardiovascular diseases (Table S2). For adipose tissue, there were eight diseases in the subnetwork, including links between cardiovascular, nutritional, endocrine, and autoimmune diseases (Figure 4). Only two of the nodes in this subnetwork were connected to each other. Within the adipose subnetwork, we observed connections between cardiovascular diseases such as peripheral vascular disease, myocardial infarction, coronary artery disease, and abdominal aneurysm. Supporting these connections, previous studies have reported links between increased gene expression in adipose tissue and cardiovascular diseases.24,25 The second node was for type 1 diabetes, which had connections to psoriasis and Raynaud syndrome. Psoriasis and type 1 diabetes are both autoimmune diseases, and they share associations with variation in the human leukocyte antigen (HLA) region. Numerous studies have identified strong connections between the pathogenesis of these autoimmune diseases and variations in HLA.38,39
figure	Figure 4. Diseases with Shared Enhancers in Adipose Tissue. The highlighting of disease nodes in the network indicates that the shared SNPs between these diseases are located in the enhancer region of the nearby gene.
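p	The enhancer-based filtering described in this section can be sketched as an interval lookup. This is a simplified illustration under stated assumptions: enhancer regions are taken to be sorted, non-overlapping, half-open (start, end) intervals per chromosome (the convention of BED files), and the function names, coordinates, and SNP IDs are hypothetical.

```python
from bisect import bisect_right

def in_enhancer(position, enhancers):
    """True if position lies in one of the sorted, non-overlapping [start, end) intervals."""
    starts = [start for start, _ in enhancers]
    i = bisect_right(starts, position) - 1
    return i >= 0 and position < enhancers[i][1]

def tissue_subnetwork(shared_snps, snp_positions, enhancers_by_chrom):
    """Keep disease pairs that still share at least one SNP inside a tissue's enhancers."""
    kept = {}
    for pair, snps in shared_snps.items():
        hits = set()
        for snp in snps:
            chrom, pos = snp_positions[snp]
            if in_enhancer(pos, enhancers_by_chrom.get(chrom, [])):
                hits.add(snp)
        if hits:
            kept[pair] = hits
    return kept

# Hypothetical toy data for one tissue.
enhancers_by_chrom = {"chr6": [(100, 200), (500, 650)]}
snp_positions = {"snp1": ("chr6", 150), "snp2": ("chr6", 300), "snp3": ("chr1", 120)}
shared_snps = {
    ("disease_A", "disease_B"): {"snp1", "snp2"},
    ("disease_B", "disease_C"): {"snp3"},
}
subnetwork = tissue_subnetwork(shared_snps, snp_positions, enhancers_by_chrom)
```

p	Repeating the filter with each tissue's enhancer annotation would yield one subnetwork per tissue.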
title	Community Detection
p	EHR data provide a vast amount of information pertaining to diseases. Machine-learning approaches are being applied to longitudinal EHR data so that predictive models of disease correlations, risk predictions, and comorbidities can be developed.40, 41, 42 EHR-based predictive models can be used for combining disease connections into a network similar to the DDN. To compare the DDN with networks from longitudinal EHR data, we applied a probabilistic relationship model to ICD-9 diagnoses derived from the same Geisinger longitudinal EHR data (unpublished data). These prediction models were developed under an Ising model framework,43 and all the predictions were based on EHR data alone. The Ising model is a type of Markov random field (MRF) graphical model for binary data.44 It provides an approximation of the full joint-probability distribution across hundreds of ICD-9 codes. Thus, it can help to uncover patterns of dependencies between ICD-9 codes that result from either shared genetic or environmental architecture. This predictive algorithm generated a graphical model of disease states for 500 ICD-9 codes; this model is a representation of similarities between ICD-9 codes. Then we evaluated whether we observed the same links that we identified in the PheWAS-derived DDN.
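p	For reference, the general form of an Ising model over p binary variables is given below. The 0/1 coding and the symbols (θ, Z) are notational assumptions for illustration; the parameterization and fitting procedure used for the Geisinger data are not specified in this section.

```latex
% Assumption: x_j = 1 if ICD-9 code j is present in an individual's record, else 0.
P(x_1,\dots,x_p) = \frac{1}{Z(\Theta)}
  \exp\Big( \sum_{j=1}^{p} \theta_{jj}\, x_j + \sum_{j<k} \theta_{jk}\, x_j x_k \Big),
\qquad x_j \in \{0,1\}

% Conditional distribution of one code given all others (a logistic regression):
\operatorname{logit} P(x_j = 1 \mid x_{-j}) = \theta_{jj} + \sum_{k \neq j} \theta_{jk}\, x_k
```

p	Two codes j and k are joined by an edge in the graphical model when the pairwise parameter θ_jk is nonzero, that is, when they remain dependent after conditioning on all other codes.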
p	Rather than comparing all the disease connections, which would be computationally intensive, we applied the community-detection method in Gephi to the DDN in order to find subnetworks algorithmically. The method found nine communities; as shown in Figure 5, community size ranged from 2 to 102 diseases.
figure	Figure 5. Disease Communities. The plot shows the distribution of community disease connections, which were identified by community detection. The x axis shows the total number of communities identified, and the y axis shows the number of disease nodes in each community.
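p	Gephi's built-in community detection is a modularity-optimization procedure (the Louvain method); assuming that is the method referred to here, the quantity being optimized can be sketched as below. The code only scores a given partition with Newman's modularity Q; it does not perform the optimization itself, and the toy graph is hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    degree_sum = defaultdict(int)  # total degree of each community
    for node, k in degree.items():
        degree_sum[community_of[node]] += k
    # Q = sum over communities of (fraction of internal edges - expected fraction).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("C", "D")]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
```

p	A community-detection method of this kind prefers the partition with the higher Q, here the split into the two triangles.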
p	Next, we selected one community that encompassed 20 diseases and showed connections between different disease classes, such as nutritional, neurological, cardiovascular, skin, and digestive-system disorders (Figure 6A). We compared this subnetwork of the DDN with the network derived from the probabilistic graphical model of disease state, wherein disease state is defined as the status of all ICD-9 code diagnoses in an individual’s EHR. We used the Ising model framework to develop the probabilistic graphical model of disease state. We checked whether some of the links we identified in our DDN subnetwork (identified via community detection) could also be observed in the Ising model of disease state (Figure 6B). Through this independent investigation, we identified direct and indirect connections between ICD-9 codes in the Ising model network; these connections were similar to those found in the DDN. Thus, we demonstrated a probabilistic dependence between these diagnosis codes in line with what we see in our network. When we compared the diseases directly neighboring morbid obesity in both the DDN and the Ising model (Figure 6), we found many similarities. Specifically, the comorbidities that showed direct links to morbid obesity in both networks were sleep apnea,45 lumbago,46 and edema.47 These results suggest that the probabilistic dependencies observed between these diseases in the Ising model network can probably be explained by the shared genetic architecture that was identified through the DDN. In the DDN, we also found links between morbid obesity and cardiovascular diseases (coronary atherosclerosis and intermediate coronary syndrome), which are known comorbidities.45 Other interesting links with morbid obesity were bariatric-surgery-associated conditions, such as post-gastric absorption and post-surgical non-absorption. It is possible that these connections are due to a diagnosis correlation that arose in the EHR when an individual underwent bariatric surgery because of their pre-existing condition of morbid obesity. Gout was also a comorbidity of morbid obesity.45 However, these diseases were connected indirectly through another comorbidity: sleep apnea. With this example, we highlight the core strength of EHR-based studies, which allow us to answer similar questions about disease relationships with different methods and thereby provide more robustness to the findings.
figure	Figure 6. Comparison of Disease-Disease Network Construction through Two Orthogonal Approaches. The figure illustrates the similarities between the disease network that was constructed on the basis of genetic associations (the DDN) (A) and the probabilistic model created from longitudinal EHR data (the Ising model) (B).
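p	The neighbor comparison between the two networks amounts to set operations on edge lists. The sketch below uses only the links to morbid obesity that are named in the text, so the edge lists are deliberately partial and do not represent the full networks; the helper names are hypothetical, and the text does not state which network contains the path through sleep apnea to gout.

```python
def neighbors(edges, node):
    """Direct neighbors of `node` in an undirected edge list."""
    return {b if a == node else a for a, b in edges if node in (a, b)}

def intermediates(edges, a, b):
    """Nodes that connect a and b through a two-step path."""
    return (neighbors(edges, a) & neighbors(edges, b)) - {a, b}

# Partial edge lists: only links to morbid obesity named explicitly in the text.
ddn_edges = [
    ("morbid obesity", "sleep apnea"),
    ("morbid obesity", "lumbago"),
    ("morbid obesity", "edema"),
    ("morbid obesity", "coronary atherosclerosis"),
    ("morbid obesity", "intermediate coronary syndrome"),
]
ising_edges = [
    ("morbid obesity", "sleep apnea"),
    ("morbid obesity", "lumbago"),
    ("morbid obesity", "edema"),
]
shared = neighbors(ddn_edges, "morbid obesity") & neighbors(ising_edges, "morbid obesity")

# Indirect connection described in the text: morbid obesity to gout via sleep apnea.
path_edges = [("morbid obesity", "sleep apnea"), ("sleep apnea", "gout")]
via = intermediates(path_edges, "morbid obesity", "gout")
```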
title	Discussion
p	In this study, we generated and evaluated a network of cross-phenotype associations derived from an EHR-based PheWAS. In contrast to previous disease networks, which were built from summary statistics from disparate studies, the DDN benefits from utilizing a single source of EHR data. The network analyses performed on the DDN have illuminated deeper structures within and across disease classes. For example, autoimmune diseases are caused by dysfunctional immune systems that attack the healthy cells in a variety of organs. Type 1 diabetes, rheumatoid arthritis, and multiple sclerosis were some of the common autoimmune conditions within the DDN. Although these conditions have distinct symptoms, previous findings have shown strong evidence that complex interactions occur between these diseases as a result of shared genetic architecture.48,49 Recovering these known relationships among autoimmune diseases supports the network approach to investigating cross-phenotype associations derived from PheWASs.
p	In this study, the SNPs linking these autoimmune diseases mapped to 19 genes, variations in all of which were associated with increased risk of autoimmune disease (Table S2). Two genes, C6orf10 (chromosome 6 open reading frame 10 [MIM: 618151]) and TAP2 (transporter 2, ATP-binding cassette, subfamily B [MIM: 170261]), were the only genes linked to all three autoimmune diseases: type 1 diabetes, rheumatoid arthritis, and multiple sclerosis. Of the 19 genes, C2 (complement component 2 [MIM: 613927]), HCG26 (HLA complex group 26 [HGNC: 29671]), and PSMB8 (proteasome subunit beta 8 [MIM: 177046]) had no previously known associations with autoimmune diseases. However, we replicated the findings of a genetic study of one of the largest European American cohorts (UK Biobank), which revealed associations between rheumatoid arthritis and multiple sclerosis.54 Additionally, we performed a gene ontology (GO) enrichment analysis with genes shared between type 1 diabetes, multiple sclerosis, and rheumatoid arthritis. Notably, many immune-system-process-related GO terms were identified (Table S3). Using epigenomics, we found that a variant in HCG26 (dbSNP: rs2523663), one of the 19 genes, is located in the enhancer region targeting LTA (lymphotoxin alpha [MIM: 153440]); the variant was identified in multiple tissues by the fine-mapping approach described in Verma et al.13 LTA is a protein-coding gene that encodes cytokines produced by lymphocytes in the immune system (see NCBI in Web Resources). Cytokines play an important role in the pathogenesis of various autoimmune disorders, and cytokine-inhibiting agents are key drug targets for type 1 diabetes and multiple sclerosis.50,51,52,53 Because of the many key genes shared between connected diseases, along with the epigenetic regulation, cytokine-inhibiting agents may offer intervention strategies to satisfy the unmet medical needs that still exist in those connected diseases.
p	Additionally, we identified previously unreported disease connections by using the DDN approach. For example, we found that links between morbid obesity and its known comorbidities can be explained by shared genetic associations. These comorbidities were not present in the Human Disease Network (Figure S1). This inconsistency might be explained by differences in the phenotypes used to construct the network. We also demonstrated similarities between networks formed from two distinct predictive algorithms from the same EHR system. Taken together, these results suggest that the probabilistic dependencies observed between certain diseases (e.g., morbid obesity, sleep apnea, lumbago, gout, venous insufficiency, and edema) in the Ising model can be explained by shared genetic architecture identified via our disease-disease network. With this example, we highlight the core strength of EHR-based studies: the ability to apply different approaches, such as using genetic and/or phenotypic information, in order to arrive at a stronger conclusion.
p	A potential strength of the DDN is its ability to identify disease connections that were not expected. From the DDN generated in this study, we found that hyperlipidemia was linked not only to atherosclerosis but also to many immune-related diseases, such as type 1 diabetes, psoriasis, hypothyroidism, and multiple sclerosis, as well as other immune-mediated diseases, such as allergic rhinitis, blepharitis, acute bronchitis, and herpes. These unexpected observations suggest a non-canonical role of the immune system in lipid-metabolizing disorders and/or a pathogenic role of hyperlipidemia in immune responses. Indeed, lymphotoxin (LT) and LIGHT, two tumor necrosis factor cytokine family members that are primarily expressed on lymphocytes, are critical regulators of key enzymes that control lipid metabolism in mouse models.54 Although further studies are warranted to infer the causality of these associations, our DDN confirmed the shared genetic risk of hyperlipidemia and immune diseases.
p	In conclusion, community detection is a powerful method to identify and visualize cross-phenotype associations from analysis of PheWASs. It uncovered previously unreported shared links in known interactions between diseases, as well as other unreported connections between diseases. It also provided a way to generate new hypotheses to guide further targeted investigation into comorbidities, pleiotropy, and epistasis. Although we explored interconnections between multiple diseases in an EHR-based population, this approach can also be applied to many publicly available resources with summary-level and individual-level data on multiple phenotypes. Networks similar to the ones generated here could be adapted from NHANES, UK BioBank,55 GERA,56 eMERGE,57 and the Million Veteran Program, among other populations. Furthermore, we plan to extend the network analysis by including associations between genetic variants and clinical laboratory measures in EHR. This work provides new avenues by which network-based methods can be applied to large, gene-trait-based studies to uncover the genetic underpinnings of disease.
p	Lastly, an interactive visualization tool of the disease-disease network is available (see Web Resources).
title	Declaration of Interests
p	The authors declare no competing interests.
title	Web Resources
p	Disease-Disease Network Visualization Tool, https://www.biomedinfolab.com/software
p	eMERGE, https://emerge.mc.vanderbilt.edu
p	Gephi, https://gephi.org
p	Million Veteran Program, https://www.research.va.gov/mvp/
p	NCBI, https://www.ncbi.nlm.nih.gov/gene/4049
p	NHANES, https://www.cdc.gov/nchs/nhanes/index.htm
p	OMIM, http://www.omim.org/
p	Roadmap Epigenomics Project, http://www.roadmapepigenomics.org/data/
p	UK BioBank, https://www.ukbiobank.ac.uk
title	Supplemental Data
caption	Document S1. Figures S1 and Tables S2 and S3
caption	Table S1. Summary Information of Disease Pairs and the Number of Shared SNP Associations Used for Creating the Disease-Disease Network
caption	Document S2. Article plus Supplemental Data
title	Acknowledgments
p	This work was supported by the National Library of Medicine (NLM) R01 NL012535. This project is also funded, in part, by a grant provided by the Pennsylvania Department of Health (#SAP 4100070267). The Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions.
footnote	Supplemental Data include one figure and three tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.11.006.
