Results

Disease-Disease Network

Using the cross-phenotype associations found in the EHR-based PheWAS analysis, we constructed a disease-disease network (DDN) in order to understand the genetic similarities between human diseases (Figure 1). The network consists of 385 ICD-9-based disease diagnoses (which we obtained from an original 541 ICD-9 codes by using a threshold of p < 1 × 10⁻⁴) acting as nodes and the 1,398 edges connecting them. As shown in Figure 2, we classified ICD-9 codes into 15 broad disease classes, labeled with different colors. The DDN provides a bird’s-eye view of the interconnections between the diseases on the basis of shared genetic associations. Many interconnections, including those between endocrine, musculoskeletal, and neurological disorders, were observed across classes. The strongest connections (indicated by the thickness of the network lines in Figure 2), which are based on the highest number of shared genetic variants, were between autoimmune disorders such as type 1 diabetes (MIM: 222100), rheumatoid arthritis (MIM: 180300), psoriasis (MIM: 177900), and multiple sclerosis (MIM: 126200). These links are consistent with previous findings suggesting that these autoimmune diseases are determined by shared genetic components, indicating similar pathogenic mechanisms, even if completely different tissue types are affected in each disorder.28, 29, 30, 31 The links could indicate that there are shared genetic pathways linking multiple SNPs to the same diseases. Alternatively, they could reflect a high correlation between disease occurrences.

Figure 2. Disease-Disease Network

Using the cross-phenotype associations from an EHR-based PheWAS, we generated the disease-disease network (DDN). In this network, nodes represent the diseases, and the edges (lines) between the nodes represent shared genetic associations between pairs of diseases. The color of the node represents the broader disease category to which it belongs.
The size of the node indicates the importance of the node in the network; importance was based on the betweenness centrality measure. The bigger nodes have higher betweenness centrality, and these nodes are referred to as hub nodes. The width of the edges (lines) represents the number of shared variants or variants in an LD block.

Diseases Connected to the Most Other Diseases

Next, we focused on the disease nodes with the highest number of direct connections with other diseases in the network. The degree property (K) of the network represents the number of neighbors for each node. We observed that on average each disease shares direct links with seven other diseases (K = 7, Figure 3). With links to 32 diseases, hypothyroidism had the highest degree property (K = 32) in the network. In hypothyroidism, a disorder of the endocrine system, the thyroid gland does not produce enough thyroid hormones, and this deficiency can lead to the development of other diseases. Comorbidities of hypothyroidism observed in the DDN included morbid obesity,32, 33 type 2 diabetes mellitus (MIM: 125853),34 vitamin D deficiency,35 hypertensive heart disease,36 thyroid cancer, and rheumatoid arthritis.37 On the other end of the scale, five diseases (blepharitis; “acute, but ill-defined, cerebrovascular disease”; hyposmolality and/or hyponatremia; pain in joint; and goiter) had links to only one neighboring disease (K = 1). Thus, representing cross-phenotype associations in the form of networks enabled visualization of complex interconnections between different diseases.

Figure 3. Disease Neighbors

In a network, the degree property is the number of direct connections between one node and other nodes. This plot presents the distribution of degrees observed in the DDN.

Hub Diseases in the DDN

To further characterize the DDN, we applied different network statistics to identify disease nodes necessary for the cohesiveness of the network.
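The network quantities used in this section (edge width as the number of shared variants, the degree property K, and betweenness centrality) can be illustrated with a minimal sketch. The variant IDs and associations below are hypothetical toy data, not results from the study, and the function names are chosen for illustration only. The betweenness routine is a plain, unnormalized implementation of Brandes' algorithm for unweighted, undirected graphs; the study's own computation might differ in normalization or weighting.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical toy PheWAS hits: (variant, disease) pairs that passed the threshold.
associations = [
    ("rs1", "hypothyroidism"), ("rs1", "type 1 diabetes"),
    ("rs2", "hypothyroidism"), ("rs2", "type 1 diabetes"),
    ("rs3", "hypothyroidism"), ("rs3", "morbid obesity"),
    ("rs4", "morbid obesity"), ("rs4", "sleep apnea"),
]

def build_ddn(pairs):
    """Return {frozenset({disease_a, disease_b}): number of shared variants}."""
    by_variant = defaultdict(set)
    for variant, disease in pairs:
        by_variant[variant].add(disease)
    weights = defaultdict(int)
    for diseases in by_variant.values():
        for a, b in combinations(sorted(diseases), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)

def adjacency(weights):
    """Return {disease: set of neighboring diseases}; len() of a set is K."""
    adj = defaultdict(set)
    for edge in weights:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes, unweighted, undirected)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from source
        dist = dict.fromkeys(adj, -1)   # -1 marks "not yet visited"
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                    # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints is counted twice in an undirected graph.
    return {v: c / 2 for v, c in score.items()}

weights = build_ddn(associations)
adj = adjacency(weights)
degree = {disease: len(nbrs) for disease, nbrs in adj.items()}
mean_degree = sum(degree.values()) / len(degree)
hubs = betweenness(adj)
```

In this toy network, the edge between hypothyroidism and type 1 diabetes has weight 2 because two variants are shared, and the two interior nodes of the resulting chain receive the highest betweenness.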
Such nodes are also commonly referred to as hub nodes (see Material and Methods). We used a betweenness centrality measure to identify hub nodes, which are represented in the DDN by larger nodes (Figure 2). We identified many hub nodes in different disease classes across the DDN; the highest number was among endocrine disorders and included hypothyroidism, type 1 diabetes, and type 2 diabetes (Figure 2). Other main hub nodes that we observed in the DDN were psoriasis, morbid obesity, multiple sclerosis, rheumatoid arthritis, coronary atherosclerosis, and chronic kidney disease.

Identifying Biologically Relevant Subnetworks via Epigenomics

These results demonstrate that community detection is a good approach to visualizing the global and local structures of disease interaction. To further test whether the disease nodes and the connections between them are relevant to molecular mechanisms of disease, we incorporated chromatin-state annotations from the Roadmap Epigenomics Consortium and used them to extract biologically relevant subnetworks by using a similar approach. For the current analysis, we considered only SNPs within enhancer regions for specific tissues. Seven tissue-specific DDNs were constructed from the shared variants in enhancer regions. The largest subnetwork of SNPs in active enhancer regions was observed for the liver. This subnetwork comprised 19 diseases, including cirrhosis of the liver, chronic non-alcoholic liver disease, hyperlipidemia, morbid obesity, essential hypertension, and cardiovascular diseases (Table S2). For adipose tissue, there were eight diseases in the subnetwork, including links between cardiovascular, nutritional, endocrine, and autoimmune diseases (Figure 4). Only two of the nodes in this subnetwork were connected to each other.
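The enhancer-based filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the variant coordinates, associations, and enhancer intervals are hypothetical, intervals are treated as half-open (start inclusive, end exclusive) in the style of BED files, and `tissue_subnetwork` is a name invented here rather than a function from the study.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical inputs: variant coordinates, PheWAS hits per variant,
# and enhancer intervals per tissue as (chromosome, start, end).
variant_pos = {"rs1": ("chr6", 1500), "rs2": ("chr6", 9000), "rs3": ("chr1", 400)}
hits = {
    "rs1": {"type 1 diabetes", "psoriasis"},
    "rs2": {"type 1 diabetes", "hypothyroidism"},
    "rs3": {"myocardial infarction", "coronary artery disease"},
}
enhancers = {
    "adipose": [("chr6", 1000, 2000), ("chr1", 300, 500)],
    "liver": [("chr6", 8000, 9500)],
}

def in_enhancer(pos, intervals):
    """True if the position falls inside any half-open interval on its chromosome."""
    chrom, bp = pos
    return any(c == chrom and start <= bp < end for c, start, end in intervals)

def tissue_subnetwork(tissue):
    """Edges between diseases that share a variant in an enhancer of `tissue`.

    Returns {(disease_a, disease_b): number of shared enhancer variants},
    with each pair in alphabetical order.
    """
    edges = defaultdict(int)
    for variant, diseases in hits.items():
        if in_enhancer(variant_pos[variant], enhancers[tissue]):
            for a, b in combinations(sorted(diseases), 2):
                edges[(a, b)] += 1
    return dict(edges)

adipose_edges = tissue_subnetwork("adipose")
liver_edges = tissue_subnetwork("liver")
```

Running the same edge-building step on a tissue-restricted variant set is one simple way to obtain a tissue-specific network; a real analysis would read the chromatin-state annotations from files rather than from in-memory dictionaries.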
Within the adipose subnetwork, we observed connections between cardiovascular diseases such as peripheral vascular disease, myocardial infarction, coronary artery disease, and abdominal aneurysm. Supporting these connections, previous studies have reported links between increased gene expression in adipose tissue and cardiovascular diseases.24, 25 The second node was for type 1 diabetes, which had connections to psoriasis and Raynaud syndrome. Psoriasis and type 1 diabetes are both autoimmune diseases, and they share associations with variation in the human leukocyte antigen (HLA) region. Numerous studies have identified strong connections between the pathogenesis of these autoimmune diseases and variations in HLA.38, 39

Figure 4. Diseases with Shared Enhancers in Adipose Tissue

The highlighting of disease nodes in the network indicates that the shared SNPs between these diseases are located in the enhancer region of the nearby gene.

Community Detection

EHR data provide a vast amount of information pertaining to diseases. Machine-learning approaches are being applied to longitudinal EHR data so that predictive models of disease correlations, risk predictions, and comorbidities can be developed.40, 41, 42 EHR-based predictive models can be used for combining disease connections into a network similar to the DDN. To compare the DDN with networks from longitudinal EHR data, we applied a probabilistic relationship model to ICD-9 diagnoses derived from the same Geisinger longitudinal EHR data (unpublished data). These prediction models were developed under an Ising model framework,43 and all the predictions were based on EHR data alone. The Ising model is a type of Markov random field (MRF) graphical model for binary data.44 It provides an approximation of the full joint-probability distribution across hundreds of ICD-9 codes.
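For reference, a common textbook form of the Ising model for binary variables is given below. It is shown as background only: the 0/1 coding of each ICD-9 code (absent or present in an individual's record) and the symbols \theta_i and \theta_{ij} are assumptions made here for illustration, and the exact parameterization used for the study's model is not stated in this section.

```latex
P(x_1, \dots, x_p) = \frac{1}{Z(\theta)}
  \exp\Big( \sum_{i} \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j \Big),
  \qquad x_i \in \{0, 1\}

P(x_i = 1 \mid x_{-i}) =
  \frac{1}{1 + \exp\big( -\theta_i - \sum_{j \neq i} \theta_{ij} x_j \big)}
```

Here Z(\theta) is the normalizing constant. In this form, \theta_{ij} = 0 means that codes i and j are conditionally independent given all other codes, so the nonzero pairwise parameters define the edges of the resulting graph.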
Thus, it can help to uncover patterns of dependencies between ICD-9 codes that result from either shared genetic or environmental architecture. This predictive algorithm generated a graphical model of disease states for 500 ICD-9 codes; this model is a representation of similarities between ICD-9 codes. Then we evaluated whether we observed the same links that we identified in the PheWAS-derived DDN. Rather than comparing all the disease connections, which would be computationally intensive, we applied the community-detection method in Gephi to the DDN in order to find subnetworks algorithmically. The method found nine communities; as shown in Figure 5, the number of diseases per community ranged from 2 to 102.

Figure 5. Disease Communities

The plot shows the distribution of disease nodes across the communities identified by community detection. The x axis shows the communities identified, and the y axis shows the number of disease nodes in each community.

Next, we selected one community that encompassed 20 diseases and showed connections between different disease classes, such as nutritional, neurological, cardiovascular, skin, and digestive-system disorders (Figure 6A). We compared this subnetwork of the DDN with the network derived from the probabilistic graphical model of disease state, wherein disease state is defined as the status of all ICD-9 code diagnoses in an individual’s EHR. We used the Ising model framework to develop the probabilistic graphical model of disease state. We then checked whether some of the links identified in our DDN subnetwork (found via community detection) could also be observed in the Ising model of disease state (Figure 6B). Through this independent investigation, we identified direct and indirect connections between ICD-9 codes in the Ising model network; these connections were similar to those found in the DDN.
Thus, we demonstrated a probabilistic dependence between these diagnosis codes in line with what we see in our network. When we compared the diseases directly neighboring morbid obesity in the DDN with those in the Ising model (Figure 6), we found many similarities. Specifically, the comorbidities that showed direct links to morbid obesity in both networks were sleep apnea,45 lumbago,46 and edema.47 These results suggest that the probabilistic dependencies observed between these diseases in the Ising model network can probably be explained by the shared genetic architecture that was identified through the DDN. In the DDN, we also found links between morbid obesity and cardiovascular diseases (coronary atherosclerosis and intermediate coronary syndrome), which are known comorbidities.45 Other interesting links with morbid obesity were bariatric-surgery-associated conditions, such as post-gastric absorption and post-surgical non-absorption. It is possible that these connections are due to a diagnosis correlation that arose in the EHR when an individual underwent bariatric surgery because of their pre-existing condition of morbid obesity. Gout was also a comorbidity of morbid obesity.45 However, these diseases were connected indirectly through another comorbidity: sleep apnea. With this example, we highlight the core strength of EHR-based studies, which allow us to answer similar questions about disease relationships with different methods and thereby make the findings more robust.

Figure 6. Comparison of Disease-Disease Network Construction through Two Orthogonal Approaches

The figure illustrates the similarities between the disease network that was constructed on the basis of genetic associations (the DDN) (A) and the probabilistic model created from longitudinal EHR data (the Ising model) (B).
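The neighbor comparison between the two networks can be sketched in a few lines. The edge lists below are toy data that only loosely mirror the relationships named in the text; "hypothetical disease X" is a placeholder, and the Jaccard index is one possible overlap summary chosen here for illustration, not a statistic reported in the study.

```python
# Toy edge lists (illustrative only, not the study's networks).
ddn_edges = [
    ("morbid obesity", "sleep apnea"),
    ("morbid obesity", "lumbago"),
    ("morbid obesity", "edema"),
    ("morbid obesity", "coronary atherosclerosis"),
]
ising_edges = [
    ("morbid obesity", "sleep apnea"),
    ("morbid obesity", "lumbago"),
    ("morbid obesity", "edema"),
    ("morbid obesity", "hypothetical disease X"),
]

def neighbors(edges, node):
    """Set of nodes directly linked to `node` in an undirected edge list."""
    return {b if a == node else a for a, b in edges if node in (a, b)}

def neighbor_overlap(edges_a, edges_b, node):
    """Return (shared direct neighbors, Jaccard index of the two neighbor sets)."""
    n_a, n_b = neighbors(edges_a, node), neighbors(edges_b, node)
    union = n_a | n_b
    shared = n_a & n_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

shared, jaccard = neighbor_overlap(ddn_edges, ising_edges, "morbid obesity")
```

With these toy inputs, the shared direct neighbors are sleep apnea, lumbago, and edema, which matches the overlap described in the text for morbid obesity.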