Material and Methods

Cross-Phenotype Associations
To construct the DDN, we used the genetic associations, identified through the PheWAS approach, that were reported in in our previous study to comprehensively test for associations between 625,325 SNPs and 541 EHR-based phenotypes.13 As part of MyCode initiative, individuals agreed to provide blood and DNA samples for broad, future research, including genomic analyses as part of the Regeneron-Geisinger DiscovEHR collaboration and linking to data in the Geisinger EHR under a protocol approved by the Geisinger Institutional Review Board. The association testing was performed on genotype and phenotype data from 38,668 unrelated individuals. We used 31,017 associations with a p value < 1 × 10−4 to generate a network between disease diagnoses derived from ICD-9 phenotypes.13

Construction of the Network

Disease-Disease Network
In a bipartite network, the edges (E) are only formed between two distinct node groups. Different network objects, commonly represented as circles or dots, are referred to as nodes, and the connections drawn between these nodes are referred to as edges. The two nod groups in our DDN are diseases (D) and SNPs (S), and these two groups can be represented in a network by N = (D,S,E), where E is an edge between two nodes. We also accounted for the linkage disequilibrium (LD) correlations between the SNPs in the association results used for construction of the network. Therefore, S can be either a SNP or an LD haplotype block shared between the two diseases (D). One can further compress the information in a bipartite network by projecting the network for each group of nodes (DorS), such that the nodes in the projection for one group will form an edge if they share at least one node with the other group. We constructed a bipartite network projection of diseases on the basis of shared SNP associations identified in the PheWAS analysis. In the DDN, nodes represent disease diagnoses, and two nodes are connected to each other when they share one or more SNP or an LD haplotype block (Figure 1 and Table S1). Further, we divided the ICD-9 codes into broader disease classes based on the ICD-9 categories reclassified by Rassekh et al.16 We used a software called Gephi to construct and visualize the DDN (see Web Resources).
To evaluate the strength of the associations, we applied the hypergeometric test (SciPy implementation) to calculate the probability that an ICD-9 code shared associated SNPs with another ICD-9 code as a result of pure chance. The hypergeometric test is a generalization of Fisher’s exact test for the one-tailed case and has been applied to gene-set enrichment tests,17, 18, 19 gene-GO term-association tests,20 and quantification of mosaicism,21 among other tests. Because our genetic association data come from a single source, the number of SNPs associated with each disease can be compared, and thus this method surpasses some of the limitations of GWASs or literature-based networks. Given a population of N SNPs, wherein K is associated with given ICD-9 code 1 and n is associated with given ICD-9 code 2, the probability that strictly k SNPs are associated with both ICD-9 codes is given by the probability mass function as follows:p=( N−KCn−k)( KCk) NCnThe integral of this function is called the cumulative distribution function (CDF). To get the probability that k or more SNPs are associated with both ICD-9 codes, we took (1 – CDF), the complementary cumulative distribution function (CCDF). Generally, the p value for a disease-disease association will be lower if the number of common SNP associations (k) is higher than the number of SNPs associated with each disease.

Network Statistics
Network statistics allow for the descriptive characterization of a network graph and the identification of meaningful connections. In this study, we applied various network analysis approaches to the DDN to identify the most crucial disease nodes, as well as to automate the extraction of disease cluster subnetworks. We used the statistical packages available as plug-ins within Gephi to perform all of the network analytics.

Hub Diseases
Hub nodes are those that have significantly more edges than other nodes. These nodes are important because they play a critical role in the centrality of the network. There are a number of ways to measure centrality of a network and, hence, identify hub disease nodes. In this case, we used a measure called betweenness centrality to identify such nodes in the DDN. Betweenness centrality for a given node (ni) is calculated on the basis of the number of shortest paths between two other nodes (nj,nk) in the network and the number of times these paths pass through the node (ni). We computed the betweenness centrality for all pairs of nodes across the whole network. The mathematical notation of betweenness centrality is as follows:CB(ni)=∑j,kgj,k(ni)gj,kgj,kShortestpathlinkingnodejandkgj,k(ni)NumberofpathspassingthroughnodeiThe nodes with a high betweenness centrality value tend to be most important for keeping the network connected. We used this measure to change the representation of the nodes in the network by scaling the node size based on its betweenness centrality. In this way, we were able to visually identify the most important disease nodes in the network on the basis of network statistics.

Community Detection
Community detection is an approach used in network analytics to partition a large, densely connected network into smaller subnetworks.22, 23 Various community-detection methods can algorithmically identify meaningful subnetworks. These methods have most commonly been applied in social network analyses for the detection of structure in social interactions.24 We used Louvain’s method,22, 25 which is implemented in Gephi as the “modularity” feature, to partition the DDN and detect subnetworks, or communities, of diseases (see Web Resources). The communities detected had varying types of disease nodes. We used the identified disease communities to further investigate the biological interpretation of disease connections in the DDN.

Tissue-Specific Functional Annotation
To investigate the tissue-specific disease connections in the network, we used annotations from the 15 chromatin state models available on the Roadmap Epigenomics website to assign chromatin states to different tissues.26 Using posterior probability, we assigned the most probable chromatin states for 127 different tissues, defined via posterior probability, to every 200 base-pair window across the genome. We also consolidated the 127 different tissues into 27 functional groups of tissues; for example, we used four different adipose tissues for the chromatin-state prediction, but we consolidated these into one group called “adipose tissue.”27 To calculate the most probable chromatin state for each functional tissue category, we averaged the posterior probabilities.27 The chromatin-state prediction provides the annotations for the most active to the most quiescent regions of the non-coding genome. In this study, we focused on the active regulatory elements, such as enhancers, promoters, and active transcription start site (TSS); as a proof-of-concept, we only analyzed enhancer-state annotations.
The chromosome base pair position of each SNP was mapped onto the annotated chromatin states of the 27 functional groups of tissues. We considered variants to belong inside enhancer regions when a chromosome base pair position mapped onto either of the three enhancer states: enhancer (Enh), genic enhancer (EnhG), and bivalent enhancer (EnhBiv). Then, a total of seven DDNs were constructed from the associations between SNPs in enhancer regions. For visualization, we overlaid the networks created for each tissue onto the original DDN we had constructed.