Material and Methods

Extraction of HPO Terms by Automatic Concept Recognition

Concept recognition (CR) extracts ontology terms from text with the aim of leveraging structured knowledge from unstructured data. For example, CR might be able to identify the term “macrocephaly” (HP: 0000256) within an abstract that contains the phrase “large head” because the latter is listed as a synonym in the entry HP: 0000256. Published CR approaches rely on direct dictionary lookup combined with stemming and word-permutation algorithms38 or use natural-language-processing pipelines with techniques such as sentence splitting, tokenization, and part-of-speech tagging.39 In our experiments we used a CR tool specifically tailored to address the challenges of extracting phenotype concepts: the Bio-LarK Concept Recognizer.40 Bio-LarK uses a two-step approach to index and retrieve ontology terms in combination with a series of language techniques to enable term normalization. In addition to providing standard CR, the system is able to decompose and align conjunctive terms (e.g., “short and broad fingers” aligns to “short finger” [HP: 0009381] and “broad finger” [HP: 0001500]), as well as recognize and process non-canonical phenotypes, such as “fingers are short and broad,” which would be aligned to the same terms as in the previous example. Our current CR approach does not attempt to detect negation, which might represent a cause of false-positive results. However, because of the post-processing steps used to generate the final annotations on the basis of threshold values for annotation frequency and IC (see below), our procedure will not, in general, be sensitive to isolated negative assertions.
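To make the dictionary-lookup strategy concrete, the following minimal Python sketch indexes a handful of HPO labels and synonyms and scans normalized text for them. It is an illustration of direct dictionary lookup only, not the Bio-LarK system itself; the synonym table, normalization, and function name are assumptions made for this sketch.

```python
# A minimal illustration of dictionary-based concept recognition:
# HPO labels and synonyms are indexed in a lookup table, and any
# phrase found in the normalized text yields its HPO identifier.
# The toy synonym table below is an assumption for this sketch.
import re

SYNONYMS = {
    "macrocephaly": "HP:0000256",
    "large head": "HP:0000256",      # synonym listed under HP:0000256
    "short finger": "HP:0009381",
    "broad finger": "HP:0001500",
}

def recognize(text: str) -> set[str]:
    """Return the HPO IDs whose label or synonym occurs in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {hpo_id for phrase, hpo_id in SYNONYMS.items() if phrase in normalized}

print(recognize("The patient presented with a large   head."))
# {'HP:0000256'}
```

A real system additionally handles stemming, word order, and conjunction splitting (e.g., aligning “short and broad fingers” to two separate terms), which this sketch omits.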
PubMed-MEDLINE 2014 Corpus

The CR process was performed on the 2014 release of the PubMed-MEDLINE corpus. The corpus contains 22,376,811 articles, of which 13,262,617 have a valid title and abstract (most of the missing entries represent articles in languages other than English, and only their titles are listed). MEDLINE abstracts are associated with a series of medical subject headings (MeSHs); the main headings (descriptors) provide a schematic description of the topic of the article. The descriptors are divided into 16 categories, including category C, “diseases.” Category C contains 4,620 unique entries, and we refer to it here as “MeSH diseases.” We note that although MeSH category C is described as comprising diseases, many of the terms in the complete tree C (4,620 entries) do not refer to specific diseases. For instance, many of the terms describe general categories, such as “brain diseases” (MeSH: D001927), veterinary diseases (e.g., “brucellosis, bovine” [MeSH: D002007]), and various other entities, such as “cadaver” (MeSH: D002102). Others represent phenotypic features of diseases rather than actual disease entities; one example is “Cheyne-Stokes respiration” (MeSH: D002639), which is an abnormal breathing pattern that can be observed in diseases such as central sleep apnea syndrome. We excluded such MeSH entries by careful manual curation, leaving a total of 3,145 MeSH category C descriptors that we judged to actually represent specific disease entries. Only these entries were used for the analysis described in this manuscript. We filtered the 13,262,617 abstracts on the basis of the MeSH terms to retain only those abstracts that included at least one of the 3,145 disease entries from the MeSH disease list and then processed them with the Bio-LarK Concept Recognizer. In some cases, a single abstract was annotated with multiple MeSH disease terms, some of which were also flagged as major topics of the article under scrutiny. For the purpose of this analysis, we included all abstracts regardless of the number of associated MeSH terms or their major-topic status.

Filtering HPO Annotations

If many abstracts that describe a given disease also mention a certain HPO term, then that disease is more likely to be characterized by the corresponding phenotypic abnormality. For instance, the PubMed abstract with the PubMed identifier PMID: 23886833 is indexed with the MeSH term “encephalitis, herpes simplex,” and parsing the record with Bio-LarK reveals a number of HPO terms, including “headache” (HP: 0002315). One might therefore be tempted to conclude that this type of encephalitis can be characterized by headaches, but from this single observation it cannot be guaranteed that the abstract is indeed making this assertion. The abstract could, for instance, be describing an adverse effect of a medication, a differential diagnosis, or one of a number of other things. We reasoned that if an HPO term were identified in multiple abstracts associated with a given disease from the MeSH disease list, then it would be more likely to represent a genuine phenotypic abnormality associated with the disease. However, frequency alone is not a strong enough indicator of a correct association between a phenotype and a disease. Ideally, the phenotype should also be specific to (i.e., present only in a limited number of) certain diseases. Given this required balance, we developed a procedure that aims to distinguish the true annotations on the basis of three metrics: (1) the balance between frequency and specificity; (2) the IC of the term, i.e., the overall degree of specificity of the term in our corpus of diseases; and (3) the disease-category-driven density of a subset of terms, based on the shortest path between them in the HPO. The balance between frequency and specificity is measured with a standard information-retrieval technique: term frequency-inverse document frequency (TFIDF). TFIDF weights HPO terms highly if they occur with high frequency among abstracts annotated to a disease but down-weights terms that are common within the entire corpus (see the following section). Figure 1 summarizes the algorithm we have developed. It takes as input the initial set of HPO terms and, using three tuning parameters, produces a final set of candidates. The three tuning parameters control term cutoffs at different stages: (1) n, which defines the initial TFIDF threshold used for creating the clustering seeds; (2) m, which defines a second specificity threshold (over TFIDFIC; see the following section) used for pruning terms left over from the first threshold; and (3) e, which defines the density margin that dictates the inclusion or exclusion of a term in a cluster. The algorithm consists of three steps. First, the initial set of terms is filtered with TFIDF for the creation of clustering seeds (lines 1–3). Second, these clustering seeds are grouped according to their common top-level HPO ancestor, i.e., the top-level HPO abnormality (e.g., of the blood or the skeletal system; line 4). The intuition here is that most diseases affect, in principle, a very limited number of major organs, and hence, most true positives will be grouped according to these major organs (corresponding to the top-level HPO phenotypic-abnormality terms).
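A minimal sketch of these first two steps (seed filtering and grouping) is given below; the third step is described next. The score dictionary, ancestor map, and threshold value are illustrative assumptions, not values from our pipeline.

```python
# A simplified sketch of steps 1-2 of the algorithm in Figure 1:
# TFIDF-based seed selection, then grouping of seeds under their
# top-level HPO ancestor. All inputs here are assumed, precomputed maps.
from collections import defaultdict

def select_and_group_seeds(tfidf, top_level_ancestor, n):
    """tfidf: {hpo_id: TFIDF score}; top_level_ancestor: {hpo_id: top-level
    HPO abnormality id}; n: the initial TFIDF threshold.
    Returns (groups, leftovers)."""
    seeds = {t for t, score in tfidf.items() if score >= n}   # step 1 (lines 1-3)
    leftovers = set(tfidf) - seeds                            # retained for later pruning
    groups = defaultdict(set)
    for t in seeds:                                           # step 2 (line 4)
        groups[top_level_ancestor[t]].add(t)
    return dict(groups), leftovers
```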
Once the clustering seeds are grouped, we look for the group-based subset of terms that forms the single shortest ontological path among them (i.e., the subgroup with the minimum density; lines 5–7). This can be seen as an inverse analogy to the traveling-salesman problem: the shortest path between two terms (i.e., the number of edges required to connect them in the HPO) denotes the cost, and the goal is to minimize the SD of the array of shortest paths. We adapted the Hungarian algorithm to solve this problem. The resulting subset is added to the final list of candidates (line 8). Finally, the list of terms initially filtered out with TFIDF is pruned with TFIDFIC (lines 9 and 10), and the terms are grouped according to the top-level HPO abnormalities in the same manner as the clustering seeds (line 11). Incrementally, using the group-based density and set of seeds computed in the previous step, we append each leftover term to the seed subset and compute an aggregated density. If the new density is within the limits established by the density-margin parameter (e) with respect to the seed density, then the term is added to the final candidates (lines 12–15). Given a gold-standard corpus, one of the main advantages of this algorithm is the opportunity to learn diverse values for the three parameters, subject to a particular goal. For example, the above-mentioned assumption (i.e., that diseases affect a very limited set of major organs) can be transformed into a learning task based on disease categories. We experimented with the 41 manually curated diseases, split into 13 categories dictated by the top-level terms (e.g., cardiovascular diseases, integumentary-system diseases, etc.) of the Disease Ontology (DO), and aimed to maximize the category-based true-positive rate. This can be realized by learning sets of parameters corresponding to each disease category. The experimental results showed an overall precision of 66.8%, including over 70% precision for diseases caused by infectious agents (73.0%), diseases of the nervous system (77.8%), and immune-system diseases (82.8%). Similarly, we experimented with targeting a maximized overall F-score (i.e., the harmonic mean of precision and recall, balancing coverage and true-positive rate) and achieved a value of 45.1%. This value is equivalent to an average precision of around 60% associated with a recall of around 40%.
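Before formalizing the underlying measures, the density-driven selection and expansion (lines 5–15 of the algorithm in Figure 1) can be sketched as follows. For brevity this sketch enumerates seed subsets exhaustively and grows the result greedily, a plain stand-in for the adapted Hungarian algorithm used in our implementation; `path_len` is assumed to return the number of HPO edges between two terms, and all names are illustrative.

```python
# A simplified stand-in for lines 5-15 of Figure 1: choose the
# minimum-density seed subset per group, then admit leftover terms whose
# inclusion keeps the aggregated density within the margin e.
from itertools import combinations
from statistics import pstdev

def density(terms, path_len):
    """SD of pairwise shortest-path lengths within a term subset."""
    dists = [path_len(a, b) for a, b in combinations(sorted(terms), 2)]
    return pstdev(dists) if dists else 0.0

def expand_group(seeds, leftovers, e, path_len):
    # lines 5-7: exhaustive search over the (small) seed group; the real
    # implementation adapts the Hungarian algorithm instead.
    candidates = [set(c) for r in range(2, len(seeds) + 1)
                  for c in combinations(sorted(seeds), r)]
    best = (min(candidates, key=lambda s: density(s, path_len))
            if candidates else set(seeds))
    base = density(best, path_len)
    final = set(best)                    # line 8: subset joins final candidates
    for t in sorted(leftovers):          # lines 12-15: margin-gated expansion
        if abs(density(final | {t}, path_len) - base) <= e:
            final.add(t)
    return final
```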
Information Theoretic Measures for HPO Annotations

The algorithm in Figure 1 uses several information theoretic measures, discussed below. TFIDF is a standard information-retrieval metric for ranking terms on the basis of their co-occurrence and specificity in the context of a given set of documents. In our case, the goal is to rank HPO terms according to their frequency and specificity in the context of a particular disorder. TFIDF is adapted as follows to take the disorder-specific context into account, where t denotes an HPO term, D denotes the disease under scrutiny, and TD represents the total number of disorders (i.e., 3,145):

\[ \mathrm{TFIDF}(t,D) = \mathrm{TF}(t,D) \times \mathrm{IDF}(t,D) \]

TF(t,D), the term frequency of HPO term t for disease D, is defined as the number of D-associated abstracts in which the term t appears at least once (regardless of the number of mentions in a particular abstract), and the inverse document frequency, IDF(t,D), is defined as the logarithm of the quotient of the total number of diseases (TD) divided by the number of diseases for which the HPO term in question is mentioned in at least one abstract:

\[ \mathrm{IDF}(t,D) = \log \frac{TD}{\lvert \{ d : t \in d \} \rvert} \]

The IC of an individual HPO term within the MEDLINE corpus can be estimated from its frequency among annotations of the entire corpus. Intuitively, the IC of a term such as “fever” (HP: 0001945) is less than that of a term such as “aortic arch calcification” (HP: 0005303) because fewer diseases are characterized by the latter abnormality, and so knowing that an individual has aortic arch calcification narrows down the differential diagnosis much more than knowing that an individual has fever. For each term t of the HPO, the IC is quantified as the negative logarithm of its frequency:

\[ \mathrm{IC}(t) = -\log p(t) \]

If a disease is annotated with any term t in the HPO, it must also be annotated with all the ancestors of t. Therefore, the IC of terms is calculated on the basis of annotations with the term or any of its descendants in the HPO.41 For instance, if seven of 1,000 abstracts are annotated with a certain HPO term t′, and three more abstracts are annotated with descendants of t′, then the frequency of the term would be calculated as p(t′) = 10/1,000 = 0.01, and the IC of the term would be calculated as IC(t′) = −log(0.01). The higher (i.e., closer to the root) in the ontology a term is located, the lower its IC. We use the IC as an additional factor to define TFIDFIC for HPO term t and disease D:

\[ \mathrm{TFIDFIC}(t,D) = \mathrm{TFIDF}(t,D) \times \mathrm{IC}(t) \]

Calculation of Phenotypic Overlap with an Extended Jaccard Index

The Jaccard index is a standard measure of similarity between two sample sets, A and B, and is defined as the size of the intersection divided by the size of the union of the sample sets:

\[ J(A,B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} \]

The value of the Jaccard index ranges from 0 for complete dissimilarity to 1 for identity. In a typical set-based context, the Jaccard index is computed on the strict intersection and union of the elements. In our context, however, these elements represent ontology terms structured in a logical hierarchy, so we can rely on the subsumption relation between terms when computing the intersection and union. We exploited this aspect in the computation of the Jaccard index: a match between two terms was recorded not only when the two terms matched exactly (i.e., “cranial hyperostosis” is the same as “cranial hyperostosis”) but also when a subsumption relation was present (i.e., “cranial hyperostosis” is a parent of “calvarial hyperostosis” and an ancestor of “mandibular hyperostosis”; Figure S1).
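One plausible reading of this subsumption-aware matching is sketched below: two terms count as matching if they are identical or one is an ancestor of the other. The `ancestors` map (a precomputed transitive closure over the HPO) and the matching rule are assumptions for illustration, not the exact published implementation.

```python
# A minimal sketch of the extended Jaccard index in which
# subsumption-related terms count as matches. `ancestors` maps each term
# to the set of its HPO ancestors (assumed precomputed, excluding itself).
def matches(a, b, ancestors):
    return a == b or a in ancestors.get(b, set()) or b in ancestors.get(a, set())

def extended_jaccard(A, B, ancestors):
    """Jaccard index that credits exact matches and subsumption matches."""
    matched_a = {a for a in A if any(matches(a, b, ancestors) for b in B)}
    matched_b = {b for b in B if any(matches(a, b, ancestors) for a in A)}
    intersection = len(matched_a | matched_b)
    union = len(A | B)
    return intersection / union if union else 0.0
```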
Validation of HPO Annotations for Common Disorders

We chose three to five common diseases from each of the 13 DO upper-level categories used in our common-disease network (CDN; see below) for a total of 41 diseases. We used a Perl script to choose diseases at random from among all diseases in the categories. We examined the diseases manually by assessing each HPO term mentioned at least once in any abstract describing the disease in question (thus, we evaluated substantially more HPO terms than merely the set of terms chosen by our annotation pipeline on the basis of the frequency and specificity of the term). Biocuration was performed by N.V., G.B., D.V., A.Z., M.H., and P.N.R., and all annotations were validated by P.N.R., who is both a computer scientist and a medical doctor. This allowed us to assess the true-positive, false-positive, and false-negative rates, as shown in Tables S1–S41.

CDN

In order to validate and visualize the phenotype annotations obtained for common diseases, we constructed a CDN by computing the pairwise similarity of a total of 1,678 diseases (i.e., annotated MeSH entries) belonging to 13 DO categories, such as “nervous system disease” (DOID: 863) or “respiratory system disease” (DOID: 1579) (Figure S2). Note that some diseases belong to multiple DO classes (Figure S3). For each disease, we obtained all the HPO annotations that our CR algorithm had associated with the disease. The annotation frequency of a term was defined as the proportion of diseases annotated with the term or any of its descendant terms. To calculate the similarity between two terms (t1, t2), we used the IC of their most informative common ancestor (MICA),3 denoted MICA(t1, t2). We used this term-similarity measure to calculate a semantic-similarity score for two diseases (D1, D2): for each of the terms of D1, the “best match” among the terms annotated to D2 was found, and the average over all query terms was calculated. This was defined as the similarity

\[ \mathrm{sim}(D_1 \rightarrow D_2) = \operatorname{avg}\!\left[ \sum_{s \in D_1} \max_{t \in D_2} \mathrm{IC}\big(\mathrm{MICA}(s,t)\big) \right], \]

where the average was taken over all terms s to which disease D1 is annotated. Note that this score is asymmetric; i.e., it is not necessarily the case that sim(D1→D2) = sim(D2→D1). Therefore, for the analysis described here, we used a symmetric similarity score:

\[ \mathrm{sim}(D_1, D_2) = \tfrac{1}{2}\,\mathrm{sim}(D_1 \rightarrow D_2) + \tfrac{1}{2}\,\mathrm{sim}(D_2 \rightarrow D_1) \]

The CDN consists of nodes that represent common diseases and edges that indicate that two diseases are phenotypically similar. In order to create the CDN, we calculated the symmetric similarity score for all pairs of diseases. The network was visualized with the force-directed layout algorithm of Cytoscape,42 whereby an edge between two nodes was drawn if the similarity between the corresponding diseases exceeded 2.0 (the similarity cutoff [simcut]). The final CDN (CDN-o) consisted of 1,148 diseases and 4,059 edges.
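The asymmetric and symmetric scores defined above translate directly into code. In the sketch below, `ic` maps each HPO term to its information content and `ancestors` maps each term to its ancestor set (including the term itself); both are assumed, precomputed inputs, and the term sets are assumed non-empty.

```python
# A minimal sketch of the MICA-based disease-similarity scores.
def mica_ic(s, t, ic, ancestors):
    """IC of the most informative common ancestor of terms s and t."""
    common = ancestors[s] & ancestors[t]
    return max((ic[a] for a in common), default=0.0)

def sim_directed(d1_terms, d2_terms, ic, ancestors):
    """Average best-match MICA IC over the query terms of D1 (asymmetric)."""
    return sum(max(mica_ic(s, t, ic, ancestors) for t in d2_terms)
               for s in d1_terms) / len(d1_terms)

def sim_symmetric(d1_terms, d2_terms, ic, ancestors):
    """Symmetrized score: the mean of the two directed similarities."""
    return 0.5 * sim_directed(d1_terms, d2_terms, ic, ancestors) \
         + 0.5 * sim_directed(d2_terms, d1_terms, ic, ancestors)
```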
Statistical Significance of the CDN

In order to test the statistical significance of the distribution of phenotypic similarity among diseases within the same disease category or between different categories, we introduced the concept of the gray-edge fraction (GEF): edges between nodes (diseases) that do not share any of the 13 general disease categories were visualized as gray edges, and the GEF was defined as the proportion of gray edges among all edges in the CDN. The lower the GEF, the better the phenotypic clustering of diseases agrees with the classification of the diseases into the 13 categories. The original CDN (CDN-o) comprised 3,547 edges, 998 of which were gray edges, corresponding to a GEF of 0.246 (red arrow in Figure S4A). We tested two randomization procedures: edge randomization (er) and annotation randomization (ar). The edge-permutation procedure retains the number of edges and the degree distribution of the network.43 Two edges, A-B and X-Y, are chosen at random and reshuffled to create the edges A-Y and X-B; reshuffling is skipped if the edges A-Y and X-B already exist. Reshuffling is performed 10,000 times, resulting in an edge-randomized version of CDN-o, which we call CDN-er and for which we can again compute the GEF. We constructed 1,000 versions of CDN-er and plotted the distribution of the resulting GEF values in Figure S4A. As one can see, the p value of the CDN is less than 0.001 because none of the edge-randomized CDNs achieved a GEF equal to or smaller than that of the original CDN. We additionally performed a test in which we randomized the HPO terms associated with each disease (ar). For this, we randomly selected 50% of the terms associated with each disease and replaced them with randomly selected HPO terms. We computed the randomized CDN (called CDN-ar) by using the same procedures used to construct CDN-o. We repeated this procedure 100 times and computed the GEF for each CDN-ar. Note that a CDN-ar need not have the same number of nodes and edges as CDN-o: when we used the same simcut (2.0) used for constructing CDN-o, we obtained much smaller networks (fewer than 100 nodes). The distribution of GEF values of CDN-ar with a simcut of 2.0 is shown in Figure S4B. No CDN-ar achieved a GEF less than or equal to the CDN-o GEF, corresponding to a p value of less than 0.01. We changed the simcut to 1.4 because it leads to CDN-ar versions with approximately the same number of nodes as CDN-o. The distribution of the resulting GEF values is shown in Figure S4C. Again, not a single CDN-ar constructed with a simcut of 1.4 achieved a GEF less than or equal to the CDN-o GEF, corresponding to a p value of less than 0.01.
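The degree-preserving edge swap described above can be sketched as follows; the representation of the network as a set of undirected edges (sorted tuples) and the function name are assumptions for illustration.

```python
# A minimal sketch of degree-preserving edge randomization: pick two
# edges A-B and X-Y at random and rewire them to A-Y and X-B, skipping
# swaps that would create a self-loop or duplicate an existing edge.
import random

def edge_randomize(edges, n_swaps=10_000, seed=0):
    """Return a rewired copy of an undirected edge set."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (x, y) = edge_list[i], edge_list[j]
        new1, new2 = tuple(sorted((a, y))), tuple(sorted((x, b)))
        if a == y or x == b or new1 in edge_set or new2 in edge_set:
            continue  # skip, as in the published procedure
        edge_set.difference_update([edge_list[i], edge_list[j]])
        edge_set.update([new1, new2])
        edge_list[i], edge_list[j] = new1, new2
    return edge_set
```

Because each swap exchanges endpoints between two existing edges, every node keeps its original degree, which is exactly the property the randomization test relies on.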
GWAS Data

GWAS Central provides a comprehensive collection of summary-level genetic-association data and advanced visualization tools to allow comparison and discovery of datasets from the perspective of genes, genome regions, phenotypes, or traits.33 The project collates association data and study metadata from many disparate sources, including the National Human Genome Research Institute GWAS Catalog,35 and receives frequent data submissions from researchers who wish to make their research findings publicly available. All gathered and submitted data are extensively curated by a team of post-doctoral genetics researchers who manually evaluate each study for its range of phenotype content and apply appropriately chosen MeSH terms. As of December 2014, the resource contained 69 million p values for over 1,800 studies. Data and metadata for up to 1,000 associations can be freely downloaded from the BioMart-based system (GWAS Mart), and larger custom data dumps (up to and including the complete database) are available by contacting the GWAS Central development team and agreeing to a data-sharing statement. To provide data for the present study, we generated a tab-separated file representing 1,574 studies and 34,252 unique SNPs (annotated to 675 unique MeSH terms) and containing the GWAS Central study identifier, PubMed identifier, dbSNP “rs” identifier, p value, and MeSH identifier for all associations with p < 1 × 10−5. We compiled the list of genes considered for our experiments by retrieving the “mapped genes” column from the SCAN database and identifying those genes corresponding to the GWAS Central SNPs. Where no mapped genes were reported, we used the upstream and downstream genes listed by SCAN.44
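A small sketch of how a tab-separated file with the columns listed above might be consumed, grouping significant SNPs by MeSH term, is given below; the column order, file layout, and function name are assumptions for illustration only.

```python
# A minimal sketch for reading the association file described above and
# grouping dbSNP identifiers by MeSH term at the stated p-value cutoff.
import csv
from collections import defaultdict

def snps_by_mesh(path, p_cutoff=1e-5):
    snps = defaultdict(set)
    with open(path, newline="") as fh:
        # assumed columns: study_id, pmid, rs_id, p_value, mesh_id
        for study_id, pmid, rs_id, p_value, mesh_id in csv.reader(fh, delimiter="\t"):
            if float(p_value) < p_cutoff:
                snps[mesh_id].add(rs_id)
    return snps
```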