Information Theoretic Measures for HPO Annotations

The algorithm in Figure 1 uses several information theoretic measures, discussed below. TFIDF is a standard information-retrieval metric for ranking terms on the basis of their co-occurrence and specificity in the context of a given set of documents. In our case, the goal is to rank HPO terms according to their frequency and specificity in the context of a particular disorder. TFIDF is adapted below (to take into account the disorder-specific context), where t denotes an HPO term, D denotes the disease under scrutiny, and TD represents the total number of disorders (i.e., 3,145).

$$\mathrm{TFIDF}(t, D) = \mathrm{TF}(t, D) \times \mathrm{IDF}(t, D)$$

TF(t, D), the term frequency of HPO term t for disease D, is defined as the number of D-associated abstracts in which the term t appears at least once (regardless of the number of mentions in a particular abstract). The inverse document frequency, IDF(t, D), is defined as the logarithm of the quotient of the total number of diseases (TD) divided by the number of diseases for which the HPO term in question is mentioned in at least one abstract.

$$\mathrm{IDF}(t, D) = \log \frac{\mathit{TD}}{|\{d \in D : t \in d\}|}$$

The information content (IC) of an individual HPO term within the MEDLINE corpus can be estimated with its frequency among annotations of the entire corpus. Intuitively, the IC of a term such as "fever" (HP:0001945) is less than that of a term such as "aortic arch calcification" (HP:0005303) because fewer diseases are characterized by the latter abnormality, and so knowing that an individual has aortic arch calcification narrows down the differential diagnosis much more than knowing that an individual has fever. For each term t of the HPO, the IC is quantified as the negative logarithm of its frequency:

$$\mathrm{IC}(t) = -\log p(t)$$

If a disease is annotated with any term t in the HPO, it must also be annotated with all the ancestors of t.
Therefore, the IC of terms is calculated on the basis of annotations with the term or any of its descendants in the HPO.41 For instance, if seven of 1,000 abstracts are annotated with a certain HPO term t′, and three more abstracts are annotated with descendants of t′, then the frequency of the term would be calculated as p(t′) = 10 / 1,000 = 0.01, and the IC of the term would be calculated as IC(t′) = −log p(t′) = −log(0.01). The higher (i.e., closer to the root) in the ontology a term is located, the lower its IC. We use the IC as an additional factor to define TFIDFIC for HPO term t and disease D as

$$\mathrm{TFIDFIC}(t, D) = \mathrm{TFIDF}(t, D) \times \mathrm{IC}(t)$$
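The three measures defined above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the data layout (each abstract as a set of HPO term labels, a `parents` map for the ontology), the choice of natural logarithm, and the handling of terms that never occur are all assumptions made for illustration.

```python
import math


def term_frequency(term, disease, abstracts_by_disease):
    """TF(t, D): number of D-associated abstracts mentioning t at least once."""
    return sum(1 for terms in abstracts_by_disease[disease] if term in terms)


def inverse_document_frequency(term, abstracts_by_disease):
    """IDF(t, D): log of (total diseases / diseases with t in >= 1 abstract)."""
    total = len(abstracts_by_disease)
    with_term = sum(
        1
        for abstracts in abstracts_by_disease.values()
        if any(term in terms for terms in abstracts)
    )
    # Assumption: a term mentioned for no disease gets IDF 0 instead of an error.
    return math.log(total / with_term) if with_term else 0.0


def ancestors(term, parents):
    """All ancestors of a term, following (possibly multiple) parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term, corpus, parents):
    """IC(t) = -log p(t), counting abstracts annotated with t or a descendant."""
    hits = sum(
        1
        for terms in corpus
        if term in terms or any(term in ancestors(t, parents) for t in terms)
    )
    # Assumption: a term never observed in the corpus gets infinite IC.
    return -math.log(hits / len(corpus)) if hits else math.inf


def tfidfic(term, disease, abstracts_by_disease, corpus, parents):
    """TFIDFIC(t, D) = TF(t, D) * IDF(t, D) * IC(t)."""
    tfidf = term_frequency(term, disease, abstracts_by_disease) * \
        inverse_document_frequency(term, abstracts_by_disease)
    if tfidf == 0:
        return 0.0  # avoids 0 * inf when the term is absent from the corpus
    return tfidf * information_content(term, corpus, parents)


# Worked example from the text: 7 of 1,000 abstracts carry t', 3 more carry
# a descendant of t'. The labels below are hypothetical placeholders.
corpus = [{"t_prime"}] * 7 + [{"t_child"}] * 3 + [{"unrelated"}] * 990
parents = {"t_child": ["t_prime"], "t_prime": ["root"], "unrelated": ["root"]}
ic_t_prime = information_content("t_prime", corpus, parents)  # -log(0.01)
```

With the natural logarithm, `ic_t_prime` is about 4.61, whereas the root term, which every abstract inherits, has an IC of 0. This matches the observation that terms closer to the root carry less information.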