Extending the mutual information measure to rank inferred literature relationships

Abstract

Background

Within the peer-reviewed literature, associations between two things are not always recognized until commonalities between them become apparent. These commonalities can provide justification for the inference of a new relationship where none was previously known, and are the basis of most observation-based hypothesis formation. It has been shown that the crux of the problem is not finding inferable associations, which are extraordinarily abundant given the scale-free networks that arise from literature-based associations, but determining which ones are informative. The Mutual Information Measure (MIM) is a well-established method to measure how informative an association is, but is limited to direct (i.e. observable) associations.

Results

Herein, we attempt to extend the calculation of mutual information to indirect (i.e. inferable) associations by using the MIM of shared associations. Objects of general research interest (e.g. genes, diseases, phenotypes, drugs, ontology categories) found within MEDLINE are used to create a network of associations for evaluation.

Conclusions

Mutual information calculations can be effectively extended into implied relationships, and a significance cutoff can be estimated from analysis of random word networks. Of the models tested, the shared minimum MIM (MMIM) model correlates best with the observed strength and frequency of known associations. In three test cases, the MMIM method tends to rank more specific relationships higher than does counting the number of shared relationships within a network.

Background

Most scientific fields are data-intensive, but perhaps none more so than biology and medicine. Sequencing efforts have generated billions of base pairs of genetic information across hundreds of thousands of species, and ushered in the relatively recent completion of the Human Genome Project [1].
Microarrays enable thousands of transcriptional measurements per experiment [2], and high-throughput chemistry enables the simultaneous screening of thousands of molecules at a time for activity [3]. New discoveries among research areas (e.g. genetics, medicine, chemistry) lead to a necessarily increasing amount of specialization as more objects (e.g. genes, diseases, phenotypes, chemical compounds) are discovered to be of research interest. This is reflected by the growth in the number of scholarly journals published every year, as well as in the number of total records indexed in biomedical literature reference databases such as MEDLINE [4]. In any field, the gain in our cumulative scientific knowledge has the unfortunate effect of narrowing our perspectives as individuals, providing us with far too much information to assimilate and far too many variables to analyze. Yet the most valuable type of information is often what is not known or apparent to others: information implied by a set of data, facts or associations. History is replete with examples of insights into scientific problems coming from a series of observations from apparently unrelated fields, discoveries or events. But how could one retrieve or compile such information when one is not certain what to look for and the search space is vast? This is the primary reason that methods of data-mining and knowledge discovery are becoming increasingly important in handling this explosion of information.

Previous research

Most scientific knowledge comes from peer-reviewed articles and is written in free-form text, which is difficult to analyze algorithmically. However, the idea that novel relationships within text could be computationally identified based upon existing relationships has its roots in an approach developed by Don Swanson, who used software to identify words shared between article titles [5].
Using their software, called Arrowsmith, Swanson and Smalheiser were able to identify common intermediates between Raynaud's Disease (a circulatory disorder restricting blood flow to the extremities) and the dietary effects of fish oil, leading to the hypothesis, and subsequent proof [6], that compounds within dietary fish oil could alleviate the symptoms of Raynaud's Disease [5,7]. To explain why such a sensible hypothesis had gone unnoticed by researchers in either field alone, the term "non-interactive literatures" was coined. This term, in essence, implies that increasing specialization among all fields results in a relative lack of awareness of the findings in other, less related fields. Entities that do not have known or documented associations, yet share intermediate relationships, are said to have "transitive", "implicit", "indirect" or "inferable" relationships. Deciding that no relationship exists when no co-mentions exist is somewhat of an over-simplification, but a necessary one. Realistically, several co-mentions between terms could be observed without a definitive relationship being present. However, if one uses a greater-than-zero cutoff to define when a relationship exists, false negatives become a problem: some co-mentions below the cutoff will constitute a real relationship. Using zero co-mentions as a cutoff is a convenience to avoid this problem, even if the end result is that some relationships are declared "known" when they really are not. While pioneering, a keyword-based method such as Swanson and Smalheiser's is both limiting and highly burdensome, especially where a large body of literature is concerned, because the number of unique keywords grows quickly with each record analyzed. Neither is the method amenable to open-ended querying, that is, telling a user what is implicitly related to a query term.
Rather, one must essentially begin by postulating a relationship between a query term, A, and another term, C, where a set of intermediate terms, B, can be found that connect the two. Even improvements in visualizing or exploring records that share commonalities and/or define entities of interest [8,9] are limited because they require manual user navigation and analysis of results. Other approaches have attempted to utilize Medical Subject Headings (MeSH) [10] or the Unified Medical Language System (UMLS) [11] to engage in open-ended discovery by pairing concepts, counting the number of relationships shared by two terms as a means of judging the pair's implicit significance. However, these approaches do not take into account the fact that the more general the nature of the relationship, the more connections are likely to be shared by two terms. It was previously demonstrated that, because the number of associations between terms follows a scale-free, or inverse power-law, distribution, the number of inferable associations with any given term rapidly approaches the maximum number of possible associations as the number of direct associations grows [12]. That is, even if one starts with a term that is only associated with several others, at least one of these is likely to be associated with a very large number of terms. Thus, the starting term will be implicitly associated with most of the network (the "small world" phenomenon). Therefore, the issue is not identifying implicit associations, but judging which of the many implicit associations are worth further examination. Previous work demonstrated that it was feasible to identify pertinent implicit relationships by ranking inferred relationships and preferentially examining those at the top of the list [12]. One of its shortcomings, though, was that associations between terms were assigned based upon co-occurrence of terms within an abstract.
This is a fairly well-accepted means of assigning tentative relationships between terms, but when considering the scale-free distribution of objects within the literature, it is apparent that some frequently mentioned objects could be co-mentioned many times with other terms without any actual biological association being implied. Figure 1 uses an analysis of terms related to "capsaicin" to illustrate this point. Although the MIM may have drawbacks in identifying broad relationships (for example, see Table 1: some very pertinent relationships receive modest MIM scores when the terms are common), it is a very straightforward and well-established means of measuring information content between two terms. Such a measure would enable us to pursue more specific relationships, those with high information content. The MIM, however, can only be calculated using direct (A-B) or (B-C) relationships rather than implicit (A-C) ones. Thus, the goal here is to test methods of extending the MIM calculation to include implied relationships such that a statement can be made about the implied mutual information content of two unrelated terms.

Identifying literature-based associations

The general approach of associating objects by searching for their co-occurrence within text has been used in many fields as a simple yet comprehensive way to identify potential associations. In biology and medicine, co-occurrence has been used to identify potential relationships between genes [13,14], proteins [15] and drugs [16]. The disadvantage of this approach is that associations are very general; that is, no specifics on how two objects are related or associated are obtained by this method. False positives can also be a problem, as terms far apart within an abstract, with no apparent association, may be counted as "relationships". The advantages are that it is easy to implement and comprehensive.
To begin a search for novel, inferable associations within the literature, "objects" of interest in scientific research were first defined by assimilating entries from relevant databases into one central database. By doing this, both words and phrases can be identified within text, and synonymous terms can be mapped to primary terms. All electronically available literature was then analyzed for associations between objects of interest by searching for their co-occurrence within MEDLINE records (titles and abstracts) and summing the total number found. The significance of this collective set of co-occurrences is evaluated using the mutual information measure (MIM), which was originally based upon Shannon's entropy theory [17] and has also been successful in identifying lexical dependencies [18]. By processing a body of literature that comprehensively covers a topic, field or area, it can be asserted that the current state of knowledge has been approximated, at least at the level of broad object-object associations. All available literature was processed, creating a network of associations for each object. This network can in turn be analyzed for associations shared by two unassociated objects. That is, we can use the network to identify objects that share associations but are not themselves associated. Such objects are said to be implicitly associated with each other, and new associations can potentially be inferred by evaluation of their shared associations. Since there are many implicitly associated objects, the relevance of each one is also evaluated using the MIM. However, while a MIM can be calculated to evaluate the relevance of an association between A and B and between B and C, it is not clear how each of these individual scores extends to the inference of an association between A and C. Therefore, we explore and evaluate different methods.
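The co-occurrence counting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `recognize` is a hypothetical stand-in for the object recognition database lookup, and the per-term and per-pair record counts it accumulates correspond to the occurrence and co-occurrence totals the MIM is later computed from.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(records, recognize):
    """Count, per MEDLINE-style record, which objects occur and co-occur.

    records: iterable of record texts (title + abstract).
    recognize: function mapping a record's text to the set of known
               objects it mentions (stand-in for an ORD lookup).
    Returns (number of records, per-term counts, per-pair counts).
    """
    term_counts = Counter()   # records mentioning each object
    pair_counts = Counter()   # records co-mentioning each object pair
    n_records = 0
    for text in records:
        n_records += 1
        objects = recognize(text)
        term_counts.update(objects)          # each object counted once per record
        for a, b in combinations(sorted(objects), 2):
            pair_counts[(a, b)] += 1         # canonical (sorted) pair key
    return n_records, term_counts, pair_counts
```

Each record contributes at most one count per term and per pair, matching the definition of co-occurrence as joint appearance within a single record.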
Methods and algorithms

Code was written in Visual Basic 6.0 (SP5) using ODBC extensions to interface with an SQL-based database, with database queries written in SQL. Programs were executed on a Pentium 4 3.06 GHz machine with 1 GB of RAM and two ultra-fast SCSI hard drives. The National Library of Medicine graciously provided an electronic archive of MEDLINE records in XML format. To obtain a set of common words for analysis, the Merriam-Webster dictionary was parsed into individual words, and each word was summed by the number of times it was observed within the dictionary. 10,000 words were chosen with dictionary frequencies ranging from 322 to 28; this range was selected so that no extremely common or rare words would be within the list. To create a database of random word associations, only 100,000 titles/abstracts were used. This was done to avoid network saturation (i.e. having a significant number of objects related to every other object) and to ensure that the distribution in the number of associations between words resembled the same power-law distribution observed for biomedical objects. The occurrence of biomedical objects within scientific text is identified by comparing phrases within MEDLINE records to entries in the object recognition database (ORD). The ORD is built from terms found in several different biomedical databases, all freely available for download. Objects classified as diseases, disorders, syndromes or phenotypes were obtained from Online Mendelian Inheritance in Man (OMIM) [19]; chemical compounds and small molecules were obtained from the Medical Subject Headings (MeSH) database [20]; approved drug names were obtained from the Food and Drug Administration; genes were obtained from LocusLink [21]; and ontological classifications for genes were obtained from the Gene Ontology consortium [22].
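The word-selection step above (keeping words whose dictionary frequency falls in a middle band, excluding the most common and the rarest) can be sketched as follows. The function name is illustrative, and the default band endpoints simply reuse the 28-to-322 range reported in the text.

```python
from collections import Counter

def words_in_frequency_band(words, lo=28, hi=322):
    """Keep words whose observed frequency lies within [lo, hi].

    words: iterable of word tokens (e.g. parsed from a dictionary).
    lo, hi: inclusive frequency bounds; defaults follow the range
            reported in the text (28 to 322) to drop words that are
            extremely common or extremely rare.
    """
    counts = Counter(words)
    return [w for w, c in counts.items() if lo <= c <= hi]
```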
Assimilation of terms is done automatically, but a table within the ORD contains additional biomedical terms to be added or deleted as deemed necessary (e.g., some databases contain vague or uninformative terms such as "survey" or "extended", useless information such as "deleted entry", or errors such as "#NAME?"). Compared to the overall size of the ORD, this table is small (1,007 entries versus over 223,000 terms assimilated) and is designed primarily to reduce clutter. Acronyms for entries, if not explicitly stated within the assimilated database, were obtained from an acronym database [23]. Similarly, spelling variants were also obtained from this database where possible. This database can be accessed online [24]. As an example of the spelling variants detected, the user can go to this URL, enter the acronym "ICAM-1" and note the many subtle variations. The acronym-resolving heuristic used to construct this database was also used to resolve acronyms within text when they occurred.

The Mutual Information Measure

A scoring scheme based upon the Mutual Information Measure (MIM) [17] is used to estimate the strength of association between co-occurring terms within the literature. It should be noted that other statistical methods of association, such as chi-square tests, log-likelihood ratios, z-scores or t-scores, could be used as well; these are all means of judging the statistical significance of a relationship. In this paper, however, we focus on the MIM only, as a proof of principle that mutual information calculations can be extended into implicit relationships. The MIM has been widely used to quantify dependencies between variables, including co-occurring terms in text [25], and is shown in equation (1):

    MIM(A,B) = log( P_AB / (P_A * P_B) )     (1)

P_AB is the measured probability that A and B will be observed together in the same abstract, while P_A and P_B are the probabilities of observing A or B, respectively, in a given abstract.
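Equation (1) can be computed directly from record counts. The sketch below is illustrative (argument names are not from the paper): each probability is estimated as a fraction of the total records processed, before the time adjustment introduced next.

```python
import math

def mim(t_ab, t_a, t_b, n_records):
    """Log MIM of equation (1): log(P_AB / (P_A * P_B)).

    t_a, t_b: records mentioning A or B independently.
    t_ab: records co-mentioning A and B.
    n_records: total records processed.
    """
    p_a = t_a / n_records
    p_b = t_b / n_records
    p_ab = t_ab / n_records
    return math.log(p_ab / (p_a * p_b))
```

When A and B appear independently (P_AB = P_A * P_B), the ratio is 1 and the log score is zero: observing one term conveys no information about the other.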
Furthermore, because scientific research and discovery is a time-dependent process, prior information can be incorporated to refine the probabilities in equation (1). The describing of a disease or the discovery of a gene, for example, will occur at a given point in time (illustrated in Figure 2) within the history of publications. Regardless of an object's overall frequency in the database, the probability that it will appear in the literature prior to its discovery is zero. Thus P_A and P_B are calculated from their time of first appearance, and P_AB is calculated using the later of these two dates. P_AB, P_A and P_B are thus calculated as:

    P_A = T_A / (A_t - A_f)
    P_B = T_B / (A_t - B_f)
    P_AB = T_AB / (A_t - max(A_f, B_f))

where T_A and T_B are the total number of records A and B are independently mentioned in, respectively, and T_AB is the total number of records co-mentioning A and B. A_f and B_f represent how many records were read in before the first occurrences of A and B were observed, respectively; max(A_f, B_f) is the larger of the two values; and A_t is the total number of records processed. As an example of how the MIM score is used, assume that the probability A will appear in any given record within a database of records is 10% and the probability of B appearing is the same. If the appearance of A is completely independent of the appearance of B, then no information about one can be gained by observing the appearance of the other. The probability that both A and B will be observed in the same record is thus 0.1 * 0.1 = 0.01. The ratio within equation (1) then evaluates to 1 and the log value to zero, the information gained on one object by observing the other. If the probability of observing A increases when B is mentioned, then MIM > 0. If A and B are rarely mentioned together, then MIM < 0.
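The time-adjusted probabilities above can be sketched as follows. This follows the stated definitions (each probability is taken only over the span of records in which the term, or pair, could actually have appeared); the function and argument names are illustrative.

```python
def time_adjusted_mim_ratio(t_ab, t_a, t_b, a_f, b_f, a_t):
    """MIM ratio P_AB / (P_A * P_B) with first-appearance adjustment.

    t_a, t_b, t_ab: record counts for A, B and the pair A-B.
    a_f, b_f: records read before A and B first appeared, respectively.
    a_t: total records processed.
    """
    p_a = t_a / (a_t - a_f)               # P_A  = T_A  / (A_t - A_f)
    p_b = t_b / (a_t - b_f)               # P_B  = T_B  / (A_t - B_f)
    p_ab = t_ab / (a_t - max(a_f, b_f))   # P_AB = T_AB / (A_t - max(A_f, B_f))
    return p_ab / (p_a * p_b)
```

With A_f = B_f = 0 (both terms present from the first record), this reduces to the unadjusted ratio; a late-appearing term instead has its probability concentrated over the shorter span since its discovery.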
When considering scientific writing style with reference to biomedical objects such as genes, diseases and chemical compounds, there is a probability that two of them might be mentioned together in the same record without having an established association. For example, one of the objects may be very commonly used in many studies (e.g. the gene LacZ is used for staining assays, luciferase is used for luminescence), or one of the objects may be of great scientific/medical interest, and authors may make an extra effort to speculate how their results might relate to such objects (e.g. cancer, diabetes, heart disease, apoptosis). The MIM provides a way of quantifying literature-based object dependencies. However, taking the log value can assign a negative weight to an association when two frequent terms are mentioned together. Optimally, irrelevant or uninformative associations (i.e. those with little mutual information) would be ignored entirely rather than penalized. Therefore, the log function is removed and the equation becomes:

    MIM(A,B) = P_AB / (P_A * P_B)

The possibility remains that rare associations might receive a very high MIM score [26], but it is hoped that the fact that many MIM scores are being summed and compared will ameliorate this effect when it occurs.

Inferring new associations based upon commonalities

Figure 3 shows the general conceptual approach undertaken here. Call the primary research object node "A" in a network constructed of MIM scores between objects. For each A there is a set of other objects, or nodes, associated with it by virtue of co-occurrence in the literature. We will call this set "B" and, assuming a total of t objects in this set, each individual object can be given the symbol "B_n", where 0 < n <= t.

Scores for the inferred association between Raynaud's Disease and fish oil were lower than expected. Upon examination, Swanson's discovery of this link, although validated experimentally [6], has apparently not generated much continued experimental research interest in the 15 years since then.
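One plausible reading of the shared minimum MIM (MMIM) model, sketched below under that assumption (the dictionary layout and function name are illustrative, not the paper's implementation): for an implicit pair A-C, each shared intermediate B_n contributes the weaker of its two direct scores, min(MIM(A,B_n), MIM(B_n,C)), and these minima are summed.

```python
def mmim(a, c, mim_scores):
    """Shared-minimum MIM score for an implicit A-C pair.

    mim_scores: dict mapping each term to a dict of {neighbor: MIM}
                for its direct (co-occurrence) associations.
    For every intermediate B shared by A and C, keep the weaker of
    the two direct scores and sum over all shared intermediates, so
    a chain is never credited with more information than its weakest
    link carries.
    """
    shared = set(mim_scores.get(a, {})) & set(mim_scores.get(c, {}))
    return sum(min(mim_scores[a][b], mim_scores[c][b]) for b in shared)
```

Ranking candidate C terms by this score, rather than by the raw count of shared intermediates, down-weights inferences built from many weak, generic links.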
A search via Ovid on "(raynaud or raynauds or raynaud's) and (eicosapentaenoic or docosahexaenoic or fish oil)" yielded only 5 papers: three were text-mining papers, including Swanson's original study [5,30,32]; the fourth was the 1989 validation study [6]; and the fifth was a study showing that fish oil did not have a significant effect upon Raynaud's phenomenon in mixed cryoglobulinemia (a syndrome in which Raynaud's is one of many symptoms) [33]. Examining the relationships that tend to rank highly in both models, it is apparent that, when ranking by the number of shared relationships, the higher-scoring entries tend to be more general and vague in nature (e.g., links to "blood", "development", "females" and "males"). When ranked by the MMIM, their relevance to the object in question is more readily apparent. For example, sumatriptan is a drug used to treat migraines, and other items ranking highly on the list, such as nausea, vomiting and dizziness, typically accompany migraines. Notably, one of the important links that Swanson used to surmise the role of magnesium is also on this list: seizures, which are linked to migraines.

Discussion

Information retrieval (IR) methods are limited to querying what is known; yet often the most valuable information is what is not directly known. Mutual information measures have been used successfully in many IR applications, and a method has been presented here to extend them to inferable associations. We find that the normalized MMIM method of ranking inferences based upon their shared associations correlates best with the level of currently established mutual information. A good correlation suggests that mutual information is being captured even though evaluation proceeds indirectly, through intermediates.
For simplicity, we have used a cutoff of zero co-occurrences to suggest that no association between objects has been made, but it is quite possible that a number of co-occurrences could be noted between two objects with no specific relationship between them documented. Additionally, a certain relationship may be known between the two, but other important relationships may still remain to be inferred. At this point, however, it is not clear how this could effectively and quantitatively be taken into account. The method reported here was applied to biomedical research, but could ostensibly be applied to any domain in which the goal is to identify undiscovered relationships. Importantly, this method of automated inference ranking provides a quantitative way of prioritizing inferred associations while the available literature grows rapidly in size and scope.