@ewha-bio:39
PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis
In this paper we introduce PubMiner, an intelligent machine learning-based text mining system for mining biological information from the literature. PubMiner employs natural language processing techniques and machine learning-based data mining techniques to mine useful biological information, such as protein-protein interactions, from the massive literature. The system recognizes biological terms such as genes, proteins, and enzymes and extracts their interactions described in the document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity, collected from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the interactions extracted directly from text are provided to the user with links to the literature sources. The performance of entity and interaction extraction was tested on selected MEDLINE abstracts. The inference was evaluated using the protein interaction data of S. cerevisiae (baker's yeast) from MIPS and SGD.
New scientific discoveries are based on existing knowledge, which has to be accessible and thus usable by the scientific community (Andrade et al., 2000). In the 19th century, scientific information was still spread by writing letters describing new discoveries to a small number of colleagues. Printed journals then took over this job professionally. We are currently in another transition, this time to electronic media. Electronic storage allows the customized extraction of information from the literature and its combination with other data resources such as heterogeneous databases. This is not only an opportunity but also a pressing need, as the volume of scientific literature is increasing immensely. Furthermore, the scientific community is growing, so that even in a rather specialized field it becomes impossible to stay up to date through personal contacts alone. The growing amount of knowledge also increases the chance of new ideas arising from the combination of solutions from different fields. It is therefore necessary to access and integrate all scientific information in order to judge one's own progress and to be inspired by new questions and answers.
Since the human genome sequences were decoded, more and more researchers have devoted themselves to biology and bioinformatics, and hundreds of on-line databases now characterize biological information such as sequences, structures, molecular interactions, and expression patterns (Chiang et al., 2004). Whatever the topic of research, the end result of all biological experiments is a publication in the form of text. However, information in text form, such as MEDLINE ( http://www.pubmed.gov/ ), is a greatly underutilized source of biological information for biological researchers, because it takes a great deal of time to obtain important and precise information from huge databases that grow daily. Thus knowledge discovery from a large collection of scientific papers is very important for efficient biological and biomedical research. A number of tools and approaches have been developed to address this need, and many systems analyze MEDLINE abstracts to offer bio-related information services. Suiseki (Blaschke et al., 1999; Blaschke et al., 2002) and BioBiblioMetrics (Stapley et al., 2000) focus on protein-protein interaction extraction and visualization. MedMiner (Tanabe et al., 1999) utilizes external data sources such as GeneCards (Safran et al., 2003) and MEDLINE to offer structured information about specific keywords provided by the user. AbXtract (Andrade et al., 1998) labels the protein function in the input text, and XplorMed (Perez-Iratxeta et al., 2000) presents user-specified information through interaction with the user. GENIES (Friedman et al., 2001) discovers more complicated information such as pathways from journal abstracts. Recently, MedScan (Daraselia et al., 2004) employed a full-sentence parsing technique for the extraction of human protein interactions from MEDLINE.
Generally, these conventional systems rely on basic natural language processing (NLP) techniques when analyzing literature data, and their efficacy depends heavily on the rules used for processing raw information. Such rules have to be refined by human experts, which can leave them lacking in clarity and coverage. To overcome this problem, we used machine learning techniques in combination with natural language processing techniques to analyze the interactions among biological entities. Our method also incorporates several data mining techniques for extensive discovery, i.e., detection of interactions that are not directly described in the text.
We have developed PubMiner (Publication-based Text Mining system), which performs efficient interaction mining of biological entities such as genes, proteins, and enzymes. For the performance evaluation, the budding yeast (S. cerevisiae) was used as the model organism. The goal of our text mining system is to design and develop an information system that can efficiently retrieve biological entity-related information from MEDLINE, where this information includes the biological function of entities (e.g., genes, proteins, and enzymes), related genes or proteins, and relations between genes or proteins. We focus in particular on interactions between entities.
PubMiner, a machine learning-based text mining platform, consists of three key components: a natural language processing module, a machine learning-based inference module, and a visualization module.
The interaction extraction module is based on NLP techniques adapted to take into account the properties of biomedical literature. It includes a part-of-speech (POS) tagger, a named-entity tagger, a syntactic analyzer, and an event extractor. The POS tagger, based on hidden Markov models (HMMs), was adopted for tagging biological words as well as general ones. The named-entity tagger, based on support vector machines (SVMs), recognizes the region of an entity and assigns a proper class to it. The syntactic analyzer recognizes base phrases and detects the dependencies represented in them. Finally, the event extractor finds binary relations using the syntactic information of a given sentence, co-occurrence statistics between two named entities, and pattern information of an event verb. General medical terms were trained with the UMLS Metathesaurus (Humphreys et al., 1998), and biological entities and their interactions were trained with the GENIA corpus (Kim et al., 2003). The underlying NLP approach for named entity recognition is based on the systems of Hwang (Hwang et al., 2003) and Lee (Lee et al., 2003). Figure 1 shows the schematic architecture of the information extraction module.
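As a rough illustration of the co-occurrence statistics mentioned above, the following minimal sketch counts how often two named entities appear in the same sentence. It is not PubMiner's event extractor: the function name `cooccurrence_counts`, the input format (one list of entity names per sentence), and the example entities are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_sentences):
    """Count how often each unordered pair of entities shares a sentence."""
    counts = Counter()
    for entities in tagged_sentences:
        # Sort the de-duplicated entities so each pair has one canonical
        # order and is counted at most once per sentence.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical named-entity tagger output: one entity list per sentence.
sentences = [
    ["Cdc28", "Cln2"],
    ["Cdc28", "Cln2", "Sic1"],
    ["Sic1", "Cdc28"],
]
pair_counts = cooccurrence_counts(sentences)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

In a full pipeline such counts would be only one of several signals, alongside the syntactic information and event-verb patterns described in the text.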
The relation inference module, which finds common features and group relations, is based on data mining and machine learning techniques. A set of features of each component of an interaction is collected from public databases such as the Saccharomyces Genome Database (SGD) (Christie et al., 2004) and the database of the Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 2004) and is represented as a binary feature vector. An association rule discovery algorithm, Apriori (Agrawal et al., 1993), was used to extract the appropriate common feature set of interacting biological entities. In addition, a distribution-based clustering algorithm (Slonim et al., 2000) was adopted to analyze group relations. This clustering method collects group relations from a collection of documents that contain various biological entities. The clustering procedure discovers common characteristics among members of the same cluster and also finds the features describing inter-cluster (between-cluster) relations. PubMiner provides a graphical interface for selecting various clustering and mining options. Finally, hypothetical interactions are generated for the construction of the interaction network. The hypotheses correspond to the inferred generalized association rules, and the procedure of association discovery is described in the 'Methods' section. The inferred relations, as well as the relations from text analysis, are stored in a local database in a systematic way for efficient management of information. Figure 2 describes the schematic architecture of the relation inference module.
The visualization module shows interactions among the biological entities in a network format. It also shows the documents from which the relations were extracted and inferred. In addition, diverse additional information, such as the weight of association between biological entities, can be represented. This lets the user easily examine the reliability of the relations inferred by the system. Moreover, the visualization module shows interaction networks with minimized complexity for comprehensibility and can be used as an independent interaction network viewer with a predefined input format. Figure 3 shows the overall architecture of the visualization module and its interface.
In our application, each interaction event is represented by its feature association. Selecting an optimal feature subset is therefore important both for the efficiency of the system and for eliminating non-informative association information. PubMiner uses a feature dimension reduction filter (FDRF) to achieve these objectives.
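Each feature of the data is considered a random variable, and entropy is used as a measure of the uncertainty of the random variable. The entropy of a variable X is defined as:
$$H(X) = -\sum_i P(x_i) \log_2 P(x_i) \tag{1}$$
The entropy of X after observing the values of another variable Y is defined as:
$$H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j) \tag{2}$$
where $P(y_j)$ is the prior probability of each value of Y and $P(x_i|y_j)$ is the posterior probability of X given the value of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called information gain:
$$IG(X|Y) = H(X) - H(X|Y) \tag{3}$$
According to this measure, a feature Y is considered to be more correlated to feature X than to feature Z if $IG(X|Y) > IG(Z|Y)$. Symmetry is a desired property for a measure of correlation between features, and information gain is symmetrical. However, information gain is biased in favor of features with more values, and the values have to be normalized to ensure that they are comparable and have the same effect. Therefore, here we use symmetrical uncertainty as a measure of feature correlation (Press et al., 1998), defined as follows:
$$SU(X,Y) = 2\left[\frac{IG(X|Y)}{H(X) + H(Y)}\right] \tag{4}$$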
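To make the quantities above concrete, the following minimal sketch computes entropy, conditional entropy, information gain, and symmetrical uncertainty for discrete feature columns. It assumes the standard information-theoretic definitions, with SU(X, Y) = 2 IG(X|Y) / (H(X) + H(Y)); the function names and the toy columns are illustrative and are not taken from PubMiner.

```python
import math
from collections import Counter


def entropy(xs):
    """H(X) in bits, estimated from the observed value frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each value of Y, weighted by P(y)."""
    n = len(xs)
    h = 0.0
    for y, count_y in Counter(ys).items():
        subset = [x for x, yy in zip(xs, ys) if yy == y]
        h += (count_y / n) * entropy(subset)
    return h


def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0.0:  # both columns are constant; treat as uncorrelated
        return 0.0
    return 2.0 * information_gain(xs, ys) / denom


# Toy binary feature columns (hypothetical values).
x = [1, 1, 0, 0]
identical = [1, 1, 0, 0]    # carries exactly the same information as x
independent = [1, 0, 1, 0]  # carries no information about x

su_identical = symmetrical_uncertainty(x, identical)
su_independent = symmetrical_uncertainty(x, independent)
print(su_identical, su_independent)
```

The normalization by H(X) + H(Y) is what keeps SU in the range [0, 1] regardless of how many values each feature takes, which is the bias correction motivated in the text.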
With symmetrical uncertainty (SU) as the feature association measure, we define a feature selection procedure, similar to that of Yu (Yu et al., 2003), to reduce the computational complexity. To decide whether a feature is relevant to the protein interaction (interaction class) or not, we use c-correlation and f-correlation, both of which use a threshold SU value decided by the user. In the proposed method, the class C is divided into two classes, the conditional protein class ($C_C$) and the result protein class ($C_R$) of the interaction. Figure 4 shows the overall procedure of informative feature selection; the procedure is conducted for each interaction class. The two feature scoring measures used in the procedure of Figure 4, c-correlation ($SU_{i,c}$) and f-correlation ($SU_{j,i}$), are defined as follows:
Definition 1 (c-correlation $SU_{i,c}$ and f-correlation $SU_{j,i}$). Assume that dataset S contains N features and a class C ($C_C$ or $C_R$). Let $SU_{i,c}$ denote the SU value that measures the correlation between a feature $f_i$ and the class C (called c-correlation); then the subset $S'$ of relevant features can be decided by a threshold SU value $\delta$ such that
$$\forall f_i \in S',\ 1 \le i \le N:\ SU_{i,c} \ge \delta$$
The pairwise correlation between all features (called f-correlation) can be defined in the same manner as c-correlation with a threshold value $\delta$. The f-correlation is used to decide whether a relevant feature is redundant or not when considering it together with other relevant features.
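The two-stage filtering described in Definition 1 can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Figure 4: the function name `fdrf_filter`, the use of a single threshold `delta` for both c-correlation and f-correlation, and the rule "drop a feature whose f-correlation with an already kept, higher-ranked feature reaches `delta`" are our reading of the text. Yu and Liu's original filter uses a different redundancy test (it compares the f-correlation against the feature's own c-correlation).

```python
import math
from collections import Counter


def entropy(xs):
    """H(X) in bits, estimated from the observed value frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def su(xs, ys):
    """Symmetrical uncertainty, using IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0.0:
        return 0.0
    hxy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def fdrf_filter(features, labels, delta):
    """Return names of features that are relevant and not redundant.

    features: dict mapping a feature name to its column of discrete values.
    labels:   class column (e.g. membership in one interaction class).
    delta:    user-chosen SU threshold (assumed shared by both stages).
    """
    # Stage 1 (c-correlation): keep features with SU(feature, class) >= delta,
    # ranked from most to least relevant.
    scored = [(su(col, labels), name) for name, col in features.items()]
    relevant = sorted((t for t in scored if t[0] >= delta), key=lambda t: -t[0])

    # Stage 2 (f-correlation): skip a feature that is strongly correlated
    # with a feature already kept.
    kept = []
    for _, name in relevant:
        if all(su(features[name], features[k]) < delta for k in kept):
            kept.append(name)
    return kept


# Hypothetical binary feature columns for six proteins.
features = {
    "f1": [1, 1, 0, 0, 1, 0],
    "f2": [1, 1, 0, 0, 1, 0],  # duplicate of f1, hence redundant
    "f3": [1, 0, 1, 0, 1, 0],  # only weakly related to the class
}
labels = [1, 1, 0, 0, 1, 0]
selected = fdrf_filter(features, labels, delta=0.73)
print(selected)
```

Ranking by c-correlation first means that, among a group of mutually redundant features, the one most informative about the class is the one retained.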
To predict implicit interactions between entities from feature association, we use a conventional data mining method: the association rule discovery algorithm (the so-called Apriori algorithm) proposed by Agrawal (Agrawal et al., 1993). An association rule $R(A \Rightarrow B)$ has two values, support and confidence, which characterize the rule. Support (SP) represents the frequency of co-occurrence of all the items appearing in the rule. Confidence (CF) represents the accuracy of the rule and is computed by dividing the support value by the frequency of co-occurrence of the items in the conditional part of the rule. These are defined as
$$SP(A \Rightarrow B) = P(A \cup B), \qquad CF(A \Rightarrow B) = P(B|A) \tag{5}$$
where $A \Rightarrow B$ represents the association rule, and A and B represent its items (sets of features), in that order. Association rules are discovered by detecting all possible rules whose supports and confidences are larger than the user-defined threshold values called minimal support ($SP_{min}$) and minimal confidence ($CF_{min}$), respectively.
Rules that satisfy both the minimum support and minimum confidence thresholds are called strong. Here we consider these strong association rules to be the interesting ones. An interaction is represented as a pair of two entities that bind directly to each other. To analyze the interactions of entities with feature association, we treat each interacting entity pair as a transaction of the mining data. These transactions, in binary vector representation, are described in Figure 5. We then extract, by association rule mining, the associative features that generally represent the interaction.
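The support and confidence definitions above can be illustrated with a small level-wise search in the spirit of Apriori. This is a minimal sketch that omits Apriori's candidate-pruning optimization and is not PubMiner's implementation; the item labels such as `A:kinase` (a feature of the conditional protein) and `B:cyclin` (a feature of the result protein), and the thresholds used, are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_sp):
    """Level-wise search: grow frequent itemsets one item at a time."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_sp]
    frequent = []
    while level:
        frequent.extend(level)
        size = len(level[0]) + 1
        # Join step: unions of two frequent itemsets that are one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = [c for c in candidates if support(c, transactions) >= min_sp]
    return frequent


def strong_rules(transactions, min_sp, min_cf):
    """Return (A, B, support, confidence) for every strong rule A => B."""
    rules = []
    for itemset in frequent_itemsets(transactions, min_sp):
        sp = support(itemset, transactions)
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                cf = sp / support(lhs, transactions)  # estimate of P(B|A)
                if cf >= min_cf:
                    rules.append((lhs, itemset - lhs, sp, cf))
    return rules


# Each transaction is one interacting pair, written as the set of features
# that are "on" in its binary vector (hypothetical data).
transactions = [
    frozenset({"A:kinase", "B:cyclin"}),
    frozenset({"A:kinase", "B:cyclin", "B:nucleus"}),
    frozenset({"A:kinase", "B:cyclin"}),
    frozenset({"A:transporter", "B:membrane"}),
]
rules = strong_rules(transactions, min_sp=0.5, min_cf=0.75)
for lhs, rhs, sp, cf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"SP={sp:.2f} CF={cf:.2f}")
```

Representing each binary vector as the set of its active features keeps the subset test `itemset <= t` a direct translation of "all items of the rule co-occur in the transaction".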
To test our entity recognition and interaction extraction module, we built a corpus of 1,000 scientific abstracts randomly selected from PubMed abstracts that had been identified, via manual searches, as containing biological entity names and interactions. A biologist in our laboratory manually analyzed the corpus for biological entities such as protein, gene, and small molecule names, as well as for any interaction relationships present in each abstract. Analysis of the corpus revealed 5,928 distinct references to biological entities and a total of 3,182 distinct references to interaction relationships. Performance evaluation was done over the same set of 1,000 articles by capturing the set of entities and interactions recognized by the system and comparing this output against the manually analyzed results described above. Table 1 shows the statistics of the abstract document collection used for extraction performance evaluation.
Extraction performance was measured in terms of recall and precision:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP (true positive) is the number of biological entities or interactions that were correctly identified by the system and were found in the corpus, FN (false negative) is the number of biological entities or interactions that the system failed to recognize in the corpus, and FP (false positive) is the number of biological entities or interactions that were recognized by the system but were not found in the corpus. Performance test results of the extraction module in PubMiner are described in Table 2.
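The TP, FN, and FP counts defined above combine into recall and precision in the usual way. The following minimal sketch derives them by comparing a system output against a manually annotated set; the function name and the interaction pairs are hypothetical and do not reproduce the results of Table 2.

```python
def recall_precision(gold, predicted):
    """Compare system output against manual annotations.

    gold:      set of items annotated by hand in the corpus.
    predicted: set of items recognized by the system.
    Returns (recall, precision).
    """
    tp = len(gold & predicted)   # recognized and present in the corpus
    fn = len(gold - predicted)   # present in the corpus but missed
    fp = len(predicted - gold)   # recognized but absent from the corpus
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical interaction pairs for illustration only.
gold = {("Cdc28", "Cln2"), ("Cdc28", "Sic1"), ("Swi4", "Swi6"), ("Fus3", "Ste7")}
predicted = {("Cdc28", "Cln2"), ("Cdc28", "Sic1"), ("Swi4", "Swi6"), ("Fus3", "Far1")}

rec, prec = recall_precision(gold, predicted)
print(f"recall={rec:.2f} precision={prec:.2f}")
```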
To test the inference performance of PubMiner with feature selection (reduction), we used protein-protein interactions as the measure of entity recognition and interaction extraction. The major protein pairs of the interactions were obtained from the same data sources as Oyama (Oyama et al., 2002), which include MIPS, YPD, and the Y2H data of Ito et al. and Uetz et al. (Mewes et al., 2004). Additionally, we used SGD (Christie et al., 2004) to collect a richer feature set. Table 3 shows the statistics of the interaction data for each data source and the filtering results with FDRF.
As the first step of our inference method, we performed the feature filtering procedure of Figure 4 ($\delta = 0.73$) after encoding the features as shown in Figure 5. Next, we performed association rule mining, with a minimal support of 10% and a minimal confidence of 75%, on the protein interaction data with reduced features. With the mined feature associations, we then predicted new protein-protein interactions that had not been used in the association training step. The accuracy of prediction is measured by whether or not the predicted interaction exists in the collected dataset. The results were measured with 10-fold cross-validation.
Here, we presented a biomedical text mining system, PubMiner, which screens interaction data from literature abstracts through natural language analysis, performs inference based on machine learning and data mining techniques, and visualizes interaction networks with appropriate links to the evidence articles. To reveal more comprehensive interaction information, we employed a data mining approach with an optimal feature selection method in addition to conventional natural language processing techniques. The proposed method improved both accuracy and processing time.
Table 4 shows the advantage obtained by filtering non-informative (redundant) features and the inference performance of PubMiner. The accuracy of interaction prediction increased by about 3.4% with FDRF. The elapsed time of FDRF-based association mining, 143.27 sec, includes the FDRF processing time, which was 19.89 sec. The decrease in elapsed time obtained by using FDRF is about 32.5%. Thus, reducing the number of features of the interaction data is of great importance for improving both accuracy and execution performance. We therefore conjecture that the information theory-based feature filtering removed a set of misleading or redundant features from the interaction data, and that this feature reduction eliminated wrong associations and shortened the processing time. The feature association also shows promising results for inferring implicit interactions of biological entities. The results in Table 4 further suggest that with a smaller granularity of interaction (i.e., not proteins, but sets of features of proteins) we could achieve a more detailed investigation of protein-protein interactions. Thus we can say that the proposed method is a suitable approach for the efficient analysis of interacting entity pairs that have many features, both as a back-end module of general literature mining and for experimentally produced interaction data with moderate false positive ratios.
However, current public interaction data produced by high-throughput methods (e.g., Y2H) contain many false positives, and several of these false-positive interactions have been corrected by recent studies through reinvestigation with new experimental approaches. The study of new methods for resolving the problems related to false positive screening therefore remains as future work.