Classification of Human Papillomavirus (HPV) Risk Type via Text Mining
Human Papillomavirus (HPV) infection is known as the main factor for cervical cancer, which is a leading cause of cancer deaths in women worldwide. Because there are more than 100 types of HPV, it is critical to discriminate the HPVs related to cervical cancer from those not related to it. In this paper, we classify the risk type of HPVs using their textual explanation. The important issue in this problem is to distinguish false negatives from false positives. That is, we must find as many high-risk HPVs as possible, even though some low-risk HPVs may be misclassified as high-risk. For this purpose, AdaCost, a cost-sensitive learner, is adopted to consider different costs between training examples. The experimental results on the HPV sequence database show that the consideration of costs gives higher performance. The improvement in F-score is higher than that in accuracy, which implies that the number of high-risk HPVs found is increased.
Cervical cancer is a leading cause of cancer deaths in women worldwide; moreover, it is the leading cause of cancer deaths in Korean women. Since the main etiologic factor for cervical cancer is known to be high-risk Human Papillomavirus (HPV) infection (Schiffman et al., 1993), cervical cancer is now largely a preventable disease. HPV is a double-stranded DNA tumor virus that belongs to the papovavirus family.
There are more than 100 types of HPV that are specific for epithelial cells, including those of the skin, respiratory mucosa, and genital tract. Genital tract HPV types are classified by their relative malignant potential into low-risk and high-risk types (Janicek et al., 2001). The common, unifying oncogenic feature of the vast majority of cervical cancers is the presence of high-risk HPV. Therefore, the most important task for diagnosis and therapy is to determine whether a patient carries high-risk HPVs and which HPV types are high-risk.
One way to discriminate the risk types of HPVs is to use a text mining technique. Since a great number of research results on HPV have already been reported in biomedical journals (Furumoto and Irahara, 2002; Ishji, 2000), they can be used as a source for discriminating HPV risk types. One problem in discriminating the risk types is that the costs of misclassifying high-risk HPVs and low-risk HPVs are not identical, because high-risk HPVs are rare while low-risk HPVs are abundant. In addition, in classifying the risk types of HPVs, it is important to distinguish false negatives from false positives. That is, it is not critical to classify low-risk HPVs as high-risk ones, because they can be investigated by further empirical study. However, it is fatal to classify high-risk HPVs as low-risk ones: dangerous HPVs can be missed, and there is no further chance to detect the cervical cancer they may cause.
Most machine learning algorithms for classification problems have focused on minimizing the number of incorrect predictions. However, such learning algorithms ignore the differences in cost among the different types of incorrect prediction. Thus, there has recently been considerable interest in cost-sensitive learning (Provost and Fawcett, 1997). Ting and Zheng (1998) proposed two related but different cost-sensitive boosting approaches for tree classification. Their approaches can be applied only to situations where the costs change very often. To apply boosting to situations where misclassification costs are relatively stable, Fan et al. (1999) proposed AdaCost. In this paper, we propose a cost-sensitive learning method to classify the risk types of HPVs using their textual explanation. In classifying their risk types, we consider the learning cost of each example, because it is far more important to reduce the number of false negatives than to reduce that of false positives. For this purpose, we adopt AdaCost as the learning algorithm and show empirically that it performs well in classifying HPV risk types.
One advantage of this work is its usefulness for designing DNA chips for diagnosing the presence of Human Papillomavirus in cervical cancer patients. Since there are about 100 HPV types, making such a DNA chip requires choosing the dangerous types related to cervical cancer among them. Therefore, this classification of HPV risk types can greatly reduce the time needed to gather information about HPV and cervical cancer from many papers and to choose the HPV types used for the DNA chip. The rest of this paper is organized as follows. Section 2 describes the problems of standard machine learning algorithms. Section 3 describes the cost-sensitive learning used to classify HPV risk types. Section 4 explains how the HPV dataset is generated. Section 5 presents the experimental results. Finally, Section 6 draws conclusions.
First, let us check what happens if we do not consider the cost of each learning example. We classify the risk type of HPVs by their textual explanation given by the Los Alamos National Laboratory. The details of this explanation are given in Section 4. Table 1 shows the classification result of HPVs according to their risk types, obtained by a naive Bayes classifier (Lang, 1995) without considering the costs of the examples. This result was obtained using only seven HPVs as a training set. Among the seven HPVs, five are high-risk (HPV16, HPV18, HPV31, HPV33, HPV45), and the other two are low-risk (HPV6 and HPV11). Because the risk types of these HPVs are well known (Levy et al., 1994), they were chosen as the training set.
The number of tested HPVs is 69. Assuming that the risk types given in Table 2 below are correct, four of the 69 HPVs have unknown risk types, so 65 HPVs are evaluated. Twenty of the 65 HPVs are classified as high-risk and the remaining 45 as low-risk, while there are only 12 high-risk HPVs in Table 2. Since 53 HPVs are correctly classified, the accuracy is 81.54%.
At first glance, this accuracy seems reasonable. However, four of the 12 misclassified cases are false negatives and eight are false positives. This is not satisfactory, because false negatives are fatal, as stated above. The reasons why a method that ignores cost does not achieve high performance can be summarized as two problems.
The first problem is that too many HPVs are predicted as high-risk. That is, there are only 12 high-risk HPVs among the tested HPVs, but the naive Bayes classifier predicted 20 HPVs as high-risk. Although we used only two low-risk HPVs for training, in fact there are far more low-risk HPVs than high-risk HPVs. Therefore, it is necessary to give a higher cost to high-risk HPVs during training.
The other problem is that there are some HPVs whose risk types are difficult to determine from their textual explanation alone. For instance, HPV54 is explained by a single sentence: "HPV-54 was first isolated from a patient with condyloma acuminata." This problem is inevitable in text classification. Thus, it goes beyond the scope of this paper and should be resolved by further biomedical experiments.
In order to consider the misclassification cost of HPV risk types, we adopt the AdaCost algorithm (Fan et al., 1999). AdaCost is a variant of AdaBoost (Freund and Schapire, 1996) that uses the cost of misclassifications to update the training distribution on successive boosting rounds.
The algorithm is shown in Fig. 1. Let $S = \{(x_1, c_1, y_1), \ldots, (x_m, c_m, y_m)\}$ be a training set, where $c_i \in [0,1]$ is a cost factor that is given in addition to the normal example $(x_i, y_i)$ and $y_i \in \{-1, +1\}$. First of all, the distribution over the examples is set to $D_1(i) = c_i / \sum_{j=1}^{m} c_j$. When $t$ is an index for the round of boosting, $D_t(i)$ is the sampling weight given to $(x_i, c_i, y_i)$ at the $t$-th round. And $\alpha_t > 0$ is a parameter that weights the weak learner $h_t$ at the $t$-th round; its value is given as

$\alpha_t = \frac{1}{2} \ln \frac{1 + r}{1 - r}, \qquad r = \sum_i D_t(i)\, y_i\, h_t(x_i)\, \beta(i),$

where $\beta(i)$ is a cost adjustment function with two arguments, $\mathrm{sign}(y_i h_t(x_i))$ and $c_i$. If $h_t(x_i)$ is correct, then $\beta(i) = -0.5\, c_i + 0.5$; otherwise $\beta(i) = 0.5\, c_i + 0.5$.
The main difference between AdaBoost and AdaCost is how the distribution $D_t$ is updated. AdaCost has an additional cost adjustment factor in updating $D_t$ (see step 4 in Fig. 1). As in AdaBoost, the weight of an instance is increased if it is misclassified and decreased otherwise. However, the weight change is affected by the value of the cost factor: when an instance has a high cost factor, the weight change is greater than for an instance with a low cost factor.
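To make the update rule concrete, the following Python sketch shows one boosting round of the cost-sensitive weight update described above. It is our own illustration, not the authors' code: the names adacost_round and cost_adjustment and the weak_learn interface (which is assumed to return a classifier function producing predictions in {-1, +1}) are hypothetical.

import numpy as np

def cost_adjustment(correct, c):
    # beta(i): shrink the step for correctly classified examples and enlarge it
    # for misclassified ones, scaled by the cost factor c in [0, 1]
    return -0.5 * c + 0.5 if correct else 0.5 * c + 0.5

def adacost_round(X, y, costs, D, weak_learn):
    # X: feature vectors, y: labels in {-1, +1}, costs: cost factors in [0, 1]
    # D: current sampling distribution over the m training examples
    h = weak_learn(X, y, D)                       # train a weak learner on distribution D
    pred = h(X)                                   # predictions in {-1, +1}
    beta = np.array([cost_adjustment(p == t, c)   # beta(i) for each example
                     for p, t, c in zip(pred, y, costs)])
    r = np.sum(D * y * pred * beta)               # cost-weighted edge of the weak learner
    alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))   # weight alpha_t of the weak learner
    D_new = D * np.exp(-alpha * y * pred * beta)  # costly mistakes gain the most weight
    D_new /= D_new.sum()                          # renormalize to a distribution
    return h, alpha, D_new

Repeating this round T times and combining the weak learners as sign of the alpha-weighted sum of their predictions yields the final classifier, as in AdaBoost.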
We have previously proposed the BayesBoost algorithm and showed that it gives great efficiency in text filtering (Kim et al., 2000). It uses naive Bayes classifiers as its weak learners within AdaBoost. Assume that a document $d_i$ is composed of a sequence of words $w_{i1}, w_{i2}, \ldots, w_{i|d_i|}$, that the words in a document are mutually independent, and that the probability of a word is independent of its position within the document. Though these assumptions do not hold in real situations, naive Bayes classifiers show rather good performance in text classification (McCallum and Nigam, 1998).
Due to the independence assumption, the probability that a document $d_i$ is generated from the class $y_j$ can be expressed as

$P(d_i \mid y_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{ik} \mid y_j; \theta)^{N(w_{ik},\, d_i)}$   (1)

where $w_{ik}$ denotes the $k$-th word in the document $d_i$, $N(w_k, d_i)$ the weight of the word $w_k$ occurring in the document $d_i$, and $|d_i|$ the number of words in the document. Thus, assuming that $P(|d_i|)$ is uniform, the best class $y^*$ of a document $d_i$ is determined by

$y^* = \arg\max_{y_j} P(y_j \mid \theta) \prod_{k=1}^{|d_i|} P(w_{ik} \mid y_j; \theta)^{N(w_{ik},\, d_i)}.$

In order to calculate this probability, we need to determine $P(w_k \mid y_j; \theta)$ and $P(y_j \mid \theta)$. These two values can be estimated as

$P(w_k \mid y_j; \theta) = \frac{1 + \sum_i N(w_k, d_i)\, P(y_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_i N(w_s, d_i)\, P(y_j \mid d_i)}, \qquad P(y_j \mid \theta) = \frac{\sum_i P(y_j \mid d_i)}{|D|}.$

Here, $|V|$ is the size of the vocabulary and $|D|$ is the number of training documents.
One of the advantages of using a naive Bayes classifier as a weak learner is that it naturally utilizes term weights such as term frequency. Moreover, because it is a probabilistic model, it provides a natural measure for the confidence ratios used in AdaBoost. Thus, in this paper, we also use naive Bayes classifiers as the weak learners of AdaCost.
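The following Python sketch illustrates one way such a weighted naive Bayes weak learner can be realized under the estimates above; the class name WeightedNaiveBayes and its fit/predict interface are our own illustrative assumptions, and the boosting distribution is passed in as sample_weight.

import numpy as np

class WeightedNaiveBayes:
    # A sketch of a multinomial naive Bayes learner over term-weight vectors
    # (e.g., tf-idf), usable as the weak learner inside a boosting loop.

    def fit(self, X, y, sample_weight=None):
        # X: (n_docs, |V|) matrix of term weights N(w_k, d_i); y: class labels
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        if sample_weight is None:
            sample_weight = np.ones(len(y))
        self.classes_ = np.unique(y)
        self.log_prior_ = np.zeros(len(self.classes_))
        self.log_word_prob_ = np.zeros((len(self.classes_), X.shape[1]))
        for j, c in enumerate(self.classes_):
            w = sample_weight * (y == c)
            self.log_prior_[j] = np.log(w.sum() / sample_weight.sum())      # P(y_j)
            counts = (X * w[:, None]).sum(axis=0)          # weighted word counts for class y_j
            probs = (1.0 + counts) / (X.shape[1] + counts.sum())  # Laplace smoothing with |V|
            self.log_word_prob_[j] = np.log(probs)         # log P(w_k | y_j)
        return self

    def predict(self, X):
        # y* = argmax_j  log P(y_j) + sum_k N(w_k, d_i) * log P(w_k | y_j)
        scores = self.log_prior_ + np.asarray(X, dtype=float) @ self.log_word_prob_.T
        return self.classes_[np.argmax(scores, axis=1)]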
In general, research in the biomedical domain starts by investigating previous studies in PubMed, which is designed to provide access to citations from the biomedical literature and is available via the NCBI Entrez retrieval system developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the National Institutes of Health (NIH). Most bioinformatics research that handles text information has focused on PubMed as its resource, because it includes most summaries and citations of the biomedical literature. However, learning HPV risk types from PubMed is not easy. The difficulties can be summarized as two problems.
• The PubMed data are too sparse.
For example, there are 3,797 articles about HPV and cervical cancer in PubMed, but most of them do not discuss the risk of HPV directly. Thus, it is difficult to capture the risk of HPV from the articles. In addition, the term distribution differs greatly according to the interest of each article.
• Poor performance of NLP techniques
Current natural language processing (NLP) techniques are not yet capable of text understanding; they cannot even provide fully reliable syntactic information. The best we can expect from NLP techniques is morphological analysis and part-of-speech tagging. Thus, the articles need to be refined for further study.
In this paper, we use the HPV Sequence Database of the Los Alamos National Laboratory as our dataset. This papillomavirus database is an extension of the HPV compendiums published in 1994, 1995, 1996, and 1997, and provides the complete list of papillomavirus types and hosts and the records for each unique papillomavirus type. An example of the data drawn from this database is given in Fig. 2. This example is for HPV80 and consists of three parts: definition, source, and comment. The definition indicates the HPV type, the source explains where the information on this HPV was obtained, and the comment gives the explanation of this HPV.
To measure the performance of the results in the experiments below, we manually classified HPV risk types using the 1997 version of the Human Papillomaviruses compendium and the comment in the records of the HPV types. The classification procedure is as follows. First, we roughly divided the HPV types by the groups in the 1997 version of the Human Papillomaviruses compendium. These groups are shown in Fig. 3. This tree, which contains 108 Papillomavirus (PV) sequences, was computed for the L1 consensus primer region (CPR) using the neighbor-joining method and a distance matrix calculated with a modified Kimura two-parameter model (transition/transversion ratio 2.0). Neighbor-joining analysis is a convenient and rapid way to obtain an initial estimate of branching relationships, especially when a large number of taxa are involved. In the figure, the outermost wide gray arcs show the five PV supergroups (A-E). Each tree branch is labeled with an abbreviated sequence name. For HPVs, the type number alone is given in most cases, so the branch labeled 40 is that of HPV40.
<source>
Human papillomavirus type 80.
</source>
<comment>
The DNA genome of HPV80 (HPV15-related) was isolated from histologically normal skin, cloned, and sequenced. HPV80 is most similar to HPV15, and falls within one of the two major branches of the B1 or Cutaneous/EV clade. The E7, E1, and E4 orfs, as well as the URR, of HPV15 and HPV80 share sequence similarities higher than 90%, while in the usually more conservative L1 orf the nucleotide similarity is only 87%. A detailed comparative sequence analysis of HPV80 revealed features characteristic of a truly cutaneous HPV type [362]. Notice in the alignment below that HPV80 compares closely to the cutaneous types HPV15 and HPV49 in the important E7 functional regions CR1, pRb binding site, and CR2. HPV80 is distinctly different from the high-risk mucosal viruses represented by HPV16. The locus as defined by GenBank is HPVY15176.
</comment>
Second, if the group is a skin-related or cutaneous HPV group, its members are classified as low-risk. Third, if the group is known to contain high-risk, cervical cancer-related HPV types, its members are classified as high-risk. Lastly, we used the comments of the HPV types to classify the types that were difficult to classify by group. Table 2 shows the resulting classification of HPVs according to their risk.
In all the experiments below, we used only the <comment> part. The comment for an HPV type can be considered as a document in text classification. Therefore, each HPV type is represented as a vector whose elements are tf·idf values. In the tf·idf scheme, the weight $N(w_j, d_i)$ of Equation (1) for a word $w_j$ appearing in the document $d_i$ is given as

$N(w_j, d_i) = tf(w_j, d_i) \cdot \log \frac{|D|}{n_j},$

where $tf(w_j, d_i)$ is the frequency of $w_j$ in $d_i$, $|D|$ is the total number of documents, and $n_j$ is the number of documents in which $w_j$ occurs at least once.
After stemming the documents using Porter's algorithm (Porter, 1980) and removing words in the stop-list, the size of the vocabulary is just 1,434. Thus, each document is represented as a 1,434-dimensional vector.
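As an illustration of this representation step, the following Python sketch builds tf·idf vectors from the <comment> texts. It is a minimal sketch under our own assumptions: the stop-list shown is only a small sample, and the names tokenize and tfidf_vectors are hypothetical.

import math
import re
from collections import Counter
from nltk.stem import PorterStemmer   # Porter's stemming algorithm (Porter, 1980)

STOP_LIST = {"the", "a", "of", "and", "in", "is", "was", "to", "with"}  # illustrative sample only

def tokenize(text, stemmer=PorterStemmer()):
    # lowercase, split on non-letters, drop stop-list words, and stem the rest
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words if w not in STOP_LIST]

def tfidf_vectors(comments):
    # comments: list of <comment> strings, one per HPV type
    docs = [Counter(tokenize(c)) for c in comments]
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in d)      # document frequency n_j of each word
    vectors = []
    for d in docs:
        # N(w_j, d_i) = tf(w_j, d_i) * log(|D| / n_j)
        vectors.append([d[w] * math.log(len(docs) / df[w]) for w in vocab])
    return vocab, vectors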
Text classification has various measures to evaluate its performance. One of these is the break-even point (Lewis, 1995). However, Schapire et al. (1998) asserted that the break-even points are not very suitable for measuring the performance of classification algorithms.
In this paper, we evaluate the classification performance using the contingency table method. In this method, recall and precision are defined as follows:
$recall = \frac{a}{a+c} \times 100\%$

$precision = \frac{a}{a+b} \times 100\%$   (3)

$accuracy = \frac{a+d}{a+b+c+d} \times 100\%$

where $a$, $b$, $c$ and $d$ are defined in Table 3: $a$ is the number of high-risk HPVs correctly predicted as high-risk, $b$ the number of low-risk HPVs predicted as high-risk (false positives), $c$ the number of high-risk HPVs predicted as low-risk (false negatives), and $d$ the number of low-risk HPVs correctly predicted as low-risk. The $F_\beta$-score, which combines precision and recall, is defined as

$F_\beta = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall},$

where $\beta$ is the weight of recall relative to precision. We use $\beta = 1$ in all experiments, which corresponds to equal weighting of the two measures.
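A minimal Python sketch of these measures, assuming the contingency-table convention for a, b, c, and d given above; the function name scores is our own, and the usage comment applies it to the counts reported in Section 2.

def scores(a, b, c, d, beta=1.0):
    # a: high-risk correctly predicted as high-risk, b: false positives,
    # c: false negatives, d: low-risk correctly predicted as low-risk
    recall = a / (a + c) * 100.0
    precision = a / (a + b) * 100.0
    accuracy = (a + d) / (a + b + c + d) * 100.0
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return recall, precision, accuracy, f_beta

# For the cost-insensitive naive Bayes result of Section 2 (8 true positives,
# 8 false positives, 4 false negatives, 45 true negatives), scores(8, 8, 4, 45)
# gives an accuracy of 81.54% but an F1-score of only about 57%.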
Since we have only 74 HPV types and the explanation of each HPV is relatively short, leave-one-out (LOO) cross-validation is used to measure the performance of the proposed method. We normalized each cost $c_i$ to the range [0, 1] so that the costs of the two classes sum to one; that is, the cost for low-risk HPVs is set to 0.1 when the cost for high-risk HPVs is set to 0.9.
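The evaluation loop can be sketched in Python as follows; the names loo_evaluate and train_adacost and the predict() interface are hypothetical, and the default 0.8/0.2 cost pair corresponds to the setting found best in the experiments below.

import numpy as np

def loo_evaluate(X, y, train_adacost, high_cost=0.8, low_cost=0.2, rounds=50):
    # Leave-one-out evaluation: each HPV type is held out once, AdaCost is
    # trained on the remaining types with class-dependent cost factors, and
    # the held-out type is then classified.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    predictions = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i                          # mask for the training fold
        costs = np.where(y[train] == +1, high_cost, low_cost)   # +1 denotes high-risk
        model = train_adacost(X[train], y[train], costs, rounds=rounds)
        predictions[i] = model.predict(X[i:i + 1])[0]
    return predictions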
Fig. 4 demonstrates the performance of AdaCost. The graphs in this figure show the accuracy and F-score according to the round of AdaCost. Each graph corresponds to a ratio of costs for high-risk and low-risk HPVs. For instance, figure (a) imposes a cost of 0.1 on high-risk HPVs and 0.9 on low-risk HPVs. Because both costs in figure (e) are set to 0.5, it shows the performance of AdaBoost. Figures (a)-(d) plot the performance when lower costs are imposed on high-risk HPVs than on low-risk HPVs, and figures (f)-(i) plot the performance when higher costs are imposed on high-risk HPVs.
Generally, when we set different costs for low-risk and high-risk HPVs, higher performance is obtained than with AdaBoost, shown in figure (e), except in the extreme cases represented by figures (a) and (i). Among the nine graphs, figure (h) shows the best performance, which implies that 0.8 is the best cost for high-risk HPVs. It is also interesting that figure (a) shows the worst performance; in this case AdaCost performs worse than AdaBoost. Therefore, if we impose the wrong costs, we may obtain worse results.
Table 4 summarizes the graphs in Fig. 4. These results are obtained when 50 weak learners are used in each AdaCost run. The accuracy is similar across the various costs, but different costs yield different F-scores. As shown in Equation (3), precision and recall are related to the number of high-risk HPVs found, while accuracy is related to the number of correctly predicted HPVs, including both low-risk and high-risk HPVs. In our experiments, F-scores are higher than accuracies, which implies that fewer high-risk HPVs are missed by the proposed method.
Table 5 shows the predicted risk types of the HPV types whose risks are not known exactly. These HPVs are marked with '?' in Table 2. According to previous research on HPV (Chan et al., 1997; Favre et al., 1990; Meyer et al., 1998; Nuovo et al., 1988), HPV70 seems to be misclassified. This is because the comment for HPV70 does not describe its risk; owing to the lack of biomedical research on it, the comment explains only that it was found in the cervix of patients and that its sequence was analyzed.
This paper proposed a practical method to determine the risk type of Human Papillomavirus. In classifying the risk type, it is important to distinguish false negatives from false positives, where false negatives are high-risk HPVs that are misclassified as low-risk and false positives are low-risk HPVs misclassified as high-risk.
For this purpose, we set different costs for low-risk and high-risk HPVs. As a learning algorithm, we adopted AdaCost and showed empirically that it outperforms AdaBoost, which does not consider learning costs. In addition, the experimental results gave a higher F-score than accuracy, which means that more high-risk HPVs are found by AdaCost. This result is important because, as stated above, high-risk HPVs should not be missed. Since HPV is known as the main cause of cervical cancer, high-risk HPVs must be found for further medical investigation of the patients.
Our results can be used as fundamental information for designing DNA chips for diagnosing the presence of HPV in cervical cancer patients. Because it is too expensive to test all HPV types, the results presented in this paper reduce the time and monetary cost of determining their relation to cervical cancer.