@ewha-bio:89
Identification of Caenorhabditis elegans MicroRNA Targets Using a Kernel Method
MicroRNAs (miRNAs) are a class of noncoding RNAs found in various organisms such as plants and mammals.
However, most of the mRNAs regulated by miRNAs are unknown. Furthermore, miRNA targets in genomes cannot be identified by standard sequence comparison, since their complementarity to the target sequence is generally imperfect. In this paper, we propose a kernel-based method for the efficient prediction of miRNA targets. To help distinguish false positives from potentially valid targets, we elucidate the features common to experimentally confirmed targets.
The performance of our prediction method was evaluated by five-fold cross-validation. Our method achieved a sensitivity of 0.64 and a specificity of 0.98. The proposed method also reduced the number of false positives by half compared with TargetScan. We investigated the effect of feature sets on the classification of miRNA targets. Finally, we predicted miRNA targets for several miRNAs in the Caenorhabditis elegans (C. elegans) 3' untranslated region (3' UTR) database.
The targets predicted by the suggested method will help in validating more miRNA targets and ultimately in revealing the role of small RNAs in the regulation of genomes. Our algorithm for miRNA target site detection can be improved with additional experimental knowledge. In addition, an increase in the number of confirmed targets is expected to reveal general structural features that can be used to improve their detection.
MicroRNAs (miRNAs) are endogenous ~22 nt RNAs that act as post-transcriptional regulators in animals and plants by binding to mRNAs and either inducing their cleavage or blocking their translation into proteins (Bartel, 2004). The first members of the miRNA family, lin-4 and let-7, were discovered in C. elegans; they are components of the gene regulatory network that controls the timing of C. elegans larval development (Lee et al., 1993; Wightman et al., 1993; Ha et al., 1996; Moss et al., 1997; Seggerson et al., 2002; Abrahante et al., 2003; Lin et al., 2003; Reinhart et al., 2000). Recently, miRNAs were reported to be implicated in the control of cell proliferation (Brennecke et al., 2003), cell death and fat metabolism in flies (Xu et al., 2003), the control of leaf and flower development in plants (Aukerman et al., 2003; Chen, 2004; Emery et al., 2003; Palatnik et al., 2003), and hematopoietic lineage differentiation (Chen et al., 2004), though no targets were confirmed in these studies. These findings suggest a broad range of possible functions for miRNAs. The overall importance of miRNAs is further supported by the observation that many miRNAs have tissue-specific or developmental stage-specific expression patterns, and by their evolutionary conservation, which is very strong within mammals and often extends to invertebrate homologs (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee et al., 2001; Lai, 2003; Lim et al., 2003a; Lim et al., 2003b; Pasquinelli et al., 2000; Aravin et al., 2001; Lagos-Quintana et al., 2002; Lagos-Quintana et al., 2003; Ambros et al., 2003; Dostie et al., 2003; Houbaviy et al., 2003; Krichevsky et al., 2003).
However, although several functions of miRNAs have been uncovered, the factors and mechanisms underlying miRNA function are still unknown. The functional annotation of miRNAs is difficult because miRNAs are small and the experimental identification of their targets is inefficient. Therefore, a computational method to identify the target genes that are regulated by miRNAs would greatly help the study of miRNA function in animals (Ambros, 2001).
Targets for plant miRNAs have been identified on a genome-wide scale by searches that require a high degree of sequence complementarity (Rhoades et al., 2003; Tang et al., 2003). However, most animal miRNAs are thought to recognize the 3' UTR of their target mRNAs via partial complementarity (Lee et al., 1993; Wightman et al., 1993; Moss et al., 1997; Reinhart et al., 2000; Zeng et al., 2002; Doench et al., 2003). Because of this partial complementarity, simple homology-based searches have failed to uncover targets for miRNAs (Ambros et al., 2003; Bartel and Bartel, 2003). Recently, carefully designed computational approaches have been used to predict mRNA targets for Drosophila (Stark et al., 2003; Enright et al., 2003) and mammalian miRNAs (Lewis et al., 2003).
The methods used by Lewis et al. (2003) and Stark et al. (2003) incorporated conservation of the mRNA target site in related organisms to separate signal from noise. However, these methods have a high false positive rate because they rely on inferences from the free energy of miRNA/target duplex folding.
We used a kernel method to classify the miRNA targets. Kernel methods are popular in modern statistics, particularly in probability density estimation and regression function approximation. In this paper, we propose an efficient kernel-based method that predicts C. elegans miRNA targets and present computational evidence that the most important factor in determining miRNA targets is the 5' region of the miRNA.
Experimentally validated miRNA/target duplexes contain mismatches, gaps, and loops at different positions. Such duplex structures make it difficult to identify targets within whole-genome or transcriptome databases, since standard alignment methods produce many false positives with such short, variable sequences. Furthermore, the small number of validated examples makes the correct identification of miRNA targets by traditional classifiers difficult.
In this paper, we analyzed the various characteristics of miRNA/target duplexes to reduce the false positive rate and classified the features into four major categories. Fig. 1 presents the four main characteristics of the miRNA/target duplexes. First, we used the structure information of miRNA/target duplexes. Even though the miRNA sequence has diverged, the secondary structure in confirmed miRNA/target pairs might be conserved. Therefore, we extracted the structural features based on the characteristic patterns of the secondary structure. Then, we calculated the distances of RNA secondary structures using the RNAdistance software (Hofacker et al., 1994) and compared the similarity of secondary structure of miRNA/target duplex using a Markov Chain model. Also, we counted the number of match and mismatch sites in the miRNA/target duplexes.
Second, we used the features of the miRNA 5' region. The ability of a miRNA to translationally repress a target mRNA is largely dictated by the free energy of binding of the first eight nucleotides in the 5' region of the miRNA (Doench et al., 2004). Moreover, the 5' ends of related miRNAs tend to be better conserved than the 3' ends (Lim et al., 2003; Mallory et al., 2004), further supporting the hypothesis that these segments are most crucial for miRNA target recognition. Therefore, we used the structure and the free energy data extracted from the miRNA 5' region/target duplexes.
Third, we used the free energy of miRNA/target duplex formation. The pairing of the miRNA 5' region to the mRNA is sufficient to cause repression, and the free energy of this interaction is an important determinant of activity. The 3' region of the miRNA is less critical, but can modulate activity in certain circumstances (Doench et al., 2004). Therefore, we calculated three free energies: that of the whole miRNA/mRNA duplex, that of the miRNA 5' region/mRNA duplex, and that of the miRNA 3' region/mRNA duplex.
Lastly, we used the G:U wobble pairing feature because G:U wobble pairing is highly detrimental to miRNA function despite its favorable contribution to RNA:RNA duplexes (Doench et al., 2004). Table 1 presents all features described above.
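The pair-counting features described above can be illustrated with a minimal sketch. It assumes the duplex has already been aligned position by position, with no gaps or loops, which is a simplification of the folded duplexes used in this study. The function names and the target string in the usage note are hypothetical.

```python
# Illustrative sketch only: pairs are compared position by position,
# ignoring the gaps and loops that a folding program would report.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pairs(mirna_5to3, target_3to5):
    """Count Watson-Crick matches, G:U wobble pairs, and mismatches.

    Index i of the miRNA (read 5'->3') is assumed to face index i of
    the target site (read 3'->5').
    """
    counts = {"match": 0, "wobble": 0, "mismatch": 0}
    for a, b in zip(mirna_5to3, target_3to5):
        if (a, b) in WATSON_CRICK:
            counts["match"] += 1
        elif (a, b) in WOBBLE:
            counts["wobble"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def seed_features(mirna_5to3, target_3to5, seed_len=8):
    """Same counts, restricted to the first seed_len nt of the miRNA 5' end."""
    return classify_pairs(mirna_5to3[:seed_len], target_3to5[:seed_len])
```

For example, `seed_features("UGAGGUAG", "ACUCUAUC")` pairs the first eight nucleotides of let-7 with a made-up site and reports seven matches and one G:U wobble pair.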
Mfold (Zuker, 2003) is a program package for RNA secondary structure prediction using nearest-neighbor thermodynamic rules. Using Mfold, we searched for the most stable binding sites for RNA sequences in the 3' UTR database. From its output we calculated the free energy, the number of G:U wobble pairs, and the numbers of matches and mismatches.
We used the RNAdistance software of the Vienna RNA package (Hofacker et al., 1994) to calculate distances between the analysed RNA secondary structures. RNAdistance accepts structures in bracket format, where matching brackets symbolize base pairs and unpaired bases are represented by a dot, or in coarse-grained representations in which hairpins, interior loops, bulges, multi-loops, stacks and external bases are each represented by a single symbol. We calculated the distance by
where $N$ is the number of positive training data, $s_{\mathrm{query}}$ is a query structure, and $s_i$ is a structure of the positive training data (see the eighth feature in Table 1).
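To make the bracket format concrete, the following minimal sketch recovers the base pairs encoded by a dot-bracket string. It only illustrates the notation and is not part of RNAdistance; the function name is hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the sorted list of (i, j) index pairs encoded by a dot-bracket string.

    '(' and ')' mark the two partners of a base pair; '.' marks an unpaired base.
    """
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)
```

For instance, `base_pairs("((..))")` yields the pairs (0, 5) and (1, 4), with positions 2 and 3 unpaired.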
A Markov chain is a random process with the property that the next state is conditionally independent of the past given the current state. Markov chains are useful for biological structure analysis because biological information can be incorporated into their structure. We calculated the Markov chain probability according to the frequency of the states given in Fig. 2. The structural probability was calculated by
where $f(x_i)$ is the structural probability for position $i$, $f(\mathrm{con})$ is a low-valued constant that prevents the argument of the logarithm from going to zero, and $p(x_i)$ is the background probability. We constructed the Markov chain as in equations (3) and (4), where s and t are given states.
We used the Markov chain probability to compare the RNA secondary structures of miRNA/mRNA duplexes (see the ninth and tenth features in Table 1).
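As a rough illustration of the idea, the sketch below estimates first-order transition probabilities from state sequences and scores a query sequence by its log-probability. It is not a reconstruction of the equations used in this study: the state alphabet, the additive constant (`pseudocount`), and the function names are all assumptions made for the example, and the background term is omitted.

```python
import math


def fit_transitions(sequences, states, pseudocount=1.0):
    """Estimate P(t | s) from observed state sequences.

    A small constant is added to every count so that no transition
    has probability zero (and no logarithm is undefined).
    """
    counts = {s: {t: pseudocount for t in states} for s in states}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values())
        probs[s] = {t: counts[s][t] / total for t in states}
    return probs


def log_probability(seq, probs):
    """Sum of log transition probabilities along one state sequence."""
    return sum(math.log(probs[s][t]) for s, t in zip(seq, seq[1:]))
```

With hypothetical states `M` (paired) and `X` (unpaired), a model fitted on `["MMMX", "MMXM"]` assigns a higher log-probability to `"MMMM"` than to `"XXXX"`, because paired-to-paired transitions dominate the training sequences.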
We used a support vector machine (SVM) to classify miRNA targets from the mRNA 3' UTR database. This method has attracted a lot of attention because of its successful application in pattern recognition (Scholkopf et al., 1999). The kernel trick used in SVMs is applicable not only to classification but also to other linear techniques (Vapnik, 1998). An SVM obtains the optimal boundary between two sets in a vector space independently of the probabilistic distributions of the training vectors in the sets.
Its fundamental idea is to locate the boundary that is most distant from the vectors nearest to the boundary in both sets. Let x be a vector in a vector space. A boundary hyperplane is expressed in the form $f(x) = w^{T}x + b$, where w is a weight coefficient vector and b is a bias term. Maximizing the distance between the boundary and the nearest training vectors $x_i$, called the margin, reduces to maximizing $1/\lVert w \rVert^{2}$. Consequently, the optimization is formalized as follows.
This constrained optimization can be achieved by Lagrange's method of undetermined multipliers.
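For reference, the textbook hard-margin formulation can be written as below, taking the boundary to be $f(x) = w^{T}x + b$. This is the standard form rather than a recovered copy of the equation in this paper; the class labels $y_i \in \{-1, +1\}$, the number of training vectors $n$, and the multipliers $\alpha_i$ are notation assumed here.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{T}x_i + b) \ge 1, \qquad i = 1, \dots, n

L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^{T}x_i + b) - 1 \,\right],
\qquad \alpha_i \ge 0
```

Minimizing $\frac{1}{2}\lVert w \rVert^{2}$ is equivalent to maximizing $1/\lVert w \rVert^{2}$, and the second line is the Lagrangian whose saddle point gives the solution.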
If the sets are not linearly separable, slack variables are introduced so that training vectors are allowed to exist in a limited region on the erroneous side of the boundary, with the complexity parameter C controlling the penalty. In this paper, we used the polynomial kernel $K(x, x') = (x \cdot x' + 1)^{p}$ and the sequential minimal optimization (SMO) algorithm (Keerthi et al., 2001; Platt et al., 1999) to train our SVM. SMO, which can be viewed as the most extreme case of the decomposition methods, is one of the most popular optimization algorithms for SVMs. It allows fast convergence with small memory requirements, even on large problems.
The learning methods used in this study were obtained from the WEKA machine learning package. We used the SMO algorithm with the complexity parameter C= 4 and the polynomial kernel exponent = 5. All other parameters were default values.
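A polynomial kernel with exponent 5 can be sketched in a few lines. The form $(x \cdot y + 1)^{p}$ is assumed here; this is an illustration of the kernel function itself, not of WEKA's implementation, and the function name is hypothetical.

```python
def polynomial_kernel(x, y, exponent=5, offset=1.0):
    """Polynomial kernel (x . y + offset) ** exponent for two equal-length vectors.

    The offset of 1 is an assumption about the kernel form used in the study.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + offset) ** exponent
```

For orthogonal vectors the dot product is zero and the kernel value is 1; for `x = [1.0, 2.0]` and `y = [3.0, 0.5]` the dot product is 4, giving 5 to the fifth power, or 3125.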
The overall process of our method is described in Fig. 3.
We extracted structural information, free energy of miRNA/target site interaction and 8 nucleotides (nt) information of miRNA 5' end as input data from negative and positive datasets. Then, input data was classified by the kernel method implemented by SVM. The computational experiments in this study were performed by the WEKA machine learning package (Witten and Frank, 1999) ( http://www.cs.waikato.ac.nz/~ml/weka/ ).
The training data for the SVM classifier is a set of RNA/RNA pairs divided into positive and negative samples. The positive training set consists of 39 experimentally defined miRNA/target pairs (Slack et al., 2000; Lin et al., 2003; Banerjee et al., 2002; Poy et al., 2004; Stark et al., 2003). The positive samples include seven pairs of lin-14/cel-let-7, three pairs of lin-14/cel-lin-4, one pair of lin-28/cel-lin-4, one pair of lin-28/cel-let-7, one pair of lin-41/cel-lin-4, one pair of lin-41/cel-let-7, five pairs of daf-12/cel-let-7, one pair of hbl/cel-lin-4, ten pairs of hbl/cel-let-7, four pairs of hid/dme-bantam, one pair of HLHm3/dme-miR-7, one pair of hairy/dme-miR-7, one pair of rpr/dme-miR-2, one pair of grim/dme-miR-2 and one pair of Mtpn/mir-375.
The negative training set consists of 1022 random sequence/target pairs. To build it, we searched for high-affinity binding sites for random sequences in the C. elegans 3' UTR database ( ftp://bighost.ba.itb.cnr.it/pub/Embnet/ Database/UTR/data/). The random sequences, similar in length to miRNAs (approximately 18-22 nt), were produced by site-independent sampling. We extracted the random sequence/target pairs that have more than six perfect Watson-Crick pairs within the 8 nt at the 5' end of the sequence and a thermodynamically stable free energy of less than -8.5 kcal/mol. In addition, all sequence pairs with structures like those in Fig. 4, which contain additional branch structures, were removed from the negative data, since such structures were not found in the positive datasets.
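The sampling and filtering steps for the negative set might look like the following sketch. It assumes uniform base frequencies, an ungapped seed alignment, and a free energy supplied by an external folding program; these choices and all function names are illustrative assumptions, not details taken from this study.

```python
import random

BASES = "ACGU"
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def random_rna(rng, min_len=18, max_len=22):
    """Draw each position independently (uniform base frequencies assumed)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(BASES) for _ in range(length))


def seed_wc_pairs(seq_5to3, site_3to5, seed_len=8):
    """Count Watson-Crick pairs in the first seed_len positions (no gaps)."""
    seed = zip(seq_5to3[:seed_len], site_3to5[:seed_len])
    return sum((a, b) in WATSON_CRICK for a, b in seed)


def passes_filter(seq_5to3, site_3to5, free_energy, min_pairs=7, max_energy=-8.5):
    """Keep pairs with more than six seed pairs and free energy below -8.5 kcal/mol.

    free_energy is expected to come from a folding program; it is not
    computed here.
    """
    return seed_wc_pairs(seq_5to3, site_3to5) >= min_pairs and free_energy < max_energy
```

A candidate with eight perfect seed pairs and a free energy of -10.0 kcal/mol passes, while the same pairing at -8.0 kcal/mol is rejected.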
The objective of this study is to construct classifiers that can correctly classify the miRNA target genes from the 3' UTR database. The performance of our classification method was evaluated by five-fold cross-validation. That is, the whole data set was partitioned into five subsets; four of the subsets were used as a training set and the remaining one as a test set, and this process was repeated five times so that each subset served once as the test set. Table 2 shows the sensitivity and the specificity of the five-fold cross-validation in classifying the miRNA target genes using the kernel machine. The sensitivity was 0.64 and the specificity was 0.98.
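The evaluation procedure can be sketched as follows. The fold assignment below is a simple interleaved split chosen for brevity; a real experiment would normally shuffle or stratify the data first. The function names are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels are 1 for a target and 0 for a non-target.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```

Predictions collected from the five held-out folds can be pooled and passed to `sensitivity_specificity` to obtain overall figures of the kind reported in Table 2.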
We also compared the performance of our method with that of TargetScan (Lewis et al., 2003), using 32 of the 39 positive training data and 1751 random negative data. TargetScan combines thermodynamics-based modelling of RNA/RNA duplex interactions with comparative sequence analysis to predict miRNA targets conserved across multiple genomes. Fig. 5 shows that the sensitivity and the specificity of our SVM-based method are similar to or better than those of TargetScan. More importantly, the number of false positives is much lower: 35 for our method compared with 77 for TargetScan. This suggests that our method produces fewer spurious predictions than TargetScan at comparable sensitivity.
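As a worked check of what these counts imply, and assuming both false positive counts refer to the same 1751 negative data (an assumption; the values plotted in Fig. 5 may be computed differently), the corresponding specificities would be:

```latex
\text{specificity} = 1 - \frac{FP}{N_{\text{neg}}}

\text{proposed method:}\quad 1 - \frac{35}{1751} \approx 0.980
\qquad
\text{TargetScan:}\quad 1 - \frac{77}{1751} \approx 0.956
```

The ratio 35/77 is about 0.45, which is consistent with the statement that the number of false positives was reduced by roughly half.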
We applied our method to the let-7 and lin-4 miRNA genes, whose targets are known (Wightman et al., 1993; Ha et al., 1996; Moss et al., 1997; Olsen et al., 1999; Seggerson et al., 2002; Slack et al., 2000; Abrahante et al., 2003; Lin et al., 2003; Banerjee et al., 2002) (Tables 3 and 4). The predicted miRNA targets contain all the known let-7 targets: lin-41, daf-12, lin-14, hbl and lin-28. This shows that most miRNA/target pairs can be detected by an SVM classifier with high specificity in genome-wide searches. Table 5 presents the predicted miRNA targets of miR-228, miR-229, miR-230 and miR-231.
In this section, we investigate the effect of the feature set on the performance of the miRNA target classifier. We excluded one feature at a time from the entire feature set and examined how much each feature contributes to the performance of the classifier. Fig. 6 presents each feature's influence on the performance. A feature whose removal yields a lower sensitivity is a more important factor in deciding miRNA targets. The three features whose removal gave the lowest sensitivity were the number of G:U wobble pairs in the 8 nt of the miRNA 5' region (feature (4)), the number of matches in the 8 nt of the miRNA 5' region (feature (1)), and the similarity of the secondary structure of the miRNA 5' region/mRNA duplex under a Markov chain model (feature (10)). All of them are related to the miRNA 5' region. Every experiment had a similar specificity (97% - 98%). Moreover, Fig. 7 shows that the features of the miRNA 5' region carry the most important information for deciding whether a site is a miRNA target, further supporting the hypothesis that the miRNA 5' region is most critical for miRNA target recognition (Doench et al., 2004).
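The leave-one-feature-out procedure can be sketched generically. The `evaluate_sensitivity` callback stands in for retraining and cross-validating the classifier on a reduced feature set; it and the function name are hypothetical placeholders.

```python
def ablation_ranking(features, evaluate_sensitivity):
    """Rank features by the sensitivity obtained when each one is left out.

    evaluate_sensitivity(kept_features) must return the sensitivity of a
    classifier trained on kept_features. Features whose removal gives the
    lowest sensitivity come first, i.e. the most important ones.
    """
    scores = {}
    for left_out in features:
        kept = [f for f in features if f != left_out]
        scores[left_out] = evaluate_sensitivity(kept)
    return sorted(scores.items(), key=lambda item: item[1])
```

In practice the callback would wrap the cross-validation loop, so one call to `ablation_ranking` performs as many retraining runs as there are features.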
The application of computational approaches to the prediction of miRNA targets is often hindered by the small size of miRNA sequences. The prediction is further complicated by the tendency of miRNA/mRNA pairings in animals to be imperfect. We presented a kernel-based classification method to overcome these problems and tried to identify potential miRNA targets with an RNA-folding program that evaluates the structural and thermodynamic plausibility of the predicted pairs and distinguishes real from random matches. Kernel-based statistical learning methods have a number of advantages for the analysis not only of the vectorial and matrix data common in classical statistical analysis but also of more exotic data types such as strings, trees and graphs. The ability to handle such data is clearly essential in the biological domain. The kernel-based method also provides significant opportunities for the incorporation of more specific biological knowledge and unlabelled data.
We demonstrated that an SVM classifier for the computational identification of miRNA target sites can detect these sites with high specificity. The results of this method should provide a better understanding of how miRNAs bind their targets. To help distinguish false positives from potentially valid targets, we identified the features shared by valid targets. The method can also be applied to other species, because many of these miRNAs are phylogenetically conserved, suggesting strong evolutionary pressure. Functional target sites are conserved in homologous genes from related species (Moss and Tang, 2003), so the performance could be improved through the analysis of orthologous 3' UTRs. Efforts to find more animal miRNA targets will be greatly helpful because they will deepen our understanding of the structural and biochemical nature of miRNA/mRNA pairing. The targets predicted by the proposed method will help in validating more miRNA targets and ultimately in revealing the role of these small RNAs in the regulation of the genome.