PMC:1892782 / 14559-16819

    2_test

    {"project":"2_test","denotations":[{"id":"17540014-15665081-1690050","span":{"begin":103,"end":105},"obj":"15665081"},{"id":"17540014-15665081-1690051","span":{"begin":1248,"end":1250},"obj":"15665081"}],"text":"1.2 Training of Support Vector Machine\nAlignments for training were taken from the same sources as in [15] including representatives for rRNAs, spliceosomal RNAs, tRNAs, miRNAs, small nucleolar RNAs, nuclear RNaseP and SRP RNA. Sequence similarity in this data set ranges from 47% to 99% mean pairwise identity in alignments of 40 nt to 400 nt length and of 2 to 6 sequences. The detailed distributions of mean pairwise identity, length, number of sequences and GU base pair content are given in the supplementary material (see Additional file 1). A total of 5886 ClustalW alignments, approximately equally representing these ncRNA families, were used for training after removing alignments that were not recognized as structured RNA by RNAz in both reading directions. This data set was split into two subsets of equal size, namely the positive and negative training set. Alignments in the negative training set were transformed to the reverse complement and realigned with ClustalW as opposed to simply using the reverse complementary alignment of the structured RNA.\nThe number of sequences a training alignment contains is limited to 6 as the SVM regression procedure to estimate the z-scores is trained with alignments of at most 6 sequences [15]. If an alignment has more than 6 sequences, a subalignment with optimal mean pairwise identity may be chosen with the Perl script rnazWindow.pl [17] of the RNAz package.\nWe use libsvm 2.8 [18] with SVM type C_SVC, a radial basis function (RBF) kernel, probability estimates and descriptor vectors scaled linearly to the interval [-1, +1]. The scaling prevents descriptors with a large variance from dominating the classification. 
The values for the RBF kernel parameter γ and the penalty parameter C were identified by a grid search in the parameter space, using grid.py of the libsvm 2.8 package with 5-fold cross-validation on the training data. Maximum prediction accuracy is achieved with C = 128 and γ = 0.5.\nThe SVM returns an estimated class probability p that the ncRNA is found in the reading direction of the input alignment. We convert p into a score D = 2p - 1, so that D ≈ +1 means \"RNA in reading direction of input alignment\" while D ≈ -1 means \"RNA is reverse complement of input alignment\"."}
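
The two numeric transformations described in the text above, linear scaling of descriptors to [-1, +1] and the conversion of the class probability p into the score D = 2p - 1, can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `scale_linear` and `direction_score` are hypothetical, the handling of a constant descriptor is an assumption, and the code does not reproduce the scaling tool or the probability estimation of libsvm.

```python
def scale_linear(values, lo=-1.0, hi=1.0):
    """Map one descriptor's values linearly onto the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant descriptor carries no spread, so it is
        # placed at the midpoint of the target interval.
        return [(lo + hi) / 2.0 for _ in values]
    span = v_max - v_min
    return [lo + (hi - lo) * (v - v_min) / span for v in values]


def direction_score(p):
    """Convert a class probability p in [0, 1] into D = 2p - 1 in [-1, +1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 2.0 * p - 1.0


# Hypothetical descriptor values and probability, for illustration only.
scaled = scale_linear([0.0, 5.0, 10.0])
score = direction_score(0.5)
```

With this convention, p = 0.5 gives D = 0, meaning the classifier expresses no preference for either reading direction.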