PMC:3475488 / 6493-11976
Annotations
2_test
{"project":"2_test","denotations":[{"id":"23105930-18478122-44845670","span":{"begin":4359,"end":4361},"obj":"18478122"},{"id":"23105930-16868079-44845671","span":{"begin":4709,"end":4711},"obj":"16868079"},{"id":"23105930-17617639-44845672","span":{"begin":4903,"end":4905},"obj":"17617639"}],"text":"Results and Discussion\n\nA comparison of BLAT and Sim4cc\nAs a prerequisite, it was assumed that the intron annotated information was correct and complete. Then, the software's results were compared with the annotated information. Three sets of results of intron information were obtained: two sets from the software (BLAT and Sim4cc) and one set from the annotated information (Table 4).\nUsing BLAT, we found 99.35% and 99.87% of the introns of all rice and Arabidopsis annotated introns, respectively. These introns were almost all of the introns in the genome - that is, only 0.13% to 0.65% of the introns were not found. In contrast, by using Sim4cc, 95.15% to 98.98% of the introns were found (1.02% to 4.85% of the introns were lost) of all rice and Arabidopsis annotated introns. In summary, BLAT got more of the introns in a genome than Sim4cc. In light of this result, it seems as though that BLAT produces better results than Sim4cc.\nWe found 30,194 rice genes with at least one intron by BLAT, but the number was 30,177 according to the annotated information. Because the BLAT results were larger than the annotated results, the BLAT results must have predicted some new and different genes with introns. In the BLAT results, many short-length introns (less than 50 bp) were predicted, but in fact, these short-length introns were part of transcript sequences and were not real intron sequences. In contrast, Sim4cc detected 29,875 genes with introns, and all of these genes were contained in the annotation information. The predicted intron accuracy rate of Sim4cc was 100%. 
On accuracy, Sim4cc was better than BLAT.\nIf Sim4cc is used, the user has to splice a whole genome file to many files: one gene, one file. The computing process of Sim4cc was more complex than that of BLAT, and each time, Sim4cc only calculated one cDNA sequence to one gene sequence; so, the executing efficiency and speed are not high. In comparison, BLAT was easier and faster than Sim4cc.\nIn conclusion, BLAT and Sim4cc can be used to predict introns, but each of them has its advantages and disadvantages. The comparative results are summarized in Table 5. Sim4cc was a cross-species spliced alignment program. In our study, Sim4cc was used to find introns by comparing cDNA sequences and gene sequences. The correct intron can be obtained by comparing one cDNA sequence with its own gene sequence. But, a lot of introns were lost by Sim4cc. In other words, Sim4cc was good at detecting the correct intron but not at predicting the whole number of introns in a genome. In contrast, BLAT can predict most of the introns - nearly all of the total introns in a genome. But, there were some false-positive predictions of introns. However, the proportion of this error was very small. As a result, BLAT will be proposed to annotate plant genome introns.\n\nIntron length distribution of 10 plants\nAccording to Roy's method, many predicted introns in the plant genomes had in-frame stop codons, and the predicted introns in these genomes were equally as likely to be a multiple of 3 bp (3n) as to contain a plus one (3n + 1) or two (3n + 2) bp. Here was an example of three phases from an Arabidopsis thaliana gene, AT1G17600.1 (Fig. 2).\nBy analyzing genome sequence annotations, we got three-phase intron distributions for 10 plant species (Table 6). If the plant intron annotation is more accurate, the number of three phases should be similar (one-third each). For 80% (8/10) of species, there were similar numbers of the three phases. 
It should be noted that most of these plant species annotations were the best annotations to date, but new annotations will be continually released to correct errors and false-positive results.\n\nTwo-species 3n intron skew analysis\nFor all of the 10 genomes (Table 6), there were very similar numbers of 3n + 1 and 3n + 2 introns, and the percentages of 3n + 1 and 3n + 2 introns were within 0.8%. In contrast, the number of 3n introns varied much more widely, from 29.1% to 47.7%. In this study, two species' genome introns showed strongly skewed percentages, in that the 3n intron percentage was much lower or higher than the expected value (one-third). Such a skew suggests systematic errors in the intron prediction.\nThe green alga Ostreococcus lucimarinus has one of the highest gene densities known in eukaryotes, with many introns [28]. There was a striking excess of predicted 3n introns (47.7% of all predicted introns, 1,130) compared to 3n + 1 (25.8%, 611) and 3n + 2 (26.5%, 628) introns. In this case, many predicted 3n introns were not true introns but instead exons.\nThe unicellular green alga Ostreococcus tauri is the world's smallest free-living eukaryote known to date [29]. These predicted introns showed a deficit of 3n introns (29.1%, 1,262), much lower than 3n + 1 (35.8%, 1,553) and 3n + 2 (35%, 1,519) introns. This result is very close to previous studies [18]. In this case, 3n introns may be mistakenly regarded as coding sequences, whereas a 3n + 1 or 3n + 2 intron may be inferred from the disruption of the coding frame.\n\nConcluding remarks\nBy comparing the advantages and disadvantages of BLAT and Sim4cc in intron prediction, we found that BLAT is faster and predicts more introns than Sim4cc. Checking the distribution of intron lengths modulo 3 is a simple and fast method for detecting a variety of possible systematic biases in intron prediction, or even problems with genome assemblies."}