Data preparation
Test exons were downloaded from FlyBase [38] using Drosophila melanogaster annotation version 4.2.1. Each gene locus was partitioned into non-overlapping intervals, with each interval containing all annotated overlapping exons. Test exons were selected from transcripts annotated with the same start/stop codon pair with splice sites located in the coding region. In a few cases, multiple overlapping test exons are spliced to neighboring exons in such a way that portions of the known exon sequence contain multiple functional overlapping reading frames. Currently ExAlt assumes the occurrence of a single reading frame, thus limiting its sensitivity in some cases. We plan to incorporate explicit prediction of multiple overlapping reading frames in the near future, to further improve sensitivity on the test set.
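The partitioning step above can be illustrated with a minimal sketch. It assumes exons are given as half-open `(start, end)` coordinate pairs on one strand of a single locus, and that "overlapping" is applied transitively, so chains of overlapping exons fall into one interval. The function name `partition_locus` and the example coordinates are hypothetical and are not taken from the ExAlt implementation.

```python
from typing import List, Tuple

# Hypothetical representation: half-open (start, end) coordinates on one strand.
Exon = Tuple[int, int]


def partition_locus(exons: List[Exon]) -> List[List[Exon]]:
    """Group exons into non-overlapping intervals.

    Each returned group holds every exon that overlaps (directly or through
    a chain of overlaps) with the other exons in that group.
    """
    groups: List[List[Exon]] = []
    current_end = None  # right edge of the interval being built
    for start, end in sorted(set(exons)):
        if current_end is None or start >= current_end:
            # No overlap with the current interval: open a new one.
            groups.append([(start, end)])
            current_end = end
        else:
            # Overlaps the current interval: add the exon and extend the edge.
            groups[-1].append((start, end))
            current_end = max(current_end, end)
    return groups


# Illustrative coordinates only (not real FlyBase annotations).
example = [(100, 200), (150, 260), (400, 480), (400, 450), (900, 950)]
regions = partition_locus(example)
```

Sorting by start coordinate lets a single pass decide whether each exon extends the current interval or opens a new one, so intervals that contain more than one distinct exon are the candidates for alternative splicing.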
An initial set of 606 regions annotated with alternative splicing was searched for duplicate sequences. WU-blastn 2.0 [39] was used for an 'all against all' search, and repeat sequence was removed when two exons matched with an E-value < 10^-21, leaving 600 exon regions. Each remaining non-redundant exon region was extracted from its originating genome location with 400 bases of flanking intron sequence (or the length of the adjacent intron, whichever was shorter). The 400-base cutoff was chosen to limit the potential for aligning long stretches of poorly conserved intron sequence, while retaining reasonably long stretches of sequence for predicting alternative splicing patterns. WU-blastn 2.0 was then used to find potential homologs in the three informant species D. simulans, D. yakuba, and D. erecta. The D. simulans genome was downloaded from the UCSC Genome Browser [40]. The D. yakuba and D. erecta genomes were downloaded from [41]. (D. simulans and D. yakuba sequence was generated by the Genome Sequencing Center, WUSTL School of Medicine, and D. erecta sequence was generated by Agencourt.) The best matching sequence with an E-value < 10^-19 was retained along with 50 bases of flanking sequence for input to the multiple sequence alignment program muscle [42]. All aligned sequences were required to differ in length by at most 10% from the query D. melanogaster sequence. N-SCAN and Augustus predictions were downloaded from the UCSC Genome Browser [43,44], and SNAP predictions were downloaded from [45]. Branch lengths were obtained from [46].
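The flank rule and the length filter described above amount to simple arithmetic, sketched below. The sketch assumes half-open coordinates and interprets the 10% criterion as relative to the length of the D. melanogaster query; whether the compared lengths include flanking sequence is not stated in the text, so that choice is left to the caller. The names `flank_window` and `length_ok` are hypothetical.

```python
MAX_FLANK = 400       # flanking intron cutoff stated in the text
MAX_LEN_DIFF = 0.10   # allowed relative length difference stated in the text


def flank_window(exon_start, exon_end, upstream_intron_len,
                 downstream_intron_len, max_flank=MAX_FLANK):
    """Return (start, end) of the exon region plus flanking intron sequence.

    Each flank is max_flank bases or the full adjacent intron,
    whichever is shorter.
    """
    left = min(max_flank, upstream_intron_len)
    right = min(max_flank, downstream_intron_len)
    return exon_start - left, exon_end + right


def length_ok(query_len, informant_len, max_diff=MAX_LEN_DIFF):
    """True if the informant length is within max_diff of the query length.

    Assumption: the difference is measured relative to the query length.
    """
    return abs(informant_len - query_len) <= max_diff * query_len


# Illustrative values only: a 75-base upstream intron limits the left flank,
# while the long downstream intron is capped at 400 bases.
window = flank_window(1000, 1200, upstream_intron_len=75,
                      downstream_intron_len=5000)
```

With these illustrative numbers the extracted window spans 925 to 1600, and an informant sequence of 1,090 bases would pass the length filter against a 1,000-base query while one of 1,101 bases would not.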