Background
High-throughput sequencing of expression data provides compelling evidence that the long-held hypothesis "one gene produces one protein" holds far less often than previously thought. Surveys of the human genome estimate that as many as 70% of human genes produce more than one transcribed form [1]. Examples from a variety of metazoan organisms confirm that a significant number of genes produce multiple distinct transcripts [2,3]. Alternative splicing is an important biological mechanism for producing multiple distinct transcripts from a single gene locus: exons are joined at different exon-intron junctions to produce differing mRNAs. In some cases, alternative exon splicing leads to different functional proteins, thereby increasing protein diversity. In other cases, an alternatively spliced exon leads to non-functional mRNA, effectively regulating gene expression [3].
Given an input genomic sequence and the locations of gene regions, our goal is to find the functional exons originating from each gene locus, identifying their respective amino acid codons and splice sites. Figure 1 shows examples of the alternatively spliced exons examined in this study: intron retention (IR), cassette exon (CE), and multiple splice sites (MS). Also considered are constitutive exons (CS), defined as exons included with the same splice site boundaries in all functional mRNA forms.
Figure 1. Three forms of alternative splicing: intron retention (IR), cassette exon (CE), and multiple splice sites (MS).
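The exon categories above can be made concrete with a small sketch that compares one exon against the exons of a reference isoform. This is only an illustration of the definitions, not the prediction model described in this article: the function names `relation` and `is_constitutive`, the half-open coordinate convention, and all coordinates are assumptions introduced here.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # hypothetical half-open [start, end) genomic interval
Isoform = List[Exon]     # exons of one mRNA form, sorted by start


def introns(isoform: Isoform) -> List[Exon]:
    """Return the gaps between consecutive exons of an isoform."""
    return [(a[1], b[0]) for a, b in zip(isoform, isoform[1:])]


def relation(exon: Exon, reference: Isoform) -> str:
    """Label how `exon` relates to a single reference isoform."""
    start, end = exon
    if exon in reference:
        return "same"
    # IR: the exon runs straight through a complete reference intron.
    if any(start < i_start and i_end < end for i_start, i_end in introns(reference)):
        return "IR"
    # CE: the exon lies strictly inside a reference intron (it is skipped there).
    if any(i_start < start and end < i_end for i_start, i_end in introns(reference)):
        return "CE"
    # MS: the exon overlaps a reference exon but has different boundaries.
    if any(s < end and start < e for s, e in reference):
        return "MS"
    return "none"


def is_constitutive(exon: Exon, isoforms: List[Isoform]) -> bool:
    """CS: the exon appears with identical boundaries in every isoform."""
    return all(exon in iso for iso in isoforms)


full = [(100, 200), (300, 400), (500, 600)]
skipping = [(100, 200), (500, 600)]
print(relation((300, 400), skipping))   # CE
print(relation((100, 400), full))       # IR
print(relation((300, 420), full))       # MS
print(is_constitutive((100, 200), [full, skipping]))  # True
```

The pairwise comparison is a deliberate simplification: it labels the exon of interest only, and a real annotation pipeline would also need to handle strand, partial transcripts, and combinations of events.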

Related work
Gene expression data provide evidence for large numbers of alternatively spliced genes [4-10]. The most reliable high-throughput evidence for alternative splicing comes from full-length cDNAs, which are limited in coverage across biological states. Expressed Sequence Tags (ESTs) supplement the coverage of full-length cDNAs but still fail to capture all expressed forms [11,12]. Genomic sequence patterns can potentially be used to identify alternative splicing in less commonly expressed genes, and recent work has focused on developing computational methods to predict alternative splicing without direct evidence of gene expression. This work falls into two types: explicit and implicit alternative splicing prediction.

Explicit alternative splicing prediction
Sorek et al. examined cassette exons in human and mouse and found a striking pattern of increased intron conservation that distinguishes them from constitutive exons [13]. A list of features was compiled, including exon length, sequence conservation and k-mer counts [14,15], and these features were used to train a support vector machine (SVM) [15] to classify cassette and constitutive exons. Yeo et al. developed a regularized least-squares classifier, called ACESCAN [16], to identify cassette exons in human/mouse orthologs using a similar feature set. An SVM cassette exon classifier was developed for Caenorhabditis elegans using only single-species features and was extended to predict cassette exons in intron sequence [17]. Drosophila melanogaster exons matched to Drosophila pseudoobscura orthologs with conserved flanking intron sequence were observed by Philipps et al. to be enriched for alternatively spliced exons [18].
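As a rough illustration of the kind of feature vector such classifiers consume, the sketch below computes exon length, k-mer counts, and the identity of an aligned flanking intron. It is not the feature set of any cited method: the function names, the choice of k, the gap character, and the identity measure are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def aligned_identity(a: str, b: str) -> float:
    """Fraction of alignment columns where both rows carry the same base.

    `a` and `b` are rows of a pairwise alignment with '-' for gaps
    (an assumed input format); gap columns count as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    if not a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(a.upper(), b.upper()))
    return matches / len(a)


def exon_features(exon_seq: str, intron_row_a: str, intron_row_b: str,
                  k: int = 2) -> Dict[str, float]:
    """Build a flat feature dictionary for one candidate exon."""
    counts = kmer_counts(exon_seq, k)
    features: Dict[str, float] = {
        "length": len(exon_seq),
        "flanking_intron_identity": aligned_identity(intron_row_a, intron_row_b),
    }
    # One feature per possible k-mer so every exon yields the same columns.
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        features[f"kmer_{kmer}"] = counts[kmer]
    return features


feats = exon_features("ACGAC", "AC-T", "ACGT", k=2)
print(feats["length"], feats["kmer_AC"], feats["flanking_intron_identity"])
```

A dictionary like this could be turned into a numeric vector and passed to any standard classifier; fixing the k-mer columns in advance keeps the vectors comparable across exons.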

Implicit alternative splicing prediction
An alternative approach is to predict multiple overlapping gene structures, or a single gene structure overlapping existing alternative annotation. Explicit features of alternative splicing are not scored, but by virtue of having multiple overlapping high-scoring gene structures, alternative splicing is implied. One method sampled paths [19] in the generalized hidden Markov model (GHMM) of the single isoform gene finder SLAM [20]. Recurring, overlapping, high-scoring parses were reported as candidates for alternative splicing. Another approach is to find an exon splicing pattern with the highest scoring alignment to profile hidden Markov models (profile-HMMs) [21]. The human genome was searched for cassette exons and intron retention events using a reference annotation [22]. Predicted gene structures with scores exceeding that of the reference gene structure were inferred to be examples of alternative splicing.
The work most similar to the model introduced in this article is the pair-HMM UNCOVER [23], which finds exons in sequence annotated as introns and was tested on human/mouse intron pairs. Unlike the cassette exon classification methods [15-17], its models were trained using examples of protein coding exons without explicitly distinguishing between constitutive exons and cassette exons. Since the input sequence is assumed to be an intron, predicted exons are inferred to be alternatively spliced.
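The path-sampling idea can be sketched with a toy model. The example below uses a plain two-state HMM (exon "E", intron "I") rather than a GHMM, and every probability in it is an invented placeholder; it only shows the mechanics of running the forward algorithm and then drawing parses by stochastic traceback, so that parses sampled repeatedly can be tallied. It is not the procedure of [19] or the SLAM model.

```python
import random
from collections import Counter
from typing import Dict, List

STATES = ("E", "I")  # toy states: exon and intron
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {  # placeholder emission probabilities, not trained values
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(seq: str) -> List[Dict[str, float]]:
    """Forward table: f[t][s] = P(seq[:t+1], state at t is s)."""
    table = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for base in seq[1:]:
        prev = table[-1]
        table.append({
            s: EMIT[s][base] * sum(prev[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        })
    return table


def sample_path(table: List[Dict[str, float]], rng: random.Random) -> str:
    """Draw one state path from the posterior by stochastic traceback."""
    state = rng.choices(STATES, weights=[table[-1][s] for s in STATES])[0]
    path = [state]
    for t in range(len(table) - 2, -1, -1):
        # Weight each predecessor by its forward value times the transition taken.
        weights = [table[t][r] * TRANS[r][state] for r in STATES]
        state = rng.choices(STATES, weights=weights)[0]
        path.append(state)
    return "".join(reversed(path))


def recurring_parses(seq: str, n: int = 1000, seed: int = 0) -> Counter:
    """Sample n parses and count how often each distinct parse occurs."""
    rng = random.Random(seed)
    table = forward(seq)
    return Counter(sample_path(table, rng) for _ in range(n))


counts = recurring_parses("GCGCGCATATAT", n=200, seed=1)
for parse, freq in counts.most_common(3):
    print(parse, freq)
```

For realistic sequence lengths the forward table would need log-space or scaled arithmetic to avoid underflow, and a GHMM would additionally model explicit state durations.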
The method presented in this article extends the GHMMs used in single isoform gene finding [24] to explicitly model features of alternative and constitutive exons. The features of the explicit alternative splicing prediction methods (k-mer counts, exon lengths, and sequence conservation) are used to predict multiple splice sites and intron retention events along with cassette exons and constitutive exons. Cross-species sequence conservation is incorporated using components of the single isoform phylogenetic HMM gene finders [25-27]. Following the phylogenetic shadowing principle, we assume that a multiple sequence alignment can be obtained from closely related species [28]. In contrast, the pair-HMM method simultaneously predicts a pairwise alignment and the exon structure, making it potentially better suited to incorporating a difficult-to-align, more distantly related organism. Conservation over greater evolutionary distances may improve discriminative power in identifying functional nucleotides, but with the potential trade-off of detecting a smaller set of conserved alternative splicing events [29].
The remainder of the article describes our computational prediction model and reports on prediction accuracy in Drosophila melanogaster.