Graphical model of alternative splicing The prediction model is designed to predict multiple overlapping exons in a sequence believed to contain a single exon or intron, rather than the complete gene from start codon to stop codon. Input is expected to be a target sequence previously annotated by a single-isoform gene annotation tool such as a gene finder, cDNA alignment or some other annotation source. In cases where the input sequence contains untranslated regions, it is assumed that the coding boundary is known. Thus, the problem of translation start/stop site prediction is not addressed here. Alternative splicing increases the number of candidate acceptor/donor pairs compared to the constitutive exon equivalent. Figure 2 shows four candidate splice sites, an acceptor site a0, and three donor sites d0, d1, and d2. In a single isoform gene finder, only one of the four exons labeled constitutive in Figure 2 represent a viable exon. Allowing for alternative splicing means all three donors sites are potentially functional. For example, in Figure 2, the eighth candidate splicing type from the top has two functional donor sites d0(marked MD1) and d1 (marked MD2), leading to two different functional exons. More than two functional donor or acceptor sites can occur leading to a model of unbounded size. The combinatorial possibilities are reduced to a finite number using one symbol for each splice type to represent functional splice sites over 2 in number. For example, MDN is the symbol used to represent the third functional donor site, d2 in Figure 2. Figure 2 Alternative and single isoform exon candidates. Four splice sites are shown, one acceptor a0, and three donor sites, d0, d1, and d2, and the begin (B) and end (E) of the input sequence. There are four candidate constitutive exons (Constitutive Exon), three candidate cassette exons (Cassette Exon), and candidate exons with multiple functional donor sites (Multiple Splice Site Exon). See text for a description of the splice types: single acceptor (SA), cassette acceptor (CA), single donor (SD), cassette donor (CD), multiple donor 1 (MD1), multiple donor 2 (MD2), and multiple donor above two in number (MDN). Donor sites are divided into five types: single functional constitutive donor SD, alternative donor for cassette exons CD and multiple functional donors MD1, MD2, and MDN. MD1 is the left most functional donor, MD2 is the donor immediately downstream of MD1, and MDN represents additional downstream donors. The classification scheme similarly extends to acceptor sites: SA (single constitutive acceptor), CA (acceptor for cassette exon), MA1 (first multiple acceptor), MA2 (second multiple acceptor), and MAN (multiple acceptors greater than 2). The intron retention splice site labeled GT in Figure 1 (the 5' end of the retained intron) forms the basis of the splicing types: SD-IR, MD1-IR, MD2-IR, and MDN-IR. The intron retention acceptor labeled AG in Figure 1 (3' end of the retained intron) forms the basis for the splicing types: SA-IR, MA1-IR, MA2-IR, and MAN-IR. There are five end of sequence conditions: beginning of the sequence (Beg), end of a constitutive intron (END-INTRONC), end of an alternative intron (END-INTRONA), end of a constitutive exon (END-EXONC), and end of an alternative exon (END-EXONA). Splice sites and end of sequence conditions are called signals and ordered signal pairs define the exon/intron intervals in an alternative exon splicing model. Figure 3 shows a portion of the model and two example sets of states aligned to genomic sequence. The model in Figure 3 predicts alternative splicing in internal exons for the three splicing types in Figure 1 plus constitutive exons. The states represent sequence intervals between pairs of signals. The top right example in Figure 3 shows an initial "Upstream Constitutive Intron" state between signal pair (Beg, SD), which marks an intron proximal to a constitutive splice site followed by states for each downstream exon interval: "Internal First Exon of IR" (SD,SD-IR), "Retained Intron" (SD-IR,SA-IR), and "Internal Last Exon of IR" (SA-IR,SD), ending in the "Downstream Constitutive Intron" state (SD,END-INTRONc). The bottom right example in Figure 3 shows "Upstream Alternative Intron" (Beg,MA1), "Multiple Acceptor 1" (MA1,MA2), "Single Donor" (MA2,SD), and "Downstream Constitutive Intron" (SD, END-INTRONC). States not shown in Figure 3 model rarer forms of splicing, including combinations of alternative splicing events and splicing in exons at the end of genes. The complete model is given in the Methods section. Figure 3 Left image shows a portion of the graphical model for alternatively spliced exons. The right side of the figure shows two examples of parsing a target sequence. The top right example parses an intron retention sequence and the bottom right example parses a multiple splice site sequence. Blue states output partial subsequence of alternatively spliced exons, beige states are exons beginning with an acceptor and ending with a donor. Green states are introns. Phylogenetic generalized hidden Markov model definition A phylogenetic generalized hidden Markov model (PGHMM) extends a model described in the single isoform gene finder Shadower [27]. The method described here models higher order nucleotide dependencies and is applied to the alternative exon splicing model introduced in Figure 3. An input multiple sequence alignment X = S0,..., Sm includes the target sequence S0 and m informant species. X[k] is the kth column in X and X[i,j] are the columns from position i to j inclusive, the PGHMM is defined to be a 7 tuple, (Q, π, Σ, R, ψ, O, L): • Q – the set of states with states q, q'. ∈ Q • Pπ(q) – the probability of beginning in state q • Σ – the set of nucleotides {A, C, G, T} emitted in the model • PR(q|q') – the transition probabilities from state q' to state q • ψ – the set of phylogenetic parameters • PO (X [i, j]|q, ψ) – the probability of emitting sequence alignment columns from i to j in state q using phylogenetic parameter set ψ • PL, q (j - i + 1) – the probability of the state q emitting the series of columns of length j - i + 1 The parse of multiple sequence alignment X is a series of partitions t = (t0, t1, ..., tn), with state qi outputting a contiguous series of columns in X from position bi to ei inclusive in partition ti = (bi, ei, qi). The parse spans the entire multiple sequence alignment X so that bi + 1 = ei + 1. The joint probability between parse t and sequence X is: P ( t , X ) = P O ( X [ b 0 , e 0 ] | q 0 , ψ ) × P π ( q 0 ) × P L , q 0 ( e 0 + 1 ) × ∏ i = 1 n P O ( X [ b i , e i ] | q i , ψ ) × P R ( q i | q i − 1 ) × P L , q i ( e i − b i + 1 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeWabaaabaGaemiuaaLaeiikaGIaemiDaqNaeiilaWIaemiwaGLaeiykaKIaeyypa0dabaGaemiuaa1aaSbaaSqaaiabd+eapbqabaGccqGGOaakcqWGybawcqGGBbWwcqWGIbGydaWgaaWcbaGaeGimaadabeaakiabcYcaSiabdwgaLnaaBaaaleaacqaIWaamaeqaaOGaeiyxa0LaeiiFaWNaemyCae3aaSbaaSqaaiabicdaWaqabaGccqGGSaaliiGacqWFipqEcqGGPaqkcqGHxdaTcqWGqbaudaWgaaWcbaGae8hWdahabeaakiabcIcaOiabdghaXnaaBaaaleaacqaIWaamaeqaaOGaeiykaKIaey41aqRaemiuaa1aaSbaaSqaaiabdYeamjabcYcaSiabdghaXnaaBaaameaacqaIWaamaeqaaaWcbeaakiabcIcaOiabdwgaLnaaBaaaleaacqaIWaamaeqaaOGaey4kaSIaeGymaeJaeiykaKIaey41aqlabaWaaebmaeaacqWGqbaudaWgaaWcbaGaem4ta8eabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGUbGBa0Gaey4dIunakiabcIcaOiabdIfayjabcUfaBjabdkgaInaaBaaaleaacqWGPbqAaeqaaOGaeiilaWIaemyzau2aaSbaaSqaaiabdMgaPbqabaGccqGGDbqxcqGG8baFcqWGXbqCdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiab=H8a5jabcMcaPiabgEna0kabdcfaqnaaBaaaleaacqWGsbGuaeqaaOGaeiikaGIaemyCae3aaSbaaSqaaiabdMgaPbqabaGccqGG8baFcqWGXbqCdaWgaaWcbaGaemyAaKMaeyOeI0IaeGymaedabeaakiabcMcaPiabgEna0kabdcfaqnaaBaaaleaacqWGmbatcqGGSaalcqWGXbqCdaWgaaadbaGaemyAaKgabeaaaSqabaGccqGGOaakcqWGLbqzdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabdkgaInaaBaaaleaacqWGPbqAaeqaaOGaey4kaSIaeGymaeJaeiykaKIaeiOla4caaaaa@A1B6@ Each interval from b to e begins and ends with a signal covering a fixed length window, Sigb and Sige respectively. The donor signal window width Wdon, is set to 9 for all donor types (SD, CD, MD1, MD2, MDN, SD-IR, MD1-IR, MD2-IR, MDN-IR). When the window covers columns X [k, k + 8] from k to k + 8 the consensus splice site is in subsequence S0 [k + 3, k + 4] = GT. The acceptor window width Wacc, is set to 24 for all acceptor types (SA, CA, MA1, MA2, MAN, SA-IR, MA1-IR, MA2-IR, MAN-IR), covering columns X[k - 23, k] from k-23 to k with consensus splice site in subsequence S0[k - 3, k - 2] = AG. The beginning and ending sequence signals set the window parameter WSig to 0 since non splice site signals are not explicitly modeled. State q emitting columns X[b, e] from b to e models the downstream signal Sige but excludes the upstream signal Sigb. For example, when q = Multiple Donor 1, q outputs columns between two donor sites, Sigb = MD1 and Sige = MD2. The exon interval is scored from b to e-Wdon - 1 inclusive and the donor columns are scored from e - Wdon to e inclusive. The upstream donor site window MD1 spans the interval b - Wdon - 1 to b - 1 and is scored in the previous state. The probability of a state emitting a series of columns becomes: P O ( X [ b , e ] | q , ψ ) = ∏ k = b e P ( X [ k ] | S e l e c t M o d e l ( X , e , k , z , q , ψ ) ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaGaem4ta8eabeaakiabcIcaOiabdIfayjabcUfaBjabdkgaIjabcYcaSiabdwgaLjabc2faDjabcYha8jabdghaXjabcYcaSGGaciab=H8a5jabcMcaPiabg2da9maarahabaGaemiuaaLaeiikaGIaemiwaGLaei4waSLaem4AaSMaeiyxa0LaeiiFaWNaem4uamLaemyzauMaemiBaWMaemyzauMaem4yamMaemiDaqNaemyta0Kaem4Ba8MaemizaqMaemyzauMaemiBaWMaeiikaGIaemiwaGLaeiilaWIaemyzauMaeiilaWIaem4AaSMaeiilaWIaemOEaONaeiilaWIaemyCaeNaeiilaWIae8hYdKNaeiykaKIaeiykaKcaleaacqWGRbWAcqGH9aqpcqWGIbGyaeaacqWGLbqza0Gaey4dIunaaaa@6C9F@ The probability of emitting each column in the alignment is defined by a sequence model returned by SelectModel(X, e, k, z, q, ψ). The current position k in alignment X, the end position of the scored interval (e), current state q, protein coding phase z, and phylogenetic parameters ψ determine the choice of sequence models. If q is an exon state and k is within the coding region, the coding phase z is 0, 1 or 2 and -1 otherwise. When q is an exon state and k is outside the coding region, an untranslated exon region is implied. The sequence models are divided into three "template" categories, MSige MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGnbqtdaWgaaWcbaGaem4uamLaemyAaKMaem4zaC2aaSbaaWqaaiabdwgaLbqabaaaleqaaaaa@3367@, Mcodon, and Mnon-coding and an instance of one of these three types is returned by the function: SelectModel (X, e, k, z, q, ψ) = { M S i g e ( X , k , e , ψ ) k > e − W S i g e M c o d o n ( X , k , z , ψ ) z ≠ − 1 M n o n − c o d i n g ( X , k , ψ ) o t h e r w i s e } MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGadeqaauaabeqadiaaaeaacqWGnbqtdaWgaaWcbaGaem4uamLaemyAaKMaem4zaC2aaSbaaWqaaiabdwgaLbqabaaaleqaaOGaeiikaGIaemiwaGLaeiilaWIaem4AaSMaeiilaWIaemyzauMaeiilaWccciGae8hYdKNaeiykaKcabaGaem4AaSMaeyOpa4JaemyzauMaeyOeI0Iaem4vaC1aaSbaaSqaaiabdofatjabdMgaPjabdEgaNnaaBaaameaacqWGLbqzaeqaaaWcbeaaaOqaaiabd2eannaaBaaaleaacqWGJbWycqWGVbWBcqWGKbazcqWGVbWBcqWGUbGBaeqaaOGaeiikaGIaemiwaGLaeiilaWIaem4AaSMaeiilaWIaemOEaONaeiilaWIae8hYdKNaeiykaKcabaGaemOEaONaeyiyIKRaeyOeI0IaeGymaedabaGaemyta00aaSbaaSqaaiabd6gaUjabd+gaVjabd6gaUjabgkHiTiabdogaJjabd+gaVjabdsgaKjabdMgaPjabd6gaUjabdEgaNbqabaGccqGGOaakcqWGybawcqGGSaalcqWGRbWAcqGGSaalcqWFipqEcqGGPaqkaeaacqWGVbWBcqWG0baDcqWGObaAcqWGLbqzcqWGYbGCcqWG3bWDcqWGPbqAcqWGZbWCcqWGLbqzaaaacaGL7bGaayzFaaaaaa@85B2@ In theory, each state could maintain separate sequence models. For example, the "Internal First Exon of IR" state could model codon usage separately from the "Internal Last Exon of IR" state. In practice, this results in far too many parameters to estimate given training data sizes. Instead the states are tied to 10 candidate sequence models returned by SelectModel. The models are listed with the analogous Markov models commonly used in single isoform ab initio gene finders. • Mnon-coding (X, k, ψ). 3rd order homogeneous Markov model: P(S0[k]|S0[k - 3, k - 1])           - MAUTR(X, k, ψ) – alternative 5'/3' untranslated region (AUTR)           - MCUTR (X, k, ψ) – constitutive 5'/3' untranslated region (CUTR)           - MAI (X, k, ψ) – alternative intron (AI)           - MCI (X, k, ψ) – constitutive intron (CI) • Mcodon (X, k, z, ψ). 3rd order inhomogeneous 3-periodic Markov model: Pz (S0[k]|S0[k - 3, k - 1])           - MAE (X, k, z, ψ) – alternative exon (AE)           - MCE (X, k, z, ψ) – constitutive exon (CE) • Mdon (X, k, e, ψ). 1st order inhomogeneous Markov model (WAM): P9-(e - k) (S0[k]|S0[k - 1])           - MSD (X, k, e, ψ) – constitutive/single donor (SD)           - MAD (X, k, e, ψ) – alternative donor (AD) (covers all alternative donor types) • Macc (X, k, e, ψ). 1st order inhomogeneous Markov model (WAM): P24-(e - k) (S[k]|S[k - 1])           - MSA (X, k, e, ψ)- constitutive/single acceptor (SA)           - MAA (X, k, e, ψ)- alternative acceptor (AA) (covers all alternative acceptor types)