From Markov models to evolutionary models
The key difference between our implementation and a single isoform ab initio gene finder is two fold: 1) separate models are maintained for the two splicing types: alternative and constitutive, and 2) the nucleotide dependencies are modeled using an evolutionary framework. The choice to separate sequence models for the two splicing types is motivated by the previous work explicitly classifying alternatively spliced exons and by the hypothesis that a splice site can be activated or deactivated with proximal splicing factors binding to the pre-mRNA sequence to interact directly (or indirectly) with the Spliceosome. The presence of splicing factors in conjunction with the characteristics of the splice site is expected to determine splice site usage. There is growing evidence that many alternative splicing events follow this model [30,31].
Each sequence model estimates the probability of emitting column X[k] using a phylogenetic tree. Figure 4 shows a schematic of the phylogenetic tree for four Drosophila species used in testing. The goal is to compute the probability of the observed column having evolved from a common ancestral sequence. For the tree in Figure 4, assume the ancestral base at the root ("Ancestor 1") to be A and the descendant node ("Ancestor 2") to be C. The probability of A evolving to C is computed using a nucleotide substitution model. The HYK model [32] was chosen for the use of three features: distinguished transition/transversion mutation events (assumed to be a fixed parameter), a nucleotide equilibrium model, and the evolutionary time interval defined by the tree branch length. The probability of emitting column X[k] is found by computing, in linear time with respect to the number of nodes in the tree, the probability of all possible ancestral sequences having evolved into the observed column using the Felsenstein phylogenetic tree scoring procedure [33].
Figure 4  Phylogenetic tree for four species of Drosophila. Each branch i has a branch length of bi. The nucleotide equilibrium parameters of the HYK model are naturally suited to incorporate the nucleotide bias found in the different sequence models (e.g. donor, codon, etc.). With a multiple sequence alignment as input, an intuitive extension to the ab initio Markov model is to use the preceding o bases from each input sequence to estimate the likelihood of the current nucleotide (where o is the order of the Markov model). For example, in Figure 4, estimating the probability of nucleotide C at "Ancestor 2" having evolved from nucleotide A at "Ancestor 1", should reflect the nucleotide equilibrium of D. melanogaster and D. simulans, given the o previous bases in the input alignment for the two species. Similarly, estimating the probability of the root ancestral base being A (Ancestor 1) should reflect the nucleotide equilibrium among all four species given the o previous nucleotides in the input alignment from all four species. If dv is the number of descendants at node v, the number of parameters is ∑v4dv+o+1 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeqaqaaiabisda0maaCaaaleqabaGaemizaq2aaSbaaWqaaiabdAha2bqabaWccqGHRaWkcqWGVbWBcqGHRaWkcqaIXaqmaaaabaGaemODayhabeqdcqGHris5aaaa@3835@ (v enumerating over all nodes in the tree) leaving too many parameters to reliably estimate, given the current limits on training sizes.
Frequency counts obtained from each organism independently, reduce the parameter size for each sequence model to (m + 1) × 4o + 1 (where m + 1 is the number of organisms). Let S′0,…,S′m′ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqbamaaBaaaleaacqaIWaamaeqaaOGaeiilaWIaeSOjGSKaeiilaWIafm4uamLbauaadaWgaaWcbaGafmyBa0Mbauaaaeqaaaaa@34C3@ be the sequences descendant from node v. If c(S′i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqbamaaBaaaleaacqWGPbqAaeqaaaaa@2F6E@[k - o, k - 1], n) returns the number of times each nucleotide n ∈ {A, C, G, T} was observed to follow the substring S′i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGtbWugaqbamaaBaaaleaacqWGPbqAaeqaaaaa@2F6E@[k - o, k - 1] at position k in a training alignment, the nucleotide equilibrium at node v for each nucleotide n is:
∑ i = 0  m ′   c ( S ′  i    [ k − o , k − 1 ] , n ) / ( ∑ i = 0  m ′   ∑ n ′  ∈ { A , C , G , T }   c ( S ′  i  [ k − o , k − 1 ] , n ′  ) )       MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWbqaaiabdogaJjabcIcaOiqbdofatzaafaWaaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKMaeyypa0JaeGimaadabaGafmyBa0Mbauaaa0GaeyyeIuoakiabcUfaBjabdUgaRjabgkHiTiabd+gaVjabcYcaSiabdUgaRjabgkHiTiabigdaXiabc2faDjabcYcaSiabd6gaUjabcMcaPiabc+caViabcIcaOmaaqahabaWaaabuaeaacqWGJbWycqGGOaakcuWGtbWugaqbamaaBaaaleaacqWGPbqAaeqaaOGaei4waSLaem4AaSMaeyOeI0Iaem4Ba8MaeiilaWIaem4AaSMaeyOeI0IaeGymaeJaeiyxa0LaeiilaWIafmOBa4MbauaacqGGPaqkcqGGPaqkaSqaaiqbd6gaUzaafaGaeyicI4Saei4EaSNaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiilaWIaemivaqLaeiyFa0habeqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGimaadabaGafmyBa0Mbauaaa0GaeyyeIuoaaaa@7121@
Tree branch lengths are assumed to be fixed, but functional sequence elements are expected to exhibit a slower rate of substitution. Each sequence model maintains substitution rates to either expand the branch lengths of the tree (for rates greater than 1) or contract the branch lengths. Longer branch lengths have the effect of allowing mutations to accumulate in a column without incurring a scoring penalty, whereas, shorter branch lengths reward perfectly preserved columns. The codon, intron, and untranslated region states use two substitution rates, one rate for when the input column is conserved (no mutations observed) and a second rate when the column is not conserved. If a mutation is observed in a codon where the encoded amino acid is preserved, the higher substitution rate is selected to better accept the mutation. In the case of splice sites, each base is assumed to be subject to selective pressure and a single rate is used. An optimal parse of multiple sequence alignment X is found taking the log ratio of a state emitting the columns in X versus an equivalent model assuming all nucleotides are equally probable (the "background" state). Using a dynamic programming matrix D(j, q) initialized to
l o g ( P O  ( X [ 0 , j ] | q , ψ ) × P π  ( q ) × P L , q   ( j + 1 )  P O  ( X [ 0 , j ] | b a c k g r o u n d , ψ )   )   MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaieGacqWFSbaBcqWFVbWBcqWFNbWzcqGGOaakdaWcaaqaaiabdcfaqnaaBaaaleaacqWGpbWtaeqaaOGaeiikaGIaemiwaGLaei4waSLaeGimaaJaeiilaWIaemOAaOMaeiyxa0LaeiiFaWNaemyCaeNaeiilaWccciGae4hYdKNaeiykaKIaey41aqRaemiuaa1aaSbaaSqaaiab+b8aWbqabaGccqGGOaakcqWGXbqCcqGGPaqkcqGHxdaTcqWGqbaudaWgaaWcbaGaemitaWKaeiilaWIaemyCaehabeaakiabcIcaOiabdQgaQjabgUcaRiabigdaXiabcMcaPaqaaiabdcfaqnaaBaaaleaacqWGpbWtaeqaaOGaeiikaGIaemiwaGLaei4waSLaeGimaaJaeiilaWIaemOAaOMaeiyxa0LaeiiFaWNaemOyaiMaemyyaeMaem4yamMaem4AaSMaem4zaCMaemOCaiNaem4Ba8MaemyDauNaemOBa4MaemizaqMaeiilaWIae4hYdKNaeiykaKcaaiabcMcaPaaa@7432@
entries for each nucleotide j and state q are assigned a value:
D ( j , q ) =    m a x i , q ′    l o g ( P O  ( X [ i , j ] | q , ψ ) × P T  ( q | q ′  ) × P L , q   ( j − i + 1 )  P O  ( X [ i , j ] | b a c k g r o u n d , ψ )   + D ( i , q ′  )       MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeGabaaabaGaemiraqKaeiikaGIaemOAaOMaeiilaWIaemyCaeNaeiykaKIaeyypa0dabaacbiGae8xBa0Mae8xyaeMae8hEaG3aaSbaaSqaaiabdMgaPjabcYcaSiqbdghaXzaafaaabeaakiab=XgaSjab=9gaVjab=DgaNjabcIcaOmaalaaabaGaemiuaa1aaSbaaSqaaiabd+eapbqabaGccqGGOaakcqWGybawcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGDbqxcqGG8baFcqWGXbqCcqGGSaaliiGacqGFipqEcqGGPaqkcqGHxdaTcqWGqbaudaWgaaWcbaGaemivaqfabeaakiabcIcaOiabdghaXjabcYha8jqbdghaXzaafaGaeiykaKIaey41aqRaemiuaa1aaSbaaSqaaiabdYeamjabcYcaSiabdghaXbqabaGccqGGOaakcqWGQbGAcqGHsislcqWGPbqAcqGHRaWkcqaIXaqmcqGGPaqkaeaacqWGqbaudaWgaaWcbaGaem4ta8eabeaakiabcIcaOiabdIfayjabcUfaBjabdMgaPjabcYcaSiabdQgaQjabc2faDjabcYha8jabdkgaIjabdggaHjabdogaJjabdUgaRjabdEgaNjabdkhaYjabd+gaVjabdwha1jabd6gaUjabdsgaKjabcYcaSiab+H8a5jabcMcaPaaacqGHRaWkcqWGebarcqGGOaakcqWGPbqAcqGGSaalcuWGXbqCgaqbaiabcMcaPaaaaaa@8FC3@
Exons are recovered from the parse ending at the highest scoring entry maxq D(N - 1, q) where N is the length of X. The runtime of the algorithm is O(|Q|2 × N2) where |Q| is the number of states in the model. The PGHMM can easily be transformed to a single species alternative exon predictor by assuming a single input sequence and single node phylogenetic tree. The evolutionary models reduce to the single sequence Markov model equivalents and are used to measure the impact of sequence conservation on prediction performance.