Id |
Subject |
Object |
Predicate |
Lexical cue |
T1 |
290-473 |
Epistemic_statement |
denotes |
These profiles can be used to define a very simple, computationally efficient, alignment-free, distance measure that reflects the evolutionary relationships between genomic sequences. |
T2 |
707-828 |
Epistemic_statement |
denotes |
A great volume of available genomic data has made possible analysis of large sets of organisms at the whole genome scale. |
T3 |
829-1066 |
Epistemic_statement |
denotes |
However, given that most genomes contain millions to billions of nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computational complexity (Vinga and Almeida, 2003). |
T4 |
1533-1692 |
Epistemic_statement |
denotes |
One of the aims of graphical representation is to visually identify regions of interest or the distribution of bases along the sequence (Zhang and Zhang, 1994). |
T5 |
1693-1788 |
Epistemic_statement |
denotes |
The second approach has been proposed to characterize the DNA sequence (Akhtar et al., 2007). |
T6 |
2301-2389 |
Epistemic_statement |
denotes |
Any DNA sequence can be converted into a unique numerical sequence with the same length. |
T7 |
2750-2838 |
Epistemic_statement |
denotes |
However, Akhtar and Epps (2008) showed that it has poor accuracy in exon prediction. |
T8 |
3268-3382 |
Epistemic_statement |
denotes |
In fact, we can extract more information about a genome sequence from its inter-nucleotide distance sequences. |
T9 |
3524-3859 |
Epistemic_statement |
denotes |
For the inter-nucleotide distance sequence of nucleotide A, the numbers of nucleotides C, G and T follow a multinomial distribution, given that the inter-nucleotide distance is k. This multinomial distribution will be called the conditional multinomial distribution. |
T10 |
3860-4009 |
Epistemic_statement |
denotes |
The relative error vector derived from the conditional multinomial distribution then can be used as a genomic signature that identifies each species. |
T11 |
4101-4273 |
Epistemic_statement |
denotes |
In fact, we propose a new evolutionary information representation, complete multinomial composition vector (CMCV), by using a collection of multinomial composition vectors. |
T12 |
4842-4956 |
Epistemic_statement |
denotes |
A DNA sequence of length n can be viewed as a linear sequence of n symbols from a finite alphabet $N = \{A, C, G, T\}$. |
T13 |
5177-5384 |
Epistemic_statement |
denotes |
, $s_n$, $GIN(m) = k$, where $k$ is the minimum value of $i$ such that $s_m = s_{m+i}$ and $m + i \le n$; otherwise $k = n - m$. As an example, the GIN for the short DNA fragment AGTTCTACCAGC is $GIN = 6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0$. |
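The definition of GIN above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `global_inter_nucleotide_distances` is hypothetical, and positions are 0-indexed internally, so the fallback value `n - m` of the 1-indexed definition becomes `(n - 1) - m`.

```python
def global_inter_nucleotide_distances(seq):
    """Return the list [GIN(1), ..., GIN(n)] for the string seq.

    GIN(m) is the distance from position m to the next occurrence of the
    same symbol; if the symbol does not occur again, GIN(m) = n - m
    (1-indexed), i.e. the distance to the end of the sequence.
    """
    n = len(seq)
    gin = [0] * n
    next_pos = {}  # symbol -> index of its nearest occurrence to the right
    # Scan right to left so the nearest later occurrence is always known.
    for m in range(n - 1, -1, -1):
        symbol = seq[m]
        if symbol in next_pos:
            gin[m] = next_pos[symbol] - m
        else:
            gin[m] = (n - 1) - m  # equals n - m in 1-indexed notation
        next_pos[symbol] = m
    return gin


print(global_inter_nucleotide_distances("AGTTCTACCAGC"))
# → [6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0]
```

The single right-to-left pass runs in time linear in the sequence length, which matters for genome-scale input.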
T14 |
5385-5617 |
Epistemic_statement |
denotes |
From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence for each nucleotide $x \in N$. The four inter-nucleotide distance sequences for the same short DNA segment used previously are given as |
T15 |
5618-5716 |
Epistemic_statement |
denotes |
A similar inter-nucleotide distance sequence for the nucleotide $x \in N$ was defined by Afreixo et al. |
T16 |
6349-6471 |
Epistemic_statement |
denotes |
In fact, we can count the number of occurrences of each nucleotide in the genome sequence from its inter-nucleotide distance sequence. |
T17 |
6472-6602 |
Epistemic_statement |
denotes |
Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences. |
T18 |
6884-7119 |
Epistemic_statement |
denotes |
If the nucleotide sequence were generated by an independent and identically distributed (i.i.d.) random process, the numbers of nucleotides C, G and T between two nearest occurrences of nucleotide A would follow a multinomial distribution. |
T19 |
7548-7640 |
Epistemic_statement |
denotes |
The nucleotide occurrence probabilities $p_{G|A}$ and $p_{T|A}$ can be obtained in a similar way. |
T20 |
7641-8006 |
Epistemic_statement |
denotes |
The term reference conditional multinomial distribution, applied to a DNA sequence, describes the distribution that the numbers of nucleotides C, G and T would follow, given the inter-nucleotide distance sequence of nucleotide A, if its nucleotides were determined randomly and independently of each other, with probabilities equal to the relative conditional frequencies. |
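The reference distribution described above can be illustrated with a short Python sketch. Two points are assumptions made here for illustration and are not stated in the excerpt: the conditional probability of a base given A is taken to be its overall frequency renormalised over the three non-A bases, and the counts are taken over the positions strictly between two consecutive A's. The function names are hypothetical.

```python
from math import factorial


def conditional_probs(freqs, x):
    """Renormalise base frequencies over the three bases other than x.

    Assumption: p_{b|x} = p_b / (1 - p_x) for every base b != x.
    """
    rest = 1.0 - freqs[x]
    return {b: p / rest for b, p in freqs.items() if b != x}


def reference_multinomial_pmf(counts, cond_probs):
    """Probability of observing `counts` (base -> count) under a multinomial
    distribution with the given conditional probabilities."""
    total = sum(counts.values())
    coeff = factorial(total)
    prob = 1.0
    for base, c in counts.items():
        coeff //= factorial(c)          # builds total! / (c_1! c_2! c_3!)
        prob *= cond_probs[base] ** c
    return coeff * prob


# Uniform base frequencies, reference nucleotide A, and four positions
# between two consecutive A's (an assumed reading of distance k = 5).
probs = conditional_probs({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, "A")
p = reference_multinomial_pmf({"C": 2, "G": 1, "T": 1}, probs)
```

With uniform frequencies each conditional probability is 1/3, so the pattern (2, 1, 1) has probability 12 · (1/3)^4 = 12/81 under this sketch.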
T21 |
8007-8161 |
Epistemic_statement |
denotes |
From the perspective of molecular evolution, conditional multinomial distribution may reflect both the results of random mutation and selective evolution. |
T22 |
8280-8351 |
Epistemic_statement |
denotes |
Many neutral mutations may remain and play the role of a random background. |
T23 |
8352-8549 |
Epistemic_statement |
denotes |
One should subtract the random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006). |
T24 |
8550-8783 |
Epistemic_statement |
denotes |
In this work, we propose a new conditional multinomial distribution representation that reveals the relative difference between a biological sequence and a sequence generated by an independent random process, in order to remove the random background. |
T25 |
8784-8953 |
Epistemic_statement |
denotes |
For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a given nucleotide $x \in \{A, C, G, T\}$. |
T26 |
8954-9089 |
Epistemic_statement |
denotes |
For a given pattern a from the conditional multinomial distribution, we can define the multinomial composition value $p(a)$ as follows: |
T27 |
9090-9294 |
Epistemic_statement |
denotes |
where $f'_x(a|k)$ is the measured relative frequency of the pattern a, and $f_x(a|k)$, the relative frequency of the pattern a under the reference conditional multinomial distribution, can be computed by (1). |
T28 |
9295-9422 |
Epistemic_statement |
denotes |
All these multinomial composition values can be sorted in some order to form a vector $V_x(S|k) = (p_x(a_1|k), p_x(a_2|k), \ldots$ |
T29 |
9427-9525 |
Epistemic_statement |
denotes |
$\ldots, p_x(a_m|k))$ for the genome S, where m denotes the total number of patterns under consideration. |
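The formula for the multinomial composition value is not reproduced in this excerpt. Since the text elsewhere describes the signature as a relative error vector between the measured and the reference distribution, the sketch below assumes the value is (measured − reference) / reference, set to 0 when the reference frequency is 0. Both that formula and the function names are assumptions for illustration, not the authors' stated definition.

```python
def relative_error(measured, reference):
    """Assumed composition value: relative deviation of the measured
    frequency from the reference frequency (0 if the reference is 0)."""
    if reference == 0.0:
        return 0.0
    return (measured - reference) / reference


def composition_vector(measured_freqs, reference_freqs, patterns):
    """Collect the composition values of all patterns, in the fixed order
    given by `patterns`, into one vector (a Python list)."""
    return [
        relative_error(measured_freqs.get(a, 0.0), reference_freqs.get(a, 0.0))
        for a in patterns
    ]


# Toy frequencies for three hypothetical patterns.
patterns = ["a1", "a2", "a3"]
measured = {"a1": 0.30, "a2": 0.10}
reference = {"a1": 0.20, "a2": 0.10, "a3": 0.0}
vector = composition_vector(measured, reference, patterns)
```

Fixing the pattern order once and reusing it for every genome keeps the resulting vectors comparable component by component.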
T30 |
9814-9938 |
Epistemic_statement |
denotes |
For a single fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information. |
T31 |
10055-10154 |
Epistemic_statement |
denotes |
, $V(S|k)$, denoted by CMCV(S, k), with the intention of using as much genomic information as possible. |
T32 |
10406-10588 |
Epistemic_statement |
denotes |
In the case of the inter-nucleotide distance sequence $CIN_A$ ($k = 5$), we first convert the possible values of $(N_{C|A}, N_{G|A}, N_{T|A})$ into a one-dimensional value in alphabetical order. |
T33 |
12237-12359 |
Epistemic_statement |
denotes |
The signature can be used in applications where evolutionary relationships need to be deduced from large genomic sequences. |
T34 |
12360-12469 |
Epistemic_statement |
denotes |
Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment. |
T35 |
12604-12814 |
Epistemic_statement |
denotes |
The outbreak of atypical pneumonia, referred to as severe acute respiratory syndrome (SARS), in 2003 drew more attention to the relationship between the SARS coronaviruses (SARS-CoVs) and the other coronaviruses. |
T36 |
13076-13158 |
Epistemic_statement |
denotes |
Generally, coronaviruses can be classified into three groups according to serotype. |
T37 |
13333-13845 |
Epistemic_statement |
denotes |
However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any group and form a new group (Marra et al., 2003; Rota et al., 2003); a maximum likelihood tree built from a fragment of the spike protein favored clustering SARS-CoVs with group II (Liò and Goldman, 2004); while an information-based method, which makes use of whole genome sequences, indicated that SARS-CoVs are close to group I rather than forming a new group (Yang et al., 2005). |
T38 |
14375-14399 |
Epistemic_statement |
denotes |
As can be seen from Fig. |
T39 |
14400-14554 |
Epistemic_statement |
denotes |
4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV). |
T40 |
15434-15667 |
Epistemic_statement |
denotes |
However, many patterns will not occur in the conditional multinomial distribution with a large value of k. From the view of information theory, some information may be lost and noise will dominate if a large value of k is considered. |
T41 |
15668-15819 |
Epistemic_statement |
denotes |
To determine the upper bound of the value of k, we will introduce a scoring scheme to estimate how important a conditional multinomial distribution is. |
T42 |
15820-16012 |
Epistemic_statement |
denotes |
$\chi^2$-test scoring scheme: For a fixed k, let a be a pattern in the conditional multinomial distribution, with multinomial composition value $p(a, i|k)$ in genome i (which can be found in the k-MCV). |
T43 |
16449-16539 |
Epistemic_statement |
denotes |
Thus, we may define a score for the conditional multinomial distribution with a fixed k as |
T44 |
16540-16865 |
Epistemic_statement |
denotes |
where the first sum is over all patterns of the conditional multinomial distribution with a fixed k. We believe that, by considerably extending the basic pattern-counting idea and studying the underlying distribution, we are able to discover unusual patterns and automatically distinguish their roles in shaping evolution. |
T45 |
16866-17080 |
Epistemic_statement |
denotes |
In this case, the k-MCV of the conditional multinomial distribution with the largest score might be considered the most representative for the species, rather than an abnormal outlier in a purely statistical sense. |
T46 |
17436-17624 |
Epistemic_statement |
denotes |
Moreover, we can define the relative ratio of information carried by a given conditional multinomial distribution with a fixed k as the ratio of the k-MCV to the CMCV that contains that k-MCV. |
T47 |
17625-17718 |
Epistemic_statement |
denotes |
From Table 3, we can clearly see that the relative ratio of the 7-MCV is the maximum, 839/1408. |
T48 |
17923-18280 |
Epistemic_statement |
denotes |
DNA sequence databases have accumulated much data on biological evolution over billions of years; consequently, novel concepts and methods are urgently needed to reveal the biological functions encoded in DNA sequence information and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of illness. |
T49 |
18451-18758 |
Epistemic_statement |
denotes |
From the conditional multinomial distribution profiles of nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species. |
T50 |
19046-19135 |
Epistemic_statement |
denotes |
The phylogenetic tree can be obtained from the distance matrices using the UPGMA method. |
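For readers unfamiliar with UPGMA, the following is a minimal pure-Python sketch of average-linkage clustering on a distance matrix. It returns only the tree topology as nested tuples and omits branch lengths. The function name `upgma` and the toy distance matrix are illustrative assumptions; the distance measure used in the paper is not reproduced here.

```python
def upgma(labels, dist):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix.

    labels: list of taxon names; dist: list of lists with dist[i][j] the
    distance between labels[i] and labels[j].
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pairwise distances between live clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy example: A and B are close to each other, C is far from both.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

In the toy example A and B are joined first and C attaches last, which mirrors how closely related genomes cluster before distant ones.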
T51 |
19389-19480 |
Epistemic_statement |
denotes |
4 also indicates that SARS-CoVs are not closely related to any groups and form a new group. |
T52 |
19637-19813 |
Epistemic_statement |
denotes |
Thus, this view can be used to guide the development of more powerful measures for sequence comparison, with possible future improvements based on the correlation structure of DNA. |