PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T1	290-473	Epistemic_statement	denotes	These profiles can be used to define a very simple, computationally efficient, alignment-free, distance measure that reflects the evolutionary relationships between genomic sequences.
T2	707-828	Epistemic_statement	denotes	A great volume of available genomic data has made possible analysis of large sets of organisms at the whole genome scale.
T3	829-1066	Epistemic_statement	denotes	However, given that most genomes contain millions to billions of nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computational complexity (Vinga and Almeida, 2003).
T4	1533-1692	Epistemic_statement	denotes	One of the aims of graphical representation is to identify regions of interest or the distribution of bases along the sequence visually (Zhang and Zhang, 1994).
T5	1693-1788	Epistemic_statement	denotes	The second approach has been proposed to characterize the DNA sequence (Akhtar et al., 2007).
T6	2301-2389	Epistemic_statement	denotes	Any DNA sequence can be converted into a unique numerical sequence with the same length.
T7	2750-2838	Epistemic_statement	denotes	However, Akhtar and Epps (2008) proved that it has poor accuracy in the exon prediction.
T8	3268-3382	Epistemic_statement	denotes	In fact, we can extract more information about the genome sequence from their inter-nucleotide distance sequences.
T9	3524-3859	Epistemic_statement	denotes	In case of the inter-nucleotide distance sequence about nucleotide A, the number of the nucleotide C, the number of the nucleotide G and the number of the nucleotide T would follow multinomial distribution given that inter-nucleotide distance is k. This multinomial distribution will be called the conditional multinomial distribution.
T10	3860-4009	Epistemic_statement	denotes	The relative error vector derived from the conditional multinomial distribution then can be used as a genomic signature that identifies each species.
T11	4101-4273	Epistemic_statement	denotes	In fact, we propose a new evolutionary information representation, complete multinomial composition vector (CMCV), by using a collection of multinomial composition vectors.
T12	4842-4956	Epistemic_statement	denotes	A DNA sequence, of length n, can be viewed as a linear sequence of n symbols from a finite alphabet N = {A, C, G, T}.
T13	5177-5384	Epistemic_statement	denotes	, s_n, GIN(m) = k, where k = min value of i such that s_m = s_{m+i}, m + i ≤ n, else k = n − m. We show below, as an example, the GIN for a short DNA fragment AGTTCTACCAGC is given as GIN = 6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0:
T14	5385-5617	Epistemic_statement	denotes	From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence to the nucleotide x ∈ N. Four inter-nucleotide distance sequences for the same short DNA segment used previously were given as
T15	5618-5716	Epistemic_statement	denotes	A similar inter-nucleotide distance sequence to the nucleotide x ∈ N was defined by Afreixo et al.
T16	6349-6471	Epistemic_statement	denotes	In fact, we can count the number of each nucleotide about the genome sequence from its inter-nucleotide distance sequence.
T17	6472-6602	Epistemic_statement	denotes	Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences.
T18	6884-7119	Epistemic_statement	denotes	If the nucleotide sequence was generated by an independent and identically distributed (i.i.d) random process, the number of nucleotide C, G and nucleotide T between the nearest two nucleotide A would follow a multinomial distribution.
T19	7548-7640	Epistemic_statement	denotes	The nucleotide occurrence probability p_{G|A} and p_{T|A} can be obtained in the similar method.
T20	7641-8006	Epistemic_statement	denotes	The term reference conditional multinomial distribution, applied to a DNA sequence, describes the number of nucleotide C, G and nucleotide T would follow that the inter-nucleotide distance sequence about nucleotide A is given, if its nucleotides are randomly determined, with probabilities equal to the relative conditional frequencies, independently of each other.
T21	8007-8161	Epistemic_statement	denotes	From the perspective of molecular evolution, conditional multinomial distribution may reflect both the results of random mutation and selective evolution.
T22	8280-8351	Epistemic_statement	denotes	Many neutral mutations may remain and play a role of random background.
T23	8352-8549	Epistemic_statement	denotes	One should subtract the random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006).
T24	8550-8783	Epistemic_statement	denotes	In this work, we propose a new conditional multinomial distribution representation which reveals the relative difference of biological sequence from sequence generated by an independent random process to remove the random background.
T25	8784-8953	Epistemic_statement	denotes	For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a certain nucleotide x ∈ {A, C, G, T}.
T26	8954-9089	Epistemic_statement	denotes	For a certain pattern a from the conditional multinomial distribution, we can define the multinomial composition value p(a) as follows:
T27	9090-9294	Epistemic_statement	denotes	where the f_x'(a|k) is the measured relative frequency of the pattern a, the relative frequency of the pattern a from the reference conditional multinomial distribution f_x(a|k) can be computed by (1).
T28	9295-9422	Epistemic_statement	denotes	All these multinomial composition values can be sorted in some order to form a vector V_x(S|k) = (p_x(a_1|k), p_x(a_2|k), .
T29	9427-9525	Epistemic_statement	denotes	, p_x(a_m|k)) for the genome S, where m denotes the total number of patterns under consideration.
T30	9814-9938	Epistemic_statement	denotes	For only a fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information.
T31	10055-10154	Epistemic_statement	denotes	, V(S|k), denoted by CMCV(S, k), with the intention to use as much genomic information as possible.
T32	10406-10588	Epistemic_statement	denotes	In the case of inter-nucleotide distance sequence CIN_A (k = 5), we firstly convert the possible value of the (N_{C|A}, N_{G|A}, N_{T|A}) into a one-dimensional value by the order of alphabet.
T33	12237-12359	Epistemic_statement	denotes	The signature can be used in application where evolutionary relationships need to be deduced using large genomic sequence.
T34	12360-12469	Epistemic_statement	denotes	Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment.
T35	12604-12814	Epistemic_statement	denotes	The outbreak of atypical pneumonia referred to as severe acute respiratory syndrome coronavirus (SARS-CoVs) in 2003 drew more attention to the relationship between the SARS-CoVs and the other coronaviruses.
T36	13076-13158	Epistemic_statement	denotes	Generally, coronavirus can be classified into three groups according to serotypes.
T37	13333-13845	Epistemic_statement	denotes	However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any groups and form a new group (Marra et al., 2003; Rota et al., 2003); maximum likelihood tree built from a fragment of the spike protein preferred SARS-CoVs clustering with group II (Liò and Goldman, 2004); while an information-based method, which makes use of the whole genome sequences, indicated that SARS-CoVs are close to the group I rather than from a new group (Yang et al., 2005).
T38	14375-14399	Epistemic_statement	denotes	As can be seen from Fig.
T39	14400-14554	Epistemic_statement	denotes	4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV).
T40	15434-15667	Epistemic_statement	denotes	However, many patterns will not occur in the conditional multinomial distribution with a large value of k. From the view of information theory, some information may be lost and noise will dominate if a large value of k is considered.
T41	15668-15819	Epistemic_statement	denotes	To determine the upper bound of the value of k, we will introduce a scoring scheme to estimate how important a conditional multinomial distribution is.
T42	15820-16012	Epistemic_statement	denotes	χ²-test scoring scheme: For a fixed k, let a be a pattern in the conditional multinomial distribution, with its multinomial composition value p(a, i|k) in genome i (could be found in k-MCV).
T43	16449-16539	Epistemic_statement	denotes	Thus, we may define a score for the conditional multinomial distribution with a fixed k as
T44	16540-16865	Epistemic_statement	denotes	where the first sum is for all patterns of the conditional multinomial distribution with a fixed k. We believe by considerably extending the basic pattern counting idea and thus studying their underlying distribution, we are able to discover unusual patterns to automatically distinguish their roles in shaping the evolution.
T45	16866-17080	Epistemic_statement	denotes	In this case, the largest score of conditional multinomial distribution, the k-MCV might be considered as the most representative for the species, while not as abnormal outliers from the pure statistical analysis.
T46	17436-17624	Epistemic_statement	denotes	Moreover, we can define the relative ratio of information involved in a certain conditional multinomial distribution with a fixed k as the k-MCV to the CMCV which will involve the k-MCV.
T47	17625-17718	Epistemic_statement	denotes	From Table 3, we can clearly see that the relative ratio of 7-MCV is the maximum, 839/1408.
T48	17923-18280	Epistemic_statement	denotes	DNA sequence databases have accumulated much data on biological evolution during billions of years; consequently, novel concepts and methods are urgently needed to reveal the biological functions of DNA sequence information and to investigate relationships of DNA sequences with biological evolution, cellular function, genetic mechanism and occurrence of illness.
T49	18451-18758	Epistemic_statement	denotes	From the conditional multinomial distribution profiles about nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species.
T50	19046-19135	Epistemic_statement	denotes	The phylogenetic tree can be obtained from the distance matrices using the UPGMA method.
T51	19389-19480	Epistemic_statement	denotes	4 also indicates that SARS-CoVs are not closely related to any groups and form a new group.
T52	19637-19813	Epistemic_statement	denotes	Thus this opinion can then be used to guide the development of more powerful measures for sequence comparison, with possible future improvement on the correlation structure of DNA.
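
The global inter-nucleotide distance (GIN) definition quoted in T13 can be checked against its own worked example. The following is a minimal Python sketch and not code from the annotated paper; the function name global_inter_nucleotide_distances is hypothetical, and it assumes the reading in which GIN(m) is the distance from position m to the next occurrence of the same symbol, or n − m when no later occurrence exists.

```python
def global_inter_nucleotide_distances(seq):
    """Return GIN(1..n) for a sequence, following the definition quoted in T13.

    GIN(m) is the smallest i >= 1 with s_m == s_{m+i}; if the symbol never
    recurs, GIN(m) = n - m (positions are 1-based in the definition).
    """
    n = len(seq)
    next_pos = {}      # symbol -> 0-based index of its nearest later occurrence
    gin = [0] * n
    # Scan right to left so the nearest later occurrence is always known.
    for idx in range(n - 1, -1, -1):
        sym = seq[idx]
        if sym in next_pos:
            gin[idx] = next_pos[sym] - idx
        else:
            gin[idx] = (n - 1) - idx   # n - m with m = idx + 1
        next_pos[sym] = idx
    return gin


if __name__ == "__main__":
    print(global_inter_nucleotide_distances("AGTTCTACCAGC"))
    # [6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0]
```

For the fragment AGTTCTACCAGC this reproduces the sequence 6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0 given in T13. The right-to-left scan keeps the computation linear in the sequence length.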

CORD-19:07e18fb2ba3bac9456e8afb29735fb91679840f9