PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T1	0-69	Sentence	denotes	Genome analysis with the conditional multinomial distribution profile
T2	71-79	Sentence	denotes	Abstract
T3	80-145	Sentence	denotes	The focus of the research is on the analysis of genome sequences.
T4	146-289	Sentence	denotes	Based on the inter-nucleotide distance sequence, we propose the conditional multinomial distribution profile for the complete genomic sequence.
T5	290-473	Sentence	denotes	These profiles can be used to define a very simple, computationally efficient, alignment-free, distance measure that reflects the evolutionary relationships between genomic sequences.
T6	474-639	Sentence	denotes	We use this distance measure to classify chromosomes according to species of origin and to build the phylogenetic tree of 24 complete coronavirus genome sequences.
T7	640-705	Sentence	denotes	Our results demonstrate the new method is powerful and efficient.
T8	707-828	Sentence	denotes	The great volume of available genomic data has made possible the analysis of large sets of organisms at the whole genome scale.
T9	829-1066	Sentence	denotes	However, given that most genomes contain millions to billions of nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computational complexity (Vinga and Almeida, 2003).
T10	1067-1177	Sentence	denotes	Consequently, considerable effort has been made to find alignment-free methods for sequence analysis.
T11	1178-1417	Sentence	denotes	The first, mainly based on graphical representation of the sequence, is very convenient for studying several selected cases (Liao et al., 2005; Jeffrey, 1990; Nandy, 1994; Nandy and Nandy, 1995, 2003; Randic and Vracko, 2000; Randic et al.,
T12	1418-1532	Sentence	denotes	2001, 2003a,b, 2006; Randic and Balaban, 2003; Randic, 2008; Zhang and Zhang, 1994).
T13	1533-1692	Sentence	denotes	One of the aims of graphical representation is to identify regions of interest or the distribution of bases along the sequence visually (Zhang and Zhang, 1994).
T14	1693-1788	Sentence	denotes	The second approach has been proposed to characterize the DNA sequence (Akhtar et al., 2007).
T15	1789-1973	Sentence	denotes	For this purpose one has to find representative descriptors that characterize an abstract mathematical representation of the biological sequence (Dai et al., 2006; He and Wang, 2002) .
T16	1974-2125	Sentence	denotes	A commonly used numerical characterization of the sequence is to consider binary sequences that describe the position of each nucleotide (Voss, 1992) .
T17	2126-2202	Sentence	denotes	Different approaches are described in a recent review (Nandy et al., 2006) .
T18	2203-2300	Sentence	denotes	Nair and Mahalashmi (2005) proposed the inter-nucleotide distance as a new DNA numerical profile.
T19	2301-2389	Sentence	denotes	Any DNA sequence can be converted into a unique numerical sequence with the same length.
T20	2390-2511	Sentence	denotes	In the representation, each number represents the distance of a nucleotide to the next occurrence of the same nucleotide.
T21	2512-2749	Sentence	denotes	Nair and Mahalashmi (2005) also applied the discrete Fourier transform to the inter-nucleotide distance sequence and indicated that this method can discriminate and highlight the promoter regions of gene sequences.
T22	2750-2838	Sentence	denotes	However, Akhtar and Epps (2008) showed that it has poor accuracy in exon prediction.
T23	2839-2994	Sentence	denotes	Afreixo et al. (2009) developed a new method to analyze the inter-nucleotide distance sequence and extracted some interesting features of the DNA sequence.
T24	2995-3120	Sentence	denotes	Four nucleotide internucleotide distance distributions and a global distance distribution were given to each genome sequence.
T25	3121-3267	Sentence	denotes	In each nucleotide internucleotide distance distribution, only the total number of three other nucleotides was considered (Afreixo et al., 2009) .
T26	3268-3382	Sentence	denotes	In fact, we can extract more information about the genome sequence from their inter-nucleotide distance sequences.
T27	3383-3523	Sentence	denotes	Motivated by the aforementioned work, we construct four conditional multinomial distributions from four inter-nucleotide distance sequences.
T28	3524-3772	Sentence	denotes	In the case of the inter-nucleotide distance sequence for nucleotide A, the numbers of nucleotides C, G and T follow a multinomial distribution, given that the inter-nucleotide distance is k.
T29	3773-3859	Sentence	denotes	This multinomial distribution will be called the conditional multinomial distribution.
T30	3860-4009	Sentence	denotes	The relative error vector derived from the conditional multinomial distribution then can be used as a genomic signature that identifies each species.
T31	4010-4100	Sentence	denotes	This approach allows us to perform comparative analysis between complete genome sequences.
T32	4101-4273	Sentence	denotes	In fact, we propose a new evolutionary information representation, complete multinomial composition vector (CMCV), by using a collection of multinomial composition vectors.
T33	4274-4427	Sentence	denotes	These multinomial composition vectors are built on the relative error vectors of conditional multinomial distributions with k, where k is within a range.
T34	4428-4570	Sentence	denotes	The range of k is determined to ensure that the CMCV contains the largest amount of evolutionary information hidden in the whole genomic data.
T35	4571-4688	Sentence	denotes	We then define the evolutionary distance between two genomes based on their complete multinomial composition vectors.
T36	4689-4770	Sentence	denotes	The proposed method is tested by phylogenetic analysis on 24 coronavirus genomes.
T37	4771-4841	Sentence	denotes	Our results demonstrate that the new method is powerful and efficient.
T38	4842-4956	Sentence	denotes	A DNA sequence of length n can be viewed as a linear sequence of n symbols from a finite alphabet $N = \{A, C, G, T\}$.
T39	4957-5044	Sentence	denotes	The inter-nucleotide distance was originally introduced by Nair and Mahalashmi (2005) .
T40	5045-5134	Sentence	denotes	The global inter-nucleotide distance sequence, referred to as GIN, was defined as follows:
T41	5135-5265	Sentence	denotes	Given a DNA sequence $S = s_1, s_2, \ldots, s_n$, $GIN(m) = k$, where $k$ is the minimum value of $i$ such that $s_m = s_{m+i}$ with $m + i \le n$; otherwise $k = n - m$.
T42	5266-5384	Sentence	denotes	As an example, the GIN for the short DNA fragment AGTTCTACCAGC is GIN = 6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0.
T43	5385-5515	Sentence	denotes	From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence for each nucleotide $x \in N$.
T44	5516-5617	Sentence	denotes	Four inter-nucleotide distance sequences for the same short DNA segment used previously were given as
T45	5618-5725	Sentence	denotes	A similar inter-nucleotide distance sequence for the nucleotide $x \in N$ was defined by Afreixo et al. (2009).
T46	5726-5811	Sentence	denotes	The four inter-nucleotide distance sequences for the short DNA fragment AGTTCTACCAGC:
T47	5812-5863	Sentence	denotes	considering that the symbolic sequence is circular.
T48	5864-6133	Sentence	denotes	The corresponding global distance sequence, referred to as CIN, is exemplified below for the same short DNA segment used previously, CIN = 6, 9, 1, 2, 3, 9, 2, 1, 3, 3, 3, 5, which is slightly different from the non-circular approach used by Nair and Mahalashmi (2005).
T49	6134-6348	Sentence	denotes	From the definition of the inter-nucleotide distance sequence, we clearly see that only the total number of the three other nucleotides was considered in each inter-nucleotide distance sequence (Afreixo et al., 2009).
T50	6349-6471	Sentence	denotes	In fact, we can count the number of each nucleotide in the genome sequence from its inter-nucleotide distance sequence.
T51	6472-6602	Sentence	denotes	Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences.
T52	6603-6692	Sentence	denotes	We take the inter-nucleotide distance sequence for nucleotide A ($CIN_A$) as an example.
T53	6693-6883	Sentence	denotes	Considering the case of $CIN_A = k$ ($k = 1, 2, \ldots$), let $p_{C|A}$, $p_{G|A}$ and $p_{T|A}$ be the occurrence probabilities of nucleotides C, G and T, respectively, between the two nearest occurrences of nucleotide A.
T54	6884-7119	Sentence	denotes	If the nucleotide sequence were generated by an independent and identically distributed (i.i.d.) random process, the numbers of nucleotides C, G and T between the two nearest occurrences of nucleotide A would follow a multinomial distribution.
T55	7120-7285	Sentence	denotes	In fact, the joint probability function of $(N_{C|A}, N_{G|A}, N_{T|A})$ (the numbers of C, G and T, respectively, between the two nearest occurrences of nucleotide A, given that $CIN_A = k$) is
T56	7286-7291	Sentence	denotes	where
T57	7292-7547	Sentence	denotes	The nucleotide occurrence probability $p_{C|A}$ is estimated by the relative frequency $TN_{C|A} / ((k-1) \cdot N_A)$, where $N_A$ is the number of times that $CIN_A = k$ occurs and $TN_{C|A}$ is the total number of C between the two nearest occurrences of nucleotide A when the inter-nucleotide distance $CIN_A = k$.
T58	7548-7640	Sentence	denotes	The nucleotide occurrence probabilities $p_{G|A}$ and $p_{T|A}$ can be obtained in the same way.
T59	7641-8006	Sentence	denotes	The term reference conditional multinomial distribution, applied to a DNA sequence, describes the distribution that the numbers of nucleotides C, G and T would follow, given the inter-nucleotide distance sequence for nucleotide A, if the nucleotides were determined randomly and independently of each other, with probabilities equal to the relative conditional frequencies.
T60	8007-8161	Sentence	denotes	From the perspective of molecular evolution, conditional multinomial distribution may reflect both the results of random mutation and selective evolution.
T61	8162-8279	Sentence	denotes	Mutations take place randomly at the molecular level, and natural selection shapes the direction of evolution.
T62	8280-8351	Sentence	denotes	Many neutral mutations may remain and play a role of random background.
T63	8352-8549	Sentence	denotes	One should subtract the random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006) .
T64	8550-8783	Sentence	denotes	In this work, we propose a new conditional multinomial distribution representation that captures the relative difference between a biological sequence and a sequence generated by an independent random process, in order to remove the random background.
T65	8784-8953	Sentence	denotes	For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a certain nucleotide $x \in \{A, C, G, T\}$.
T66	8954-9089	Sentence	denotes	For a certain pattern a from the conditional multinomial distribution, we can define the multinomial composition value $p(a)$ as follows:
T67	9090-9294	Sentence	denotes	where $f_x^0(a|k)$ is the measured relative frequency of the pattern a, and $f_x(a|k)$, the relative frequency of the pattern a under the reference conditional multinomial distribution, can be computed by (1).
T68	9295-9525	Sentence	denotes	All these multinomial composition values can be sorted in some order to form a vector $V_x(S|k) = (p_x(a_1|k), p_x(a_2|k), \ldots, p_x(a_m|k))$ for the genome S, where m denotes the total number of patterns under consideration.
T69	9526-9679	Sentence	denotes	Moreover, the four vectors $V_A(S|k)$, $V_G(S|k)$, $V_C(S|k)$ and $V_T(S|k)$ are sorted in some order to form a vector $V(S|k)$ that represents the whole genome S.
T70	9680-9813	Sentence	denotes	The vector defined by all these multinomial composition values is referred to as the k-order multinomial composition vector (k-MCV).
T71	9814-9938	Sentence	denotes	For a single fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information.
T72	9939-10154	Sentence	denotes	The complete multinomial composition vector (CMCV) of the whole genome is the concatenation of $V(S|3), V(S|4), \ldots, V(S|k)$, denoted by CMCV (S, k), with the intention of using as much genomic information as possible.
T73	10155-10268	Sentence	denotes	We begin with the largest fragments of available DNA sequences, the chromosomes of eukaryotes listed in Table 1 .
T74	10269-10405	Sentence	denotes	The conditional multinomial distribution profile shown in Fig. 1 corresponds to three different chromosomes of Saccharomyces cerevisiae.
T75	10406-10588	Sentence	denotes	In the case of the inter-nucleotide distance sequence $CIN_A$ ($k = 5$), we first convert the possible values of $(N_{C|A}, N_{G|A}, N_{T|A})$ into a one-dimensional index in alphabetical order.
T76	10589-10726	Sentence	denotes	Second, we plot the measured conditional multinomial distribution as bars and the reference conditional multinomial distribution as a line.
T77	10727-10849	Sentence	denotes	We clearly see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of S. cerevisiae.
T78	10850-10900	Sentence	denotes	The three other cases are obtained in a similar way.
T79	10901-11042	Sentence	denotes	We see again the similarity between the conditional multinomial distribution profiles about a certain nucleotide for the various chromosomes.
T80	11043-11141	Sentence	denotes	If we repeat this experiment for the chromosomes of Caenorhabditis elegans we get the same result.
T81	11142-11349	Sentence	denotes	Again, when we plot the conditional multinomial distribution profile about a certain nucleotide we see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of C. elegans.
T82	11350-11418	Sentence	denotes	We demonstrate this with three chromosomes of C. elegans in Fig. 2 .
T83	11419-11721	Sentence	denotes	Again, while the pattern of peaks and valleys in the conditional multinomial distribution profile about a certain nucleotide is the same for all chromosomes of C. elegans, this pattern is distinctly different from the pattern of peaks and valleys in the S. cerevisiae profile about the same nucleotide.
T84	11722-11765	Sentence	denotes	Finally we repeat the experiment for Mouse.
T85	11766-11797	Sentence	denotes	The result is shown in Fig. 3 .
T86	11798-12091	Sentence	denotes	Once more we obtain a sequence of peaks and valleys in the conditional multinomial distribution profile about a certain nucleotide which are the same for all chromosomes of Mouse, and this pattern of peaks and valleys is different from the pattern in the S. cerevisiae and C. elegans profiles.
T87	12092-12236	Sentence	denotes	The complete multinomial composition vector of each complete genome provides a simple, easily computable signature that identifies each species.
T88	12237-12359	Sentence	denotes	The signature can be used in applications where evolutionary relationships need to be deduced from large genomic sequences.
T89	12360-12469	Sentence	denotes	Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment.
T90	12470-12603	Sentence	denotes	Phylogenetic trees are generated by passing the pairwise distance matrix to the UPGMA method in the PHYLIP package (Felsensein, 1989).
T91	12604-12814	Sentence	denotes	The outbreak of atypical pneumonia in 2003, attributed to the severe acute respiratory syndrome coronavirus (SARS-CoV), drew more attention to the relationship between the SARS-CoVs and the other coronaviruses.
T92	12815-12972	Sentence	denotes	The 24 complete coronavirus genomes used in this paper were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses.
T93	12973-13075	Sentence	denotes	The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2 .
T94	13076-13158	Sentence	denotes	Generally, coronaviruses can be classified into three groups according to serotype.
T95	13159-13245	Sentence	denotes	Group I and group II contain mammalian viruses, whereas group III contains only avian viruses.
T96	13246-13332	Sentence	denotes	Many investigations have attempted to identify the phylogenetic position of SARS-CoVs.
T97	13333-13845	Sentence	denotes	However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any existing group and form a new group (Marra et al., 2003; Rota et al., 2003); a maximum likelihood tree built from a fragment of the spike protein favoured clustering SARS-CoVs with group II (Liò and Goldman, 2004); and an information-based method, which makes use of whole genome sequences, indicated that SARS-CoVs are close to group I rather than forming a new group (Yang et al., 2005).
T98	13846-13974	Sentence	denotes	Based on the complete multinomial composition vector, we build the phylogenetic tree of the 24 coronaviruses listed in Table 2 .
T99	13975-14138	Sentence	denotes	The phylogenetic tree is built using the UPGMA programs in the PHYLIP package and the distance matrix is computed using the Euclidean distance (Felsensein, 1989) .
T100	14139-14374	Sentence	denotes	Our results, based on analysis of the complete multinomial composition vectors of 24 coronavirus genomes, differ notably from the previous phylogenetic study using an information-based similarity index (Yang et al., 2005).
T101	14375-14554	Sentence	denotes	As can be seen from Fig. 4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV).
T102	14555-14654	Sentence	denotes	Our results also show that the group II viruses (BCoV, BCoVL, BCoVM, etc.) are grouped in a monophyletic clade.
T103	14655-14839	Sentence	denotes	This result is also mainly in accordance with the conclusions from the alignment-based method (Marra et al., 2003; Rota et al., 2003) and the alignment-free method (Liu et al., 2007).
T104	14840-14930	Sentence	denotes	Moreover, the Robinson-Foulds distance between our tree and that of Liu et al. (2007) is only 26.
T105	14931-15040	Sentence	denotes	The selection of k in CMCV (S, k) is very important to capture rich evolutionary information of DNA sequence.
T106	15041-15118	Sentence	denotes	In the case of k = 1, there is no nucleotide between two adjacent nucleotides.
T107	15119-15201	Sentence	denotes	In the case of k = 2, there is only one nucleotide between two adjacent nucleotides.
T108	15202-15277	Sentence	denotes	Therefore the multinomial composition value of a certain pattern a is zero.
T109	15278-15341	Sentence	denotes	The CMCV does not contain these multinomial composition values.
T110	15342-15433	Sentence	denotes	Certainly, a large value of k will give a vector containing finer evolutionary information.
T111	15434-15540	Sentence	denotes	However, many patterns will not occur in the conditional multinomial distribution with a large value of k.
T112	15541-15667	Sentence	denotes	From the view of information theory, some information may be lost and noise will dominate if a large value of k is considered.
T113	15668-15845	Sentence	denotes	To determine the upper bound of the value of k, we will introduce a scoring scheme to estimate how important a conditional multinomial distribution is. $\chi^2$-test scoring scheme:
T114	15846-16012	Sentence	denotes	For a fixed k, let a be a pattern in the conditional multinomial distribution, with its multinomial composition value $p(a, i|k)$ in genome i (which can be found in the k-MCV).
T115	16013-16247	Sentence	denotes	Define the expected multinomial composition value for pattern a to be the average of all composition values across all whole genomes, denoted $E[p(a|k)]$, i.e. $E[p(a|k)] = \frac{1}{n} \sum_{i=1}^{n} p(a, i|k)$, assuming n genomes in the dataset.
T116	16248-16382	Sentence	denotes	The standard $\chi^2$-test measures the deviation of a set of values from its expected value by summing up the deviations of each element.
T117	16383-16448	Sentence	denotes	Clearly, the higher value it has, the more valuable pattern a is.
T118	16449-16539	Sentence	denotes	Thus, we may define a score for the conditional multinomial distribution with a fixed k as
T119	16540-16639	Sentence	denotes	where the first sum is for all patterns of the conditional multinomial distribution with a fixed k.
T120	16640-16865	Sentence	denotes	We believe by considerably extending the basic pattern counting idea and thus studying their underlying distribution, we are able to discover unusual patterns to automatically distinguish their roles in shaping the evolution.
T121	16866-17080	Sentence	denotes	In this case, the k-MCV whose conditional multinomial distribution has the largest score might be considered the most representative for the species, rather than an abnormal outlier in a purely statistical sense.
T122	17081-17251	Sentence	denotes	Table 3 lists the scores of the conditional multinomial distribution for each fixed k in the range [3, 9], computed from the dataset of 24 complete coronavirus genomes.
T123	17252-17333	Sentence	denotes	The score of the CMCV can be defined as the sum of the scores of the k-MCVs involved in the CMCV.
T124	17334-17435	Sentence	denotes	From Table 3, it is clear that there is no large difference after the 7-MCV is added to the CMCV.
T125	17436-17624	Sentence	denotes	Moreover, we can define the relative ratio of information involved in a certain conditional multinomial distribution with a fixed k as the ratio of the score of the k-MCV to that of the CMCV that includes the k-MCV.
T126	17625-17718	Sentence	denotes	From Table 3, we can clearly see that the relative ratio of the 7-MCV is the maximum, 839/1408.
T127	17719-17830	Sentence	denotes	Therefore, we select CMCV (S,7) to represent the genome S in the phylogenetic analysis of the 24 coronaviruses.
T128	17831-17922	Sentence	denotes	Description and comparison of DNA sequences are still important subjects in bioinformatics.
T129	17923-18280	Sentence	denotes	DNA sequence databases have accumulated much data on billions of years of biological evolution; consequently, novel concepts and methods are urgently needed to reveal the biological functions encoded in DNA sequence information and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of illness.
T130	18281-18450	Sentence	denotes	In this paper, we propose four conditional multinomial distributions about each nucleotide for complete genome sequence based on the inter-nucleotide distance sequences.
T131	18451-18758	Sentence	denotes	From the conditional multinomial distribution profiles about nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species.
T132	18759-18901	Sentence	denotes	Therefore, it is straightforward to generate a phylogenetic tree based on the Euclidean distances of complete multinomial composition vectors.
T133	18902-19045	Sentence	denotes	In order to test the validity of our method, we select the complete genome sequences of 24 coronaviruses which were used by Liu et al. (2007) .
T134	19046-19135	Sentence	denotes	The phylogenetic tree can be obtained from the distance matrix using the UPGMA method.
T135	19136-19291	Sentence	denotes	Fig. 4 is the phylogenetic tree of the 24 genome sequences based on the distance matrix of the complete multinomial composition vectors, using the UPGMA method.
T136	19292-19383	Sentence	denotes	We find that the tree is mainly consistent with the tree constructed by Liu et al. (2007) .
T137	19384-19480	Sentence	denotes	Fig. 4 also indicates that SARS-CoVs are not closely related to any groups and form a new group.
T138	19481-19636	Sentence	denotes	Overall our results highlight that the conditional multinomial distribution profiles have the ability to extract more information from the genome sequence.
T139	19637-19813	Sentence	denotes	This insight can thus guide the development of more powerful measures for sequence comparison, with possible future improvements based on the correlation structure of DNA.
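
The distance definitions and the estimate of the conditional probabilities described in the sentences above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`gin`, `cin`, `gap_compositions`, `estimate_probs`, `reference_pmf`) are hypothetical, the input is assumed to contain only the uppercase letters A, C, G and T, and equation (1), which is missing from this export, is assumed to be the standard multinomial probability function with k - 1 trials. The non-circular result reproduces the GIN example given for AGTTCTACCAGC. For the circular variant this sketch returns 3 at the seventh position, where the example in the text lists 2.

```python
from collections import Counter
from math import factorial, prod

BASES = "ACGT"  # assumed alphabet; other symbols are not handled


def gin(seq):
    """Non-circular distance to the next occurrence of the same base.

    If the base does not occur again, the value is n - m for 1-based m,
    which is n - 1 - m for the 0-based index used here.
    """
    n = len(seq)
    out = []
    for m, base in enumerate(seq):
        nxt = seq.find(base, m + 1)
        out.append(nxt - m if nxt != -1 else n - 1 - m)
    return out


def cin(seq):
    """Circular variant: the search wraps around the end of the sequence."""
    doubled = seq + seq
    return [doubled.find(base, m + 1) - m for m, base in enumerate(seq)]


def gap_compositions(seq, x, k):
    """Count the compositions of the k - 1 bases between consecutive x's.

    Returns a Counter keyed by tuples of counts of the three other bases,
    in alphabetical order, restricted to gaps with circular distance k.
    """
    others = [b for b in BASES if b != x]
    doubled = seq + seq
    dist = cin(seq)
    counts = Counter()
    for m, base in enumerate(seq):
        if base == x and dist[m] == k:
            gap = doubled[m + 1 : m + k]
            counts[tuple(gap.count(b) for b in others)] += 1
    return counts


def estimate_probs(counts, k):
    """Relative frequencies TN / ((k - 1) * N_x) for the three other bases."""
    if k < 2 or not counts:
        raise ValueError("need k >= 2 and at least one observed gap")
    n_x = sum(counts.values())
    totals = [sum(t[i] * c for t, c in counts.items()) for i in range(3)]
    return tuple(tn / ((k - 1) * n_x) for tn in totals)


def reference_pmf(pattern, probs):
    """Multinomial probability of a pattern (assumed form of equation (1))."""
    coeff = factorial(sum(pattern))
    for n_i in pattern:
        coeff //= factorial(n_i)
    return coeff * prod(p ** n_i for p, n_i in zip(probs, pattern))


if __name__ == "__main__":
    fragment = "AGTTCTACCAGC"
    print(gin(fragment))  # [6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0]
    observed = gap_compositions(fragment, "A", 3)
    probs = estimate_probs(observed, 3)
    for pattern, count in observed.items():
        print(pattern, count, reference_pmf(pattern, probs))
```

Doubling the string is a simple way to model the circular sequence at the cost of extra memory; for whole chromosomes a single pass that remembers the last position of each base would be more economical.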

T1

0-69

Sentence

denotes

Genome analysis with the conditional multinomial distribution profile

T2

71-79

Sentence

denotes

Abstract

T3

80-145

Sentence

denotes

The focus of the research is on the analysis of genome sequences.

T4

146-289

Sentence

denotes

Based on the inter-nucleotide distance sequence, we propose the conditional multinomial distribution profile for the complete genomic sequence.

T5

290-473

Sentence

denotes

These profiles can be used to define a very simple, computationally efficient, alignment-free, distance measure that reflects the evolutionary relationships between genomic sequences.

T6

474-639

Sentence

denotes

We use this distance measure to classify chromosomes according to species of origin, to build the phylogenetic tree of 24 complete genome sequences of coronaviruses.

T7

640-705

Sentence

denotes

Our results demonstrate the new method is powerful and efficient.

T8

707-828

Sentence

denotes

A great volume of available genomic data has made possible analysis of large sets of organisms at the whole genome scale.

T9

829-1066

Sentence

denotes

However, given that most genomes contain millions to billion nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computation complexity (Vinga and Almeida, 2003) .

T10

1067-1177

Sentence

denotes

Consequently, considerable efforts have been made to seek for the alignment-free method for sequence analysis.

T11

1178-1417

Sentence

denotes

The first, mainly based on graphical representation of the sequence, is very convenient for studying several selected cases (Liao et al., 2005; Jeffrey, 1990; Nandy, 1994; Nandy and Nandy, 1995, 2003; Randic and Vracko, 2000; Randic et al.

T12

1418-1532

Sentence

denotes

2001 Randic et al. , 2003a Randic et al. ,b, 2006 Randic and Balaban, 2003; Randic, 2008; Zhang and Zhang, 1994) .

T13

1533-1692

Sentence

denotes

One of the aim of graphical representation is to identity regions of interest or the distribution of base along the sequence visually (Zhang and Zhang, 1994) .

T14

1693-1788

Sentence

denotes

The second approach, has been proposed to characterize the DNA sequence (Akhtar et al., 2007) .

T15

1789-1973

Sentence

denotes

For this purpose one has to find representative descriptors that characterize an abstract mathematical representation of the biological sequence (Dai et al., 2006; He and Wang, 2002) .

T16

1974-2125

Sentence

denotes

A commonly used numerical characterization of the sequence is to consider binary sequences that describe the position of each nucleotide (Voss, 1992) .

T17

2126-2202

Sentence

denotes

Different approaches are described in a recent review (Nandy et al., 2006) .

T18

2203-2300

Sentence

denotes

Nair and Mahalashmi (2005) proposed the inter-nucleotide distance as a new DNA numerical profile.

T19

2301-2389

Sentence

denotes

Any DNA sequence can be converted into a unique numerical sequence with the same length.

T20

2390-2511

Sentence

denotes

In the representation, each number represents the distance of a nucleotide to the next occurrence of the same nucleotide.

T21

2512-2749

Sentence

denotes

Meanwhile Nair and Mahalashmi (2005) employed discrete Fourier transformation to the inter-nucleotide distance sequence and indicated that this method has a discriminatory capability for highlighting the promoter region of gene sequence.

T22

2750-2838

Sentence

denotes

However, Akhtar and Epps (2008) proved that it has poor accuracy in the exon prediction.

T23

2839-2994

Sentence

denotes

Afreixo et al. (2009) developed a new method to analyze the inter-nucleotide distance sequence and extracted some interesting features of the DNA sequence.

T24

2995-3120

Sentence

denotes

Four nucleotide internucleotide distance distributions and a global distance distribution were given to each genome sequence.

T25

3121-3267

Sentence

denotes

In each nucleotide internucleotide distance distribution, only the total number of three other nucleotides was considered (Afreixo et al., 2009) .

T26

3268-3382

Sentence

denotes

In fact, we can extract more information about the genome sequence from their inter-nucleotide distance sequences.

T27

3383-3523

Sentence

denotes

Motivated by the aforementioned work, we construct four conditional multinomial distributions from four inter-nucleotide distance sequences.

T28

3524-3772

Sentence

denotes

In case of the inter-nucleotide distance sequence about nucleotide A, the number of the nucleotide C, the number of the nucleotide G and the number of the nucleotide T would follow multinomial distribution given that inter-nucleotide distance is k.

T29

3773-3859

Sentence

denotes

This multinomial distribution will be called the conditional multinomial distribution.

T30

3860-4009

Sentence

denotes

The relative error vector derived from the conditional multinomial distribution then can be used as a genomic signature that identifies each species.

T31

4010-4100

Sentence

denotes

This approach allows us to perform comparative analysis between complete genome sequences.

T32

4101-4273

Sentence

denotes

In fact, we propose a new evolutionary information representation, complete multinomial composition vector (CMCV), by using a collection of multinomial composition vectors.

T33

4274-4427

Sentence

denotes

These multinomial composition vectors are built on the relative error vectors of conditional multinomial distributions with k, where k is within a range.

T34

4428-4570

Sentence

denotes

The range of k is determined to ensure that the CMCV contains the largest amount of evolutionary information hidden in the whole genomic data.

T35

4571-4688

Sentence

denotes

We then define the evolutionary distance between two genomes based on their complete multinomial composition vectors.

T36

4689-4770

Sentence

denotes

The proposed method is tested by phylogenetic analysis on 24 coronavirus genomes.

T37

4771-4841

Sentence

denotes

Our results demonstrate that the new method is powerful and efficient.

T38

4842-4956

Sentence

denotes

A DNA sequence, of length n, can be viewed as a linear sequence of n symbols from a finite alphabet N ¼ fA,C,G,Tg.

T39

4957-5044

Sentence

denotes

The inter-nucleotide distance was originally introduced by Nair and Mahalashmi (2005) .

T40

5045-5134

Sentence

denotes

The global inter-nucleotide distance sequence referred to as GIN, was defined as follows:

T41

5135-5265

Sentence

denotes

Given a DNA sequence S ¼ s 1 ,s 2 , . . . ,s n , GINðmÞ ¼ k, where k¼min value of i such that s m ¼ s m þ i ,m þ i rn else k¼nÀ m.

T42

5266-5384

Sentence

denotes

We show below, as an example, the GIN for a short DNA fragment AGTTCTACCAGC is given as GIN ¼ 6,9,1,2,3,6,3,1,3,2,1,0:

T43

5385-5515

Sentence

denotes

From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence to the nucleotide x A N.

T44

5516-5617

Sentence

denotes

Four inter-nucleotide distance sequences for the same short DNA segment used previously were given as

T45

5618-5725

Sentence

denotes

A similar inter-nucleotide distance sequence to the nucleotide x A N was defined by Afreixo et al. (2009) .

T46

5726-5811

Sentence

denotes

The four inter-nucleotide distance sequences for the short DNA fragment AGTTCTACCAGC:

T47

5812-5863

Sentence

denotes

considering that the symbolic sequence is circular.

T48

5864-6133

Sentence

denotes

The corresponding global distance sequence referred to as CIN is exemplified below for the same short DNA segment used previously, CIN ¼ 6, 9, 1, 2, 3, 9, 2, 1, 3, 3, 3, 5, which is slightly different from the non-circular approach used by Nair and Mahalashmi (2005) .

T49

6134-6348

Sentence

denotes

From the definition of the inter-nucleotide distance sequence, it is clear that each inter-nucleotide distance sequence accounts only for the total number of the three other nucleotides (Afreixo et al., 2009).

T50

6349-6471

Sentence

denotes

In fact, the number of each nucleotide in the genome sequence can be counted from its inter-nucleotide distance sequence.

T51

6472-6602

Sentence

denotes

Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences.

T52

6603-6692

Sentence

denotes

We take the inter-nucleotide distance sequence for nucleotide A ($\mathrm{CIN}_A$) as an example.

T53

6693-6883

Sentence

denotes

Considering the case of $\mathrm{CIN}_A = k$ ($k = 1, 2, \ldots$), let $p_{C|A}$, $p_{G|A}$, and $p_{T|A}$ be the occurrence probabilities of nucleotides C, G, and T, respectively, between two nearest occurrences of nucleotide A.

T54

6884-7119

Sentence

denotes

If the nucleotide sequence were generated by an independent and identically distributed (i.i.d.) random process, the numbers of nucleotides C, G, and T between two nearest occurrences of nucleotide A would follow a multinomial distribution.

T55

7120-7285

Sentence

denotes

In fact, the joint probability function of $(N_{C|A}, N_{G|A}, N_{T|A})$ (the numbers of C, G, and T, respectively, between two nearest occurrences of nucleotide A, given that $\mathrm{CIN}_A = k$) is

T56

7286-7291

Sentence

denotes

where
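Equation (1) is not reproduced in this text. A reconstruction, assuming the standard multinomial form implied by the i.i.d. hypothesis above and by the k - 1 positions that lie between two occurrences of A at distance k, is given here; the symbols $n_C$, $n_G$, and $n_T$ are introduced for illustration only.

```latex
\begin{equation}
P\bigl(N_{C|A}=n_C,\ N_{G|A}=n_G,\ N_{T|A}=n_T \mid \mathrm{CIN}_A = k\bigr)
  = \frac{(k-1)!}{n_C!\,n_G!\,n_T!}\,
    p_{C|A}^{n_C}\, p_{G|A}^{n_G}\, p_{T|A}^{n_T},
\qquad n_C + n_G + n_T = k-1 .
\tag{1}
\end{equation}
```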

T57

7292-7547

Sentence

denotes

The nucleotide occurrence probability $p_{C|A}$ is estimated by the relative frequency $TN_{C|A} / ((k-1) \cdot N_A)$, where $N_A$ is the number of times that $\mathrm{CIN}_A = k$ occurs and $TN_{C|A}$ is the total number of C between two nearest occurrences of nucleotide A when the inter-nucleotide distance $\mathrm{CIN}_A = k$.

T58

7548-7640

Sentence

denotes

The nucleotide occurrence probabilities $p_{G|A}$ and $p_{T|A}$ can be obtained in a similar way.
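The estimation step and the resulting reference distribution can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the circular distance definition, a standard multinomial probability for the reference distribution, and hypothetical helper names (`gaps_for`, `estimate_probs`, `reference_pmf`).

```python
from math import factorial


def gaps_for(seq, x, k):
    """Symbols lying between each pair of nearest x's at circular distance k."""
    n = len(seq)
    gaps = []
    for m, s in enumerate(seq):
        if s != x:
            continue
        i = 1
        while seq[(m + i) % n] != x:
            i += 1
        if i == k:
            gaps.append([seq[(m + j) % n] for j in range(1, k)])
    return gaps


def estimate_probs(seq, x, k):
    """Relative frequencies TN_{y|x} / ((k - 1) * N_x) for the three other nucleotides."""
    gaps = gaps_for(seq, x, k)
    others = [c for c in "ACGT" if c != x]
    total = (k - 1) * len(gaps)
    if total == 0:
        return {c: 0.0 for c in others}
    return {c: sum(g.count(c) for g in gaps) / total for c in others}


def reference_pmf(counts, probs):
    """Multinomial probability of a count pattern, e.g. {"C": 2, "G": 0, "T": 0}."""
    coef = factorial(sum(counts.values()))
    p = 1.0
    for c, n_c in counts.items():
        coef //= factorial(n_c)
        p *= probs[c] ** n_c
    return coef * p


probs = estimate_probs("AGTTCTACCAGC", "A", 3)
print(probs)                                             # {'C': 0.75, 'G': 0.25, 'T': 0.0}
print(reference_pmf({"C": 2, "G": 0, "T": 0}, probs))    # 0.5625
```

In the toy fragment, nucleotide A occurs at circular distance 3 twice, with CC and GC in between, so three of the four intervening symbols are C and one is G.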

T59

7641-8006

Sentence

denotes

The term reference conditional multinomial distribution, applied to a DNA sequence, describes the distribution that the numbers of nucleotides C, G, and T would follow, given the inter-nucleotide distance sequence for nucleotide A, if the nucleotides were determined randomly and independently of each other, with probabilities equal to the relative conditional frequencies.

T60

8007-8161

Sentence

denotes

From the perspective of molecular evolution, the conditional multinomial distribution may reflect the results of both random mutation and selective evolution.

T61

8162-8279

Sentence

denotes

Mutations take place randomly at the molecular level, and natural selection shapes the direction of evolution.

T62

8280-8351

Sentence

denotes

Many neutral mutations may remain and act as a random background.

T63

8352-8549

Sentence

denotes

One should subtract the random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006).

T64

8550-8783

Sentence

denotes

In this work, we propose a new conditional multinomial distribution representation that removes the random background by measuring the relative difference between a biological sequence and a sequence generated by an independent random process.

T65

8784-8953

Sentence

denotes

For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a given nucleotide $x \in \{A, C, G, T\}$.

T66

8954-9089

Sentence

denotes

For a given pattern a from the conditional multinomial distribution, we can define the multinomial composition value $p(a)$ as follows:

T67

9090-9294

Sentence

denotes

where $f'_x(a \mid k)$ is the measured relative frequency of the pattern a, and $f_x(a \mid k)$, the relative frequency of the pattern a under the reference conditional multinomial distribution, can be computed by (1).
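The defining equation for the composition value is not reproduced in this text. A minimal sketch, assuming it is the relative difference between the measured and reference frequencies (an assumption suggested by the phrases "relative difference" and "relative error vector" used elsewhere in the paper, not a confirmed formula), could look like this; the function name and the zero-reference convention are hypothetical.

```python
def composition_value(measured, reference):
    """Assumed form: (f' - f) / f, taken as 0 when the reference frequency is 0."""
    if reference == 0:
        return 0.0
    return (measured - reference) / reference


# Hypothetical frequencies for two patterns of one nucleotide at a fixed k.
print(composition_value(0.5, 0.5625))   # measured below reference: negative value
print(composition_value(0.5, 0.375))    # measured above reference: positive value
```

Under this assumed form, a pattern that occurs exactly as often as the i.i.d. reference predicts contributes zero, so only deviations from the random background remain in the vector.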

T68

9295-9525

Sentence

denotes

All these multinomial composition values can be arranged in a fixed order to form a vector $V_x(S \mid k) = (p_x(a_1 \mid k), p_x(a_2 \mid k), \ldots, p_x(a_m \mid k))$ for the genome S, where m denotes the total number of patterns under consideration.

T69

9526-9679

Sentence

denotes

Moreover, the four vectors $V_A(S \mid k)$, $V_G(S \mid k)$, $V_C(S \mid k)$, and $V_T(S \mid k)$ are arranged in a fixed order to form a vector $V(S \mid k)$ that represents the whole genome S.

T70

9680-9813

Sentence

denotes

The vector defined by all these multinomial composition values is referred to as the k-order multinomial composition vector (k-MCV).

T71

9814-9938

Sentence

denotes

For a single fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information.

T72

9939-10154

Sentence

denotes

The complete multinomial composition vector (CMCV) of the whole genome is the concatenation of $V(S \mid 3), V(S \mid 4), \ldots, V(S \mid k)$, denoted by CMCV(S, k), with the intention of using as much genomic information as possible.

T73

10155-10268

Sentence

denotes

We begin with the largest fragments of available DNA sequences, the chromosomes of eukaryotes listed in Table 1.

T74

10269-10405

Sentence

denotes

The conditional multinomial distribution profile shown in Fig. 1 corresponds to three different chromosomes of Saccharomyces cerevisiae.

T75

10406-10588

Sentence

denotes

For the inter-nucleotide distance sequence $\mathrm{CIN}_A$ ($k = 5$), we first map the possible values of $(N_{C|A}, N_{G|A}, N_{T|A})$ to a one-dimensional index in alphabetical order.
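The mapping from count triples to a one-dimensional index can be sketched as below. The exact ordering used for the figures is not specified beyond "alphabetical"; this sketch assumes lexicographic order on $(n_C, n_G, n_T)$, and the function name `patterns` is hypothetical.

```python
def patterns(k):
    """All (n_C, n_G, n_T) with n_C + n_G + n_T = k - 1, in lexicographic order."""
    total = k - 1
    return [(c, g, total - c - g)
            for c in range(total + 1)
            for g in range(total - c + 1)]


# One-dimensional index of each pattern for k = 5 (four intervening positions).
index = {p: i for i, p in enumerate(patterns(5))}
print(len(index))         # 15
print(index[(0, 0, 4)])   # 0
print(index[(4, 0, 0)])   # 14
```

For k = 5 there are 15 possible patterns, so each profile for a single nucleotide has 15 positions on the horizontal axis under this assumption.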

T76

10589-10726

Sentence

denotes

Second, we plot the measured conditional multinomial distribution as bars and the reference conditional multinomial distribution as a line.

T77

10727-10849

Sentence

denotes

We clearly see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of S. cerevisiae.

T78

10850-10900

Sentence

denotes

The three other cases are obtained in a similar way.

T79

10901-11042

Sentence

denotes

We again see the similarity between the conditional multinomial distribution profiles for a given nucleotide across the various chromosomes.

T80

11043-11141

Sentence

denotes

If we repeat this experiment for the chromosomes of Caenorhabditis elegans, we get the same result.

T81

11142-11349

Sentence

denotes

Again, when we plot the conditional multinomial distribution profile about a certain nucleotide we see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of C. elegans.

T82

11350-11418

Sentence

denotes

We demonstrate this with three chromosomes of C. elegans in Fig. 2.

T83

11419-11721

Sentence

denotes

Again, while the pattern of peaks and valleys in the conditional multinomial distribution profile about a certain nucleotide is the same for all chromosomes of C. elegans, this pattern is distinctly different from the pattern of peaks and valleys in the S. cerevisiae profile about the same nucleotide.

T84

11722-11765

Sentence

denotes

Finally, we repeat the experiment for mouse.

T85

11766-11797

Sentence

denotes

The result is shown in Fig. 3.

T86

11798-12091

Sentence

denotes

Once more we obtain a pattern of peaks and valleys in the conditional multinomial distribution profile for a given nucleotide that is the same for all mouse chromosomes, and this pattern is different from the patterns in the S. cerevisiae and C. elegans profiles.

T87

12092-12236

Sentence

denotes

The complete multinomial composition vector of each complete genome provides a simple, easily computable signature that identifies each species.

T88

12237-12359

Sentence

denotes

The signature can be used in applications where evolutionary relationships need to be deduced from large genomic sequences.

T89

12360-12469

Sentence

denotes

Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment.

T90

12470-12603

Sentence

denotes

Phylogenetic trees are generated by applying the UPGMA method in the PHYLIP package (Felsenstein, 1989) to the pairwise distance matrix.

T91

12604-12814

Sentence

denotes

The 2003 outbreak of atypical pneumonia, referred to as severe acute respiratory syndrome (SARS), drew more attention to the relationship between SARS coronaviruses (SARS-CoVs) and other coronaviruses.

T92

12815-12972

Sentence

denotes

The 24 complete coronavirus genomes used in this paper were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses.

T93

12973-13075

Sentence

denotes

The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2.

T94

13076-13158

Sentence

denotes

Generally, coronaviruses can be classified into three groups according to serotype.

T95

13159-13245

Sentence

denotes

Group I and group II contain mammalian viruses, whereas group III contains only avian viruses.

T96

13246-13332

Sentence

denotes

Many investigations have attempted to identify the phylogenetic position of SARS-CoVs.

T97

13333-13845

Sentence

denotes

However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any of the groups and form a new group (Marra et al., 2003; Rota et al., 2003); a maximum likelihood tree built from a fragment of the spike protein favored clustering SARS-CoVs with group II (Liò and Goldman, 2004); and an information-based method, which makes use of whole genome sequences, indicated that SARS-CoVs are close to group I rather than forming a new group (Yang et al., 2005).

T98

13846-13974

Sentence

denotes

Based on the complete multinomial composition vector, we build the phylogenetic tree of the 24 coronaviruses listed in Table 2.

T99

13975-14138

Sentence

denotes

The phylogenetic tree is built using the UPGMA program in the PHYLIP package (Felsenstein, 1989), and the distance matrix is computed using the Euclidean distance.
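The distance computation can be sketched as follows. This is a minimal illustration assuming each genome is already represented by a composition vector of equal length; the genome labels, vector values, and function names are hypothetical, and tree construction itself is left to a UPGMA implementation.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two composition vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("composition vectors must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distance matrix for a {name: vector} mapping."""
    names = list(vectors)
    matrix = [[euclidean(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


# Toy vectors standing in for CMCV(S, k) of three genomes (values are made up).
toy = {"g1": [0.0, 0.0], "g2": [3.0, 4.0], "g3": [3.0, 0.0]}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with a zero diagonal, which is the form a distance-based clustering method such as UPGMA expects as input.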

T100

14139-14374

Sentence

denotes

Our results, based on analysis of the complete multinomial composition vectors of 24 coronavirus genomes, differ notably from the previous phylogenetic study that used an information-based similarity index (Yang et al., 2005).

T101

14375-14554

Sentence

denotes

As can be seen from Fig. 4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV).

T102

14555-14654

Sentence

denotes

Our results also show that the group II viruses (BCoV, BCoVL, BCoVM, etc.) are grouped in a monophyletic clade.

T103

14655-14839

Sentence

denotes

This result is also largely in accordance with the conclusions from the alignment-based method (Marra et al., 2003; Rota et al., 2003) and the alignment-free method (Liu et al., 2007).

T104

14840-14930

Sentence

denotes

Moreover, the Robinson-Foulds distance between our tree and that of Liu et al. (2007) is only 26.

T105

14931-15040

Sentence

denotes

The choice of k in CMCV(S, k) is very important for capturing rich evolutionary information from DNA sequences.

T106

15041-15118

Sentence

denotes

In the case of k = 1, there is no nucleotide between two nearest occurrences of the same nucleotide.

T107

15119-15201

Sentence

denotes

In the case of k = 2, there is only one nucleotide between two nearest occurrences of the same nucleotide.

T108

15202-15277

Sentence

denotes

Therefore, in both cases the multinomial composition value of any pattern a is zero.

T109

15278-15341

Sentence

denotes

The CMCV does not contain these multinomial composition values.

T110

15342-15433

Sentence

denotes

Certainly, a large value of k will give a vector containing finer evolutionary information.

T111

15434-15540

Sentence

denotes

However, many patterns will not occur in the conditional multinomial distribution with a large value of k.

T112

15541-15667

Sentence

denotes

From the viewpoint of information theory, some information may be lost and noise will dominate if a large value of k is considered.

T113

15668-15845

Sentence

denotes

To determine the upper bound of the value of k, we introduce a scoring scheme to estimate how important a conditional multinomial distribution is. $\chi^2$-test scoring scheme:

T114

15846-16012

Sentence

denotes

For a fixed k, let a be a pattern in the conditional multinomial distribution, with multinomial composition value $p(a, i \mid k)$ in genome i (which can be found in the k-MCV).

T115

16013-16247

Sentence

denotes

Define the expected multinomial composition value for pattern a as the average of its composition values across all genomes, denoted $E[p(a \mid k)]$, i.e., $E[p(a \mid k)] = (1/n) \sum_{i=1}^{n} p(a, i \mid k)$, assuming n genomes in the dataset.

T116

16248-16382

Sentence

denotes

The standard $\chi^2$-test measures the deviation of a set of values from its expected value by summing up the deviations of each element.

T117

16383-16448

Sentence

denotes

Clearly, the higher this value, the more valuable pattern a is.

T118

16449-16539

Sentence

denotes

Thus, we may define a score for the conditional multinomial distribution with a fixed k as

T119

16540-16639

Sentence

denotes

where the first sum is for all patterns of the conditional multinomial distribution with a fixed k.
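The score formula itself is not reproduced in this text. A minimal sketch, assuming the standard Pearson form (squared deviation from the expected value, divided by the expected value, summed first over patterns and then over genomes), is given below; the function name, the input layout, and the handling of a zero expected value are assumptions made for illustration.

```python
def chi2_score(values_by_pattern):
    """Assumed chi-square style score for one fixed k.

    values_by_pattern maps each pattern a to the list
    [p(a, 1 | k), ..., p(a, n | k)] of its composition values in n genomes.
    """
    score = 0.0
    for values in values_by_pattern.values():     # first sum: over patterns
        expected = sum(values) / len(values)      # E[p(a | k)]
        if expected == 0:
            continue  # assumption: skip patterns whose mean value is zero
        # second sum: over genomes
        score += sum((v - expected) ** 2 / expected for v in values)
    return score


# Made-up composition values for two patterns in two genomes.
print(chi2_score({"a1": [1.0, 3.0], "a2": [2.0, 2.0]}))   # 1.0
```

A pattern whose value is the same in every genome contributes nothing, while a pattern that varies strongly across genomes raises the score. Note that if composition values can be negative, the expected value in the denominator can be negative as well, so this literal Pearson form should be read as an illustration only.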

T120

16640-16865

Sentence

denotes

We believe that, by considerably extending the basic pattern-counting idea and studying the underlying distribution, we are able to discover unusual patterns and automatically distinguish their roles in shaping evolution.

T121

16866-17080

Sentence

denotes

In this case, the k-MCV whose conditional multinomial distribution has the largest score might be considered the most representative for the species, rather than an abnormal outlier as in a purely statistical analysis.

T122

17081-17251

Sentence

denotes

We list the score of the conditional multinomial distribution for each fixed k in the range [3, 9], computed from the dataset of 24 complete coronavirus genomes, in Table 3.

T123

17252-17333

Sentence

denotes

The score of a CMCV can be defined as the sum of the scores of the k-MCVs included in the CMCV.

T124

17334-17435

Sentence

denotes

From Table 3, it is clear that there is no large difference after the 7-MCV is added to the CMCV.

T125

17436-17624

Sentence

denotes

Moreover, we can define the relative ratio of information carried by the conditional multinomial distribution with a fixed k as the ratio of the score of the k-MCV to the score of the CMCV that includes the k-MCV.

T126

17625-17718

Sentence

denotes

From Table 3, we can clearly see that the relative ratio of the 7-MCV is the maximum, 839/1408.

T127

17719-17830

Sentence

denotes

Therefore, we select CMCV(S, 7) to represent the genome S in the phylogenetic analysis of the 24 coronaviruses.

T128

17831-17922

Sentence

denotes

Description and comparison of DNA sequences are still important subjects in bioinformatics.

T129

17923-18280

Sentence

denotes

DNA sequence databases have accumulated a great deal of data reflecting billions of years of biological evolution. Consequently, novel concepts and methods are urgently needed to reveal the biological functions encoded in DNA sequences and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms, and the occurrence of illness.

T130

18281-18450

Sentence

denotes

In this paper, we propose four conditional multinomial distributions, one for each nucleotide, for complete genome sequences based on the inter-nucleotide distance sequences.

T131

18451-18758

Sentence

denotes

From the conditional multinomial distribution profiles of nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species.

T132

18759-18901

Sentence

denotes

Therefore, it is straightforward to generate a phylogenetic tree based on the Euclidean distances of complete multinomial composition vectors.

T133

18902-19045

Sentence

denotes

To test the validity of our method, we select the complete genome sequences of the 24 coronaviruses that were used by Liu et al. (2007).

T134

19046-19135

Sentence

denotes

The phylogenetic tree can be obtained from the distance matrix using the UPGMA method.

T135

19136-19291

Sentence

denotes

Fig. 4 shows the phylogenetic tree of the 24 genome sequences based on the distance matrix of the complete multinomial composition vectors, built using the UPGMA method.

T136

19292-19383

Sentence

denotes

We find that the tree is largely consistent with the tree constructed by Liu et al. (2007).

T137

19384-19480

Sentence

denotes

Fig. 4 also indicates that SARS-CoVs are not closely related to any of the existing groups and form a new group.

T138

19481-19636

Sentence

denotes

Overall our results highlight that the conditional multinomial distribution profiles have the ability to extract more information from the genome sequence.

T139

19637-19813

Sentence

denotes

This insight can therefore guide the development of more powerful measures for sequence comparison, with possible future improvements based on the correlation structure of DNA.

CORD-19:07e18fb2ba3bac9456e8afb29735fb91679840f9 JSON TXT 9 Projects
