CORD-19:07e18fb2ba3bac9456e8afb29735fb91679840f9

Id Subject Object Predicate Lexical cue
T1 0-69 Sentence denotes Genome analysis with the conditional multinomial distribution profile
T2 71-79 Sentence denotes Abstract
T3 80-145 Sentence denotes The focus of the research is on the analysis of genome sequences.
T4 146-289 Sentence denotes Based on the inter-nucleotide distance sequence, we propose the conditional multinomial distribution profile for the complete genomic sequence.
T5 290-473 Sentence denotes These profiles can be used to define a very simple, computationally efficient, alignment-free, distance measure that reflects the evolutionary relationships between genomic sequences.
T6 474-639 Sentence denotes We use this distance measure to classify chromosomes according to species of origin and to build the phylogenetic tree of 24 complete genome sequences of coronaviruses.
T7 640-705 Sentence denotes Our results demonstrate the new method is powerful and efficient.
T8 707-828 Sentence denotes A great volume of available genomic data has made possible analysis of large sets of organisms at the whole genome scale.
T9 829-1066 Sentence denotes However, given that most genomes contain millions to billions of nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computational complexity (Vinga and Almeida, 2003).
T10 1067-1177 Sentence denotes Consequently, considerable effort has been devoted to finding alignment-free methods for sequence analysis.
T11 1178-1417 Sentence denotes The first approach, mainly based on graphical representation of the sequence, is very convenient for studying several selected cases (Liao et al., 2005; Jeffrey, 1990; Nandy, 1994; Nandy and Nandy, 1995, 2003; Randic and Vracko, 2000; Randic et al.,
T12 1418-1532 Sentence denotes 2001, 2003a,b, 2006; Randic and Balaban, 2003; Randic, 2008; Zhang and Zhang, 1994).
T13 1533-1692 Sentence denotes One of the aims of graphical representation is to identify regions of interest or the distribution of bases along the sequence visually (Zhang and Zhang, 1994).
T14 1693-1788 Sentence denotes The second approach has been proposed to characterize the DNA sequence (Akhtar et al., 2007).
T15 1789-1973 Sentence denotes For this purpose one has to find representative descriptors that characterize an abstract mathematical representation of the biological sequence (Dai et al., 2006; He and Wang, 2002).
T16 1974-2125 Sentence denotes A commonly used numerical characterization of the sequence is to consider binary sequences that describe the position of each nucleotide (Voss, 1992).
T17 2126-2202 Sentence denotes Different approaches are described in a recent review (Nandy et al., 2006).
T18 2203-2300 Sentence denotes Nair and Mahalashmi (2005) proposed the inter-nucleotide distance as a new DNA numerical profile.
T19 2301-2389 Sentence denotes Any DNA sequence can be converted into a unique numerical sequence with the same length.
T20 2390-2511 Sentence denotes In the representation, each number represents the distance of a nucleotide to the next occurrence of the same nucleotide.
T21 2512-2749 Sentence denotes Meanwhile, Nair and Mahalashmi (2005) applied the discrete Fourier transform to the inter-nucleotide distance sequence and indicated that this method has a discriminatory capability for highlighting the promoter regions of gene sequences.
T22 2750-2838 Sentence denotes However, Akhtar and Epps (2008) showed that it has poor accuracy in exon prediction.
T23 2839-2994 Sentence denotes Afreixo et al. (2009) developed a new method to analyze the inter-nucleotide distance sequence and extracted some interesting features of the DNA sequence.
T24 2995-3120 Sentence denotes Four nucleotide-specific inter-nucleotide distance distributions and a global distance distribution were assigned to each genome sequence.
T25 3121-3267 Sentence denotes In each nucleotide-specific inter-nucleotide distance distribution, only the total number of the three other nucleotides was considered (Afreixo et al., 2009).
T26 3268-3382 Sentence denotes In fact, we can extract more information about a genome sequence from its inter-nucleotide distance sequences.
T27 3383-3523 Sentence denotes Motivated by the aforementioned work, we construct four conditional multinomial distributions from four inter-nucleotide distance sequences.
T28 3524-3772 Sentence denotes For the inter-nucleotide distance sequence about nucleotide A, the numbers of nucleotides C, G and T would follow a multinomial distribution given that the inter-nucleotide distance is k.
T29 3773-3859 Sentence denotes This multinomial distribution will be called the conditional multinomial distribution.
T30 3860-4009 Sentence denotes The relative error vector derived from the conditional multinomial distribution then can be used as a genomic signature that identifies each species.
T31 4010-4100 Sentence denotes This approach allows us to perform comparative analysis between complete genome sequences.
T32 4101-4273 Sentence denotes In fact, we propose a new evolutionary information representation, complete multinomial composition vector (CMCV), by using a collection of multinomial composition vectors.
T33 4274-4427 Sentence denotes These multinomial composition vectors are built on the relative error vectors of conditional multinomial distributions with k, where k is within a range.
T34 4428-4570 Sentence denotes The range of k is determined to ensure that the CMCV contains the largest amount of evolutionary information hidden in the whole genomic data.
T35 4571-4688 Sentence denotes We then define the evolutionary distance between two genomes based on their complete multinomial composition vectors.
T36 4689-4770 Sentence denotes The proposed method is tested by phylogenetic analysis on 24 coronavirus genomes.
T37 4771-4841 Sentence denotes Our results demonstrate that the new method is powerful and efficient.
T38 4842-4956 Sentence denotes A DNA sequence of length n can be viewed as a linear sequence of n symbols from the finite alphabet $N = \{A, C, G, T\}$.
T39 4957-5044 Sentence denotes The inter-nucleotide distance was originally introduced by Nair and Mahalashmi (2005).
T40 5045-5134 Sentence denotes The global inter-nucleotide distance sequence, referred to as GIN, was defined as follows:
T41 5135-5265 Sentence denotes Given a DNA sequence $S = s_1, s_2, \ldots, s_n$, $\mathrm{GIN}(m) = k$, where $k$ is the minimum value of $i$ such that $s_m = s_{m+i}$ with $m + i \le n$; otherwise $k = n - m$.
T42 5266-5384 Sentence denotes As an example, the GIN for the short DNA fragment AGTTCTACCAGC is GIN = 6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0.
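The definition of GIN can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the authors' implementation; the function name `gin` is hypothetical, and the code uses 0-based indices internally.

```python
def gin(seq):
    """Non-circular global inter-nucleotide distance sequence (GIN).

    For each position m, return the distance to the next occurrence of the
    same symbol; if there is none, return the distance to the end of the
    sequence (n - m with 1-based m).
    """
    n = len(seq)
    next_pos = {}          # symbol -> index of its nearest occurrence to the right
    out = [0] * n
    for m in range(n - 1, -1, -1):      # scan from right to left
        s = seq[m]
        if s in next_pos:
            out[m] = next_pos[s] - m
        else:
            out[m] = (n - 1) - m        # no later occurrence of s
        next_pos[s] = m
    return out


print(gin("AGTTCTACCAGC"))  # [6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0]
```

The right-to-left scan keeps the computation linear in the sequence length, which matters for chromosome-scale inputs.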
T43 5385-5515 Sentence denotes From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence for each nucleotide $x \in N$.
T44 5516-5617 Sentence denotes Four inter-nucleotide distance sequences for the same short DNA segment used previously were given as
T45 5618-5725 Sentence denotes A similar inter-nucleotide distance sequence for the nucleotide $x \in N$ was defined by Afreixo et al. (2009).
T46 5726-5811 Sentence denotes The four inter-nucleotide distance sequences for the short DNA fragment AGTTCTACCAGC:
T47 5812-5863 Sentence denotes considering that the symbolic sequence is circular.
T48 5864-6133 Sentence denotes The corresponding global distance sequence, referred to as CIN, is exemplified below for the same short DNA segment used previously: CIN = 6, 9, 1, 2, 3, 9, 3, 1, 3, 3, 3, 5, which is slightly different from the non-circular approach used by Nair and Mahalashmi (2005).
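The circular variant can be sketched in the same way. The function names `cin` and `cin_for` are hypothetical, and the per-nucleotide sequence is assumed here to be the sub-sequence of CIN values at the positions occupied by the nucleotide x.

```python
def cin(seq):
    """Circular global inter-nucleotide distance sequence (CIN).

    The sequence is treated as circular, so every symbol has a next
    occurrence; a symbol that occurs only once gets distance len(seq).
    """
    n = len(seq)
    doubled = seq + seq    # unrolling the circle once is enough
    next_pos = {}
    out = [0] * n
    for m in range(2 * n - 1, -1, -1):
        s = doubled[m]
        if m < n:
            out[m] = next_pos[s] - m   # next_pos[s] was set at or before m + n
        next_pos[s] = m
    return out


def cin_for(seq, x):
    """Assumed definition: CIN values at the positions where seq equals x."""
    return [d for s, d in zip(seq, cin(seq)) if s == x]


fragment = "AGTTCTACCAGC"
for x in "ACGT":
    print(x, cin_for(fragment, x))
```

A useful sanity check is that, for each nucleotide present in the sequence, its circular distances sum to the sequence length.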
T49 6134-6348 Sentence denotes From the definition of the inter-nucleotide distance sequence, we clearly see that only the total number of the three other nucleotides was considered in each inter-nucleotide distance sequence (Afreixo et al., 2009).
T50 6349-6471 Sentence denotes In fact, we can count the number of each nucleotide in the genome sequence from its inter-nucleotide distance sequence.
T51 6472-6602 Sentence denotes Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences.
T52 6603-6692 Sentence denotes We take the inter-nucleotide distance sequence about nucleotide A ($\mathrm{CIN}_A$) as an example.
T53 6693-6883 Sentence denotes Considering the case $\mathrm{CIN}_A = k$ ($k = 1, 2, \ldots$), let $p_{C|A}$, $p_{G|A}$ and $p_{T|A}$ be the occurrence probabilities of nucleotides C, G and T, respectively, between two nearest occurrences of nucleotide A.
T54 6884-7119 Sentence denotes If the nucleotide sequence were generated by an independent and identically distributed (i.i.d.) random process, the numbers of nucleotides C, G and T between two nearest occurrences of nucleotide A would follow a multinomial distribution.
T55 7120-7285 Sentence denotes In fact, the joint probability function of $(N_{C|A}, N_{G|A}, N_{T|A})$ (the numbers of C, G and T, respectively, between two nearest occurrences of nucleotide A, given that $\mathrm{CIN}_A = k$) is
T56 7286-7291 Sentence denotes where
T57 7292-7547 Sentence denotes The nucleotide occurrence probability $p_{C|A}$ is estimated by the relative frequency $TN_{C|A} / ((k-1) \cdot N_A)$, where $N_A$ is the number of times that $\mathrm{CIN}_A = k$ and $TN_{C|A}$ is the total number of C between two nearest occurrences of nucleotide A when the inter-nucleotide distance $\mathrm{CIN}_A = k$.
T58 7548-7640 Sentence denotes The nucleotide occurrence probabilities $p_{G|A}$ and $p_{T|A}$ can be obtained in a similar way.
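For reference, under the i.i.d. assumption the joint probability function of the three counts takes the standard multinomial form. The version below is a sketch that assumes $k-1$ trials (the number of positions strictly between two occurrences of A at distance $k$); the lower-case symbols $n_C$, $n_G$, $n_T$ are notation introduced here for the observed counts.

```latex
P\left(N_{C|A}=n_C,\; N_{G|A}=n_G,\; N_{T|A}=n_T \,\middle|\, \mathrm{CIN}_A = k\right)
  = \frac{(k-1)!}{n_C!\, n_G!\, n_T!}\;
    p_{C|A}^{\,n_C}\, p_{G|A}^{\,n_G}\, p_{T|A}^{\,n_T},
\qquad n_C + n_G + n_T = k-1 .
```

With the estimates described above, $TN_{C|A} + TN_{G|A} + TN_{T|A} = (k-1) \cdot N_A$, so the three estimated probabilities sum to one, as the multinomial form requires.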
T59 7641-8006 Sentence denotes The term reference conditional multinomial distribution, applied to a DNA sequence, describes the distribution that the numbers of nucleotides C, G and T would follow, given the inter-nucleotide distance sequence about nucleotide A, if the nucleotides were determined randomly and independently of each other, with probabilities equal to the relative conditional frequencies.
T60 8007-8161 Sentence denotes From the perspective of molecular evolution, conditional multinomial distribution may reflect both the results of random mutation and selective evolution.
T61 8162-8279 Sentence denotes Mutations take place randomly at the molecular level, and natural selection shapes the direction of evolution.
T62 8280-8351 Sentence denotes Many neutral mutations may remain and play the role of a random background.
T63 8352-8549 Sentence denotes One should subtract the random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006).
T64 8550-8783 Sentence denotes In this work, we propose a new conditional multinomial distribution representation, which reveals the relative difference between a biological sequence and a sequence generated by an independent random process, in order to remove the random background.
T65 8784-8953 Sentence denotes For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a given nucleotide $x \in \{A, C, G, T\}$.
T66 8954-9089 Sentence denotes For a given pattern $a$ from the conditional multinomial distribution, we can define the multinomial composition value $p(a)$ as follows:
T67 9090-9294 Sentence denotes where $f_x'(a|k)$ is the measured relative frequency of the pattern $a$, and the relative frequency $f_x(a|k)$ of the pattern $a$ under the reference conditional multinomial distribution can be computed by (1).
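The comparison of measured and reference frequencies can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sequence is treated as circular, the reference frequency is the multinomial probability with the estimated occurrence probabilities, and the composition value is ASSUMED to be the relative difference (measured minus reference, divided by reference), in line with the "relative error vector" mentioned in the text. All function names are hypothetical.

```python
from collections import Counter
from math import factorial


def gaps_about(seq, x, k):
    """Count tuples of the other three nucleotides (alphabetical order)
    between circularly consecutive occurrences of x that are k apart."""
    n = len(seq)
    others = [c for c in "ACGT" if c != x]
    pos = [i for i, c in enumerate(seq) if c == x]
    patterns = []
    for j, p in enumerate(pos):
        q = pos[(j + 1) % len(pos)]
        d = (q - p) % n or n                 # distance n if x occurs only once
        if d == k:
            between = [seq[(p + t) % n] for t in range(1, k)]
            patterns.append(tuple(between.count(c) for c in others))
    return patterns


def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` with cell probabilities `probs`."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    value = float(coef)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value


def composition_values(seq, x, k):
    """Map each observed pattern to its (assumed) relative difference."""
    patterns = gaps_about(seq, x, k)
    n_x = len(patterns)                      # number of times CIN_x = k
    if n_x == 0 or k < 2:
        return {}
    totals = [sum(p[i] for p in patterns) for i in range(3)]
    probs = [t / ((k - 1) * n_x) for t in totals]   # estimated probabilities
    values = {}
    for a, count in Counter(patterns).items():
        f_meas = count / n_x                 # measured relative frequency
        f_ref = multinomial_pmf(a, probs)    # reference relative frequency
        values[a] = (f_meas - f_ref) / f_ref # ASSUMED relative-difference form
    return values


print(composition_values("AGTTCTACCAGC", "A", 3))
```

Only patterns that actually occur are returned; under the assumed form, a pattern that never occurs but has non-zero reference probability would have the value -1.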
T68 9295-9525 Sentence denotes All these multinomial composition values can be sorted in some order to form a vector $V_x(S|k) = (p_x(a_1|k), p_x(a_2|k), \ldots, p_x(a_m|k))$ for the genome S, where m denotes the total number of patterns under consideration.
T69 9526-9679 Sentence denotes Moreover, the four vectors $V_A(S|k)$, $V_G(S|k)$, $V_C(S|k)$ and $V_T(S|k)$ are arranged in some order to form a vector $V(S|k)$ that represents the whole genome S.
T70 9680-9813 Sentence denotes The vector defined by all these multinomial composition values is referred to as the k-order multinomial composition vector ($k$-MCV).
T71 9814-9938 Sentence denotes For a single fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information.
T72 9939-10154 Sentence denotes The complete multinomial composition vector (CMCV) of the whole genome is the concatenation of $V(S|3), V(S|4), \ldots, V(S|k)$, denoted by CMCV(S, k), with the intention of using as much genomic information as possible.
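To give a feel for the size of these vectors, the sketch below enumerates the possible patterns for a given k and counts the entries of the k-MCV and the CMCV. It assumes that every count triple summing to k-1 is included as a pattern and that patterns are listed in lexicographic order; the text only says "some order", so both choices, like the function names, are assumptions.

```python
def patterns(k):
    """All count triples (n1, n2, n3) with n1 + n2 + n3 = k - 1,
    listed in lexicographic order (an assumed ordering)."""
    m = k - 1
    return [(a, b, m - a - b) for a in range(m + 1) for b in range(m - a + 1)]


def kmcv_length(k):
    """Entries in the k-MCV: one block of patterns per nucleotide A, C, G, T."""
    return 4 * len(patterns(k))


def cmcv_length(k_max, k_min=3):
    """Entries in CMCV(S, k_max), the concatenation for k = k_min..k_max."""
    return sum(kmcv_length(k) for k in range(k_min, k_max + 1))


print(patterns(3))        # [(0, 0, 2), (0, 1, 1), (0, 2, 0), (1, 0, 1), (1, 1, 0), (2, 0, 0)]
print(cmcv_length(7))     # 320
```

Under these assumptions the number of patterns grows quadratically in k (it equals k(k+1)/2 per nucleotide), so the vectors stay small for the range of k considered here.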
T73 10155-10268 Sentence denotes We begin with the largest fragments of available DNA sequences, the chromosomes of eukaryotes listed in Table 1.
T74 10269-10405 Sentence denotes The conditional multinomial distribution profile shown in Fig. 1 corresponds to three different chromosomes of Saccharomyces cerevisiae.
T75 10406-10588 Sentence denotes In the case of the inter-nucleotide distance sequence $\mathrm{CIN}_A$ ($k = 5$), we first convert the possible values of $(N_{C|A}, N_{G|A}, N_{T|A})$ into one-dimensional values in alphabetical order.
T76 10589-10726 Sentence denotes We then plot the measured conditional multinomial distribution as bars and the reference conditional multinomial distribution as a line.
T77 10727-10849 Sentence denotes We clearly see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of S. cerevisiae.
T78 10850-10900 Sentence denotes The three other cases are obtained in a similar way.
T79 10901-11042 Sentence denotes We see again the similarity between the conditional multinomial distribution profiles about a certain nucleotide for the various chromosomes.
T80 11043-11141 Sentence denotes If we repeat this experiment for the chromosomes of Caenorhabditis elegans we get the same result.
T81 11142-11349 Sentence denotes Again, when we plot the conditional multinomial distribution profile about a certain nucleotide we see a pattern of peaks and valleys which occur at the identical locations for all chromosomes of C. elegans.
T82 11350-11418 Sentence denotes We demonstrate this with three chromosomes of C. elegans in Fig. 2.
T83 11419-11721 Sentence denotes Again, while the pattern of peaks and valleys in the conditional multinomial distribution profile about a certain nucleotide is the same for all chromosomes of C. elegans, this pattern is distinctly different from the pattern of peaks and valleys in the S. cerevisiae profile about the same nucleotide.
T84 11722-11765 Sentence denotes Finally, we repeat the experiment for the mouse.
T85 11766-11797 Sentence denotes The result is shown in Fig. 3.
T86 11798-12091 Sentence denotes Once more we obtain a sequence of peaks and valleys in the conditional multinomial distribution profile about a certain nucleotide that is the same for all chromosomes of the mouse, and this pattern of peaks and valleys is different from the pattern in the S. cerevisiae and C. elegans profiles.
T87 12092-12236 Sentence denotes The complete multinomial composition vector of each complete genome provides a simple, easily computable signature that identifies each species.
T88 12237-12359 Sentence denotes The signature can be used in applications where evolutionary relationships need to be deduced from large genomic sequences.
T89 12360-12469 Sentence denotes Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment.
T90 12470-12603 Sentence denotes Phylogenetic trees are generated by passing the pairwise distance matrix to the UPGMA method in the PHYLIP package (Felsenstein, 1989).
T91 12604-12814 Sentence denotes The outbreak of atypical pneumonia in 2003, referred to as severe acute respiratory syndrome (SARS), drew more attention to the relationship between the SARS coronaviruses (SARS-CoVs) and the other coronaviruses.
T92 12815-12972 Sentence denotes The 24 complete coronavirus genomes used in this paper were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses.
T93 12973-13075 Sentence denotes The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2.
T94 13076-13158 Sentence denotes Generally, coronaviruses can be classified into three groups according to serotype.
T95 13159-13245 Sentence denotes Group I and group II contain mammalian viruses, whereas group III contains only avian viruses.
T96 13246-13332 Sentence denotes Many investigations have attempted to identify the phylogenetic position of SARS-CoVs.
T97 13333-13845 Sentence denotes However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any group and form a new group (Marra et al., 2003; Rota et al., 2003); a maximum likelihood tree built from a fragment of the spike protein favored clustering SARS-CoVs with group II (Liò and Goldman, 2004); and an information-based method, which makes use of whole genome sequences, indicated that SARS-CoVs are close to group I rather than forming a new group (Yang et al., 2005).
T98 13846-13974 Sentence denotes Based on the complete multinomial composition vector, we build the phylogenetic tree of the 24 coronaviruses listed in Table 2.
T99 13975-14138 Sentence denotes The phylogenetic tree is built using the UPGMA program in the PHYLIP package, and the distance matrix is computed using the Euclidean distance (Felsenstein, 1989).
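The distance step is simple enough to sketch directly. The example below computes pairwise Euclidean distances between composition vectors and writes them in the square-matrix layout that PHYLIP's distance programs read (a taxon count on the first line, then one row per taxon with the name padded to 10 characters). The genome names and vectors are toy placeholders, not data from the paper, and the function names are hypothetical.

```python
from math import dist  # Python 3.8+


def distance_matrix(vectors):
    """Pairwise Euclidean distances between equal-length vectors.

    `vectors` maps a genome name to its composition vector; names are
    returned in sorted order together with the square matrix.
    """
    names = sorted(vectors)
    matrix = [[dist(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in PHYLIP-style layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines)


# Toy vectors standing in for CMCVs (hypothetical values).
toy = {"g1": [0.0, 0.0], "g2": [3.0, 4.0], "g3": [0.0, 1.0]}
names, matrix = distance_matrix(toy)
print(to_phylip(names, matrix))
```

Because each genome is reduced to a fixed-length vector first, the cost of the distance matrix depends on the number of genomes and the vector length, not on the genome lengths.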
T100 14139-14374 Sentence denotes Our results based on analysis of the complete multinomial composition vector of 24 coronavirus genomes differ notably from the previous phylogenetic study using an information-based similarity index (Yang et al., 2005).
T101 14375-14554 Sentence denotes As can be seen from Fig. 4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV).
T102 14555-14654 Sentence denotes Our results also show that group II, BCoV, BCoVL, BCoVM, etc., are grouped in a monophyletic clade.
T103 14655-14839 Sentence denotes This result is also mainly in accordance with the conclusions from the alignment-based method (Marra et al., 2003; Rota et al., 2003) and the alignment-free method (Liu et al., 2007).
T104 14840-14930 Sentence denotes Moreover, the Robinson-Foulds distance between our tree and that of Liu et al. (2007) is only 26.
T105 14931-15040 Sentence denotes The selection of k in CMCV(S, k) is very important for capturing rich evolutionary information from the DNA sequence.
T106 15041-15118 Sentence denotes In the case of $k = 1$, there is no nucleotide between two nearest occurrences of the same nucleotide.
T107 15119-15201 Sentence denotes In the case of $k = 2$, there is only one nucleotide between two nearest occurrences of the same nucleotide.
T108 15202-15277 Sentence denotes Therefore, in both cases the measured and reference distributions coincide, and the multinomial composition value of every pattern $a$ is zero.
T109 15278-15341 Sentence denotes The CMCV does not contain these multinomial composition values.
T110 15342-15433 Sentence denotes Certainly, a large value of k will give a vector containing finer evolutionary information.
T111 15434-15540 Sentence denotes However, many patterns will not occur in the conditional multinomial distribution with a large value of k.
T112 15541-15667 Sentence denotes From the view of information theory, some information may be lost and noise will dominate if a large value of k is considered.
T113 15668-15845 Sentence denotes To determine the upper bound of the value of k, we introduce a scoring scheme to estimate how important a conditional multinomial distribution is. $\chi^2$-test scoring scheme:
T114 15846-16012 Sentence denotes For a fixed k, let $a$ be a pattern in the conditional multinomial distribution, with multinomial composition value $p(a,i|k)$ in genome $i$ (which can be found in the $k$-MCV).
T115 16013-16247 Sentence denotes Define the expected multinomial composition value for pattern $a$, denoted $E[p(a|k)]$, to be the average of its composition values across all whole genomes, i.e. $E[p(a|k)] = \frac{1}{n} \sum_{i=1}^{n} p(a,i|k)$, assuming $n$ genomes in the dataset.
T116 16248-16382 Sentence denotes The standard $\chi^2$-test measures the deviation of a set of values from its expected value by summing up the deviations of each element.
T117 16383-16448 Sentence denotes Clearly, the higher value it has, the more valuable pattern a is.
T118 16449-16539 Sentence denotes Thus, we may define a score for the conditional multinomial distribution with a fixed k as
T119 16540-16639 Sentence denotes where the first sum is for all patterns of the conditional multinomial distribution with a fixed k.
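One plausible reading of this score, written out as a sketch, is the usual $\chi^2$-style double sum over patterns and genomes. The exact normalisation is an assumption made here for illustration and should be checked against the original article.

```latex
\mathrm{Score}(k) \;=\; \sum_{a} \; \sum_{i=1}^{n}
  \frac{\bigl(p(a,i|k) - E[p(a|k)]\bigr)^{2}}{E[p(a|k)]},
\qquad
E[p(a|k)] = \frac{1}{n} \sum_{i=1}^{n} p(a,i|k).
```

Here the outer sum runs over all patterns $a$ of the conditional multinomial distribution with the fixed $k$, and the inner sum runs over the $n$ genomes in the dataset.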
T120 16640-16865 Sentence denotes We believe by considerably extending the basic pattern counting idea and thus studying their underlying distribution, we are able to discover unusual patterns to automatically distinguish their roles in shaping the evolution.
T121 16866-17080 Sentence denotes In this case, the $k$-MCV of the conditional multinomial distribution with the largest score might be considered the most representative for the species, rather than an abnormal outlier in a purely statistical sense.
T122 17081-17251 Sentence denotes We list the score of the conditional multinomial distribution with a fixed k (within the range [3, 9]) from the dataset of 24 complete coronavirus genomes in Table 3.
T123 17252-17333 Sentence denotes The score of the CMCV can be defined as the sum of the scores of the $k$-MCVs involved in the CMCV.
T124 17334-17435 Sentence denotes From Table 3, it is clear that there is no large difference after the 7-MCV is added to the CMCV.
T125 17436-17624 Sentence denotes Moreover, we can define the relative ratio of information carried by the conditional multinomial distribution with a fixed k as the ratio of the score of the $k$-MCV to the score of the CMCV that includes the $k$-MCV.
T126 17625-17718 Sentence denotes From Table 3, we can clearly see that the relative ratio of the 7-MCV is the maximum, 839/1408.
T127 17719-17830 Sentence denotes Therefore, we select CMCV (S,7) to represent the genome S in the phylogenetic analysis of the 24 coronaviruses.
T128 17831-17922 Sentence denotes Description and comparison of DNA sequences are still important subjects in bioinformatics.
T129 17923-18280 Sentence denotes DNA sequence databases have accumulated much data on biological evolution over billions of years; consequently, novel concepts and methods are urgently needed to reveal the biological functions encoded in DNA sequence information and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of illness.
T130 18281-18450 Sentence denotes In this paper, we propose four conditional multinomial distributions, one about each nucleotide, for a complete genome sequence based on the inter-nucleotide distance sequences.
T131 18451-18758 Sentence denotes From the conditional multinomial distribution profiles about nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species.
T132 18759-18901 Sentence denotes Therefore, it is straightforward to generate a phylogenetic tree based on the Euclidean distances of complete multinomial composition vectors.
T133 18902-19045 Sentence denotes In order to test the validity of our method, we select the complete genome sequences of 24 coronaviruses which were used by Liu et al. (2007).
T134 19046-19135 Sentence denotes The phylogenetic tree can be obtained from the distance matrices using the UPGMA method.
T135 19136-19291 Sentence denotes Fig. 4 is the phylogenetic tree of the 24 genome sequences based on the distance matrix of the complete multinomial composition vector, using the UPGMA method.
T136 19292-19383 Sentence denotes We find that the tree is mainly consistent with the tree constructed by Liu et al. (2007).
T137 19384-19480 Sentence denotes Fig. 4 also indicates that SARS-CoVs are not closely related to any groups and form a new group.
T138 19481-19636 Sentence denotes Overall our results highlight that the conditional multinomial distribution profiles have the ability to extract more information from the genome sequence.
T139 19637-19813 Sentence denotes Thus, this insight can be used to guide the development of more powerful measures for sequence comparison, with possible future improvements that account for the correlation structure of DNA.