> top > docs > PMC:1630425 > spans > 1282-9489

PMC:1630425 / 1282-9489 JSON TXT

Annnotations TAB JSON ListView MergeView

2_test

{"project":"2_test","denotations":[{"id":"17049089-2336393-1690110","span":{"begin":146,"end":147},"obj":"2336393"},{"id":"17049089-8506142-1690111","span":{"begin":353,"end":354},"obj":"8506142"},{"id":"17049089-11331237-1690112","span":{"begin":525,"end":526},"obj":"11331237"},{"id":"17049089-15653627-1690113","span":{"begin":865,"end":866},"obj":"15653627"},{"id":"17049089-10563018-1690113","span":{"begin":865,"end":866},"obj":"10563018"},{"id":"17049089-12387731-1690113","span":{"begin":865,"end":866},"obj":"12387731"},{"id":"17049089-10810083-1690113","span":{"begin":865,"end":866},"obj":"10810083"},{"id":"17049089-10827456-1690114","span":{"begin":1032,"end":1033},"obj":"10827456"},{"id":"17049089-12611807-1690115","span":{"begin":1182,"end":1183},"obj":"12611807"},{"id":"17049089-14734312-1690116","span":{"begin":1385,"end":1387},"obj":"14734312"},{"id":"17049089-9190805-1690117","span":{"begin":1499,"end":1501},"obj":"9190805"},{"id":"17049089-12831886-1690117","span":{"begin":1499,"end":1501},"obj":"12831886"},{"id":"17049089-15716010-1690117","span":{"begin":1499,"end":1501},"obj":"15716010"},{"id":"17049089-15716010-1690118","span":{"begin":1583,"end":1585},"obj":"15716010"},{"id":"17049089-11895567-1690119","span":{"begin":1870,"end":1872},"obj":"11895567"},{"id":"17049089-11331237-1690120","span":{"begin":1933,"end":1934},"obj":"11331237"},{"id":"17049089-15501469-1690121","span":{"begin":2580,"end":2582},"obj":"15501469"},{"id":"17049089-11331237-1690122","span":{"begin":3708,"end":3709},"obj":"11331237"},{"id":"17049089-11895567-1690123","span":{"begin":4444,"end":4446},"obj":"11895567"},{"id":"17049089-12387731-1690124","span":{"begin":4640,"end":4641},"obj":"12387731"},{"id":"17049089-11331237-1690125","span":{"begin":5082,"end":5083},"obj":"11331237"},{"id":"17049089-15501469-1690126","span":{"begin":5415,"end":5417},"obj":"15501469"},{"id":"17049089-11331237-1690127","span":{"begin":5646,"end":5647},"obj":"11331237"},{"id":"17049089-12611807-1690128","span":{"begin":5648,"end":5649},"obj":"12611807"},{"id":"17049089-11895567-1690129","span":{"begin":5650,"end":5652},"obj":"11895567"},{"id":"17049089-15501469-1690129","span":{"begin":5650,"end":5652},"obj":"15501469"},{"id":"17049089-14734312-1690129","span":{"begin":5650,"end":5652},"obj":"14734312"},{"id":"17049089-12537561-1690130","span":{"begin":6064,"end":6066},"obj":"12537561"},{"id":"17049089-14751999-1690130","span":{"begin":6064,"end":6066},"obj":"14751999"},{"id":"17049089-11895567-1690131","span":{"begin":6435,"end":6437},"obj":"11895567"},{"id":"17049089-8506142-1690132","span":{"begin":6772,"end":6773},"obj":"8506142"},{"id":"17049089-11331237-1690133","span":{"begin":6916,"end":6917},"obj":"11331237"},{"id":"17049089-15501469-1690134","span":{"begin":7180,"end":7182},"obj":"15501469"},{"id":"17049089-11895567-1690135","span":{"begin":7484,"end":7486},"obj":"11895567"}],"text":"Background\nThe use of iterative functions for scale independent representation of biological sequences was first proposed well over a decade ago [1]. Despite its earlier popularity, that original proposition, designated as Chaos Game Representation (CGR), was soon found objectionable on the grounds of equivalence to standard Markov transition tables [2]. We have subsequently examined that equivalence and have shown that, quite the contrary, it is the Markovian transition that is a special solution of the CGR procedure [3]. The reader is referred to that report for a brief revision of earlier work on iterative functions for representation of sequence succession. The equivalence between iterative maps and genomic signatures (more exactly that the latter comes as a special solution of the former) has also been noted its simpler, and faster implementation [4-7], and it has even lead to a number of web-based and stand alone applications, including a function, CHAOS, available in the popular bioinformatics library EMBOSS [8].\n\nWhy CGR?\nApproaching sequence analysis by analyzing the distribution of succession patterns, which is to say, of L-tuple (oligomer) frequencies [9], is advantageous when the sequence similarity is low because alignment algorithms cease to recognize common motifs that are inexactly conserved, as recently illustrated for the SCOP protein database [10]. Furthermore, oligomeric frequencies are a natural genomic signature for analysis of collections of isolates [11-13] where, again, the advantages of the CGR representation did not go unnoticed [13]. These observations argue for the value of having a neutral format, one that is scale and succession-independent, to represent Biological sequences. We have used CGR as the starting point to develop just such a general procedure, which we designated as Universal Sequence Maps, USM [14]. The USM procedure provides a bijective mapping (see also [3]) between any symbolic sequence and a unique position in the USM unit hypercube. Furthermore, the distances between map positions were found to be associated with sequence dissimilarity. Because the procedure itself is not dependent on the scale targeted by its analysis (length of motifs, Markov order or memory length, depending on the technique chosen) this is of both fundamental and practical relevance.\n\nSimilarity overestimation\nThe CGR/USM representation of sequences offers fundamental advantages, related with its scale-independency, that make it particularly suitable to investigate the entropy distributions in nucleotide sequences [15]. That study in particular played a significant role in motivating the density kernel development reported here. It was then observed that using symmetric kernels in the Parzen window method, such as the Gaussian distribution function, to represent density of sequence patterns in iterative maps would be affected by some loss of resolution caused by overlap of memory lengths, e.g. different lengths of the sequence pattern being given the same weight because they were at the same Euclidean distance to an arbitrary position in the map. The artifactual loss of resolution can be graphically understood by noting that the projections of two sequence units can be very close to each other in the sequence map for two reasons, only one of them being directly proportional to sequence similarity described in Figure 1. The other, confounding, possibility is that place two units of distinct sequences are placed at close quarters in the sequence map because they happen to be at opposite ends of adjacent quadrants. This rare but unavoidable occurrence causes a bias in previously proposed distance metrics, including our own [3].\nFigure 1 Illustration of the unidirectional USM procedure for the sequence, \"ACTGCCC\". For a nucleotide sequence, it consists of two iterative CGR operations in each direction, forward and reverse. The circled symbol indicates the first position iterated – see text for discussion on determination of seeding position. Each subsequent position is calculated moving half the distance to the edge with the corresponding unit. As shown in [3] the density of points in the unitary square is a generalization of Markov transition matrices. The distribution bias caused by the edge effects can be addressed in two different routes. On the one hand it can be modeled and discounted in the final results, as we have done in previous work [14]. Specifically, see Figure 3 of that report for a representation of the (biased) null distribution obtained for different sized alphabets. The alternative solution, which we have also pursued [6] is to identify a Boolean implementation of Universal Sequence Maps, designated as bUSM, which removes the source of distance overestimation at each of the of the scales accommodated by the numerical resolution of the computing environment being used. That report also offers a detailed algebraic description of the causes for the similarity over-estimation for metrics based maximum distances at any dimension (derived from equation 6 in [3]). Neither of those two solutions described, however, helps representing the density distribution of individual sequences such that the sequences themselves can be compared without having to return to the pair-wise distances between their units. The fundamental attraction of such a solution, which we only partially succeeded in [15] using Gaussian Parzen kernels, would be that it captures the fundamental characteristics of the sequence, such as its information content.\n\nTowards an accurate kernel density function\nAs shown in previous work discussed above [3,9,14-16], the fact that similar sequence distance is not equidistant (Euclidean) to the preceding position is a serious limitation to sequence comparison. On the other hand, it was also shown [17] that pursuing discriminant analysis using representations that are not constrained by predefined scales or succession orders, even when those scales are systematically screened such as in variable length Markov models [18-20], leads to more accurate models of sequences. The two results put together point to the need for a density kernel that resolves scale (succession order) such that predictive patterns can be investigated more efficiently in the iterative map representation.\nIn spite of the attractiveness of iterative functions in general, and the bidirectional USM implementation [14] in particular, for enabling the scale independent representation of motifs in biological sequences, its segmentation is still typically approached by considering quadrants that only correspond to Markovian transition. This usage indeed has no fundamental advantage over the better established use of fixed order transition matrices [2]. To go beyond that, the fractal nature [21] set by the consecutive scales that can be spanned by multi-order or fractal order segmentations [3,17] has to be accommodated by the density estimation procedure. As mentioned earlier, we have subsequently approached the investigation of the distributions of motifs of variable length using continuous kernels on the USM positions, such as the Gaussian kernel [15] with only partial success. The limitation of that approach, clearer in the investigation of local entropy, reflected the indetermination of sequence similarity between equidistant positions in the map, which had actually been anticipated, and mathematically modeled, by the original USM proposition [14]. In this report we solve the problem by identifying a kernel for density distribution in the USM space that matches the fractal succession of Markov transition orders. For ease of representation, the procedure will be illustrated for nucleotide sequences, which is also the scale for which unidirectional USM is equivalent to the CGR procedure. This achievement enables the computation of scale independent distribution of motifs in biological sequences which allows different scales to be combined in the same representation of density of motifs in the sequence. The critical advance is that it is no longer affected by the sequence composition itself or, which is the same thing, by the position in the iterative map."}