PMC:1630425 / 3627-6839
Annotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/1630425","sourcedb":"PMC","sourceid":"1630425","source_url":"https://www.ncbi.nlm.nih.gov/pmc/1630425","text":"Similarity overestimation\nThe CGR/USM representation of sequences offers fundamental advantages, related with its scale-independency, that make it particularly suitable to investigate the entropy distributions in nucleotide sequences [15]. That study in particular played a significant role in motivating the density kernel development reported here. It was then observed that using symmetric kernels in the Parzen window method, such as the Gaussian distribution function, to represent density of sequence patterns in iterative maps would be affected by some loss of resolution caused by overlap of memory lengths, e.g. different lengths of the sequence pattern being given the same weight because they were at the same Euclidean distance to an arbitrary position in the map. The artifactual loss of resolution can be graphically understood by noting that the projections of two sequence units can be very close to each other in the sequence map for two reasons, only one of them being directly proportional to sequence similarity described in Figure 1. The other, confounding, possibility is that place two units of distinct sequences are placed at close quarters in the sequence map because they happen to be at opposite ends of adjacent quadrants. This rare but unavoidable occurrence causes a bias in previously proposed distance metrics, including our own [3].\nFigure 1 Illustration of the unidirectional USM procedure for the sequence, \"ACTGCCC\". For a nucleotide sequence, it consists of two iterative CGR operations in each direction, forward and reverse. The circled symbol indicates the first position iterated – see text for discussion on determination of seeding position. Each subsequent position is calculated moving half the distance to the edge with the corresponding unit. 
As shown in [3] the density of points in the unitary square is a generalization of Markov transition matrices. The distribution bias caused by the edge effects can be addressed in two different routes. On the one hand it can be modeled and discounted in the final results, as we have done in previous work [14]. Specifically, see Figure 3 of that report for a representation of the (biased) null distribution obtained for different sized alphabets. The alternative solution, which we have also pursued [6] is to identify a Boolean implementation of Universal Sequence Maps, designated as bUSM, which removes the source of distance overestimation at each of the of the scales accommodated by the numerical resolution of the computing environment being used. That report also offers a detailed algebraic description of the causes for the similarity over-estimation for metrics based maximum distances at any dimension (derived from equation 6 in [3]). Neither of those two solutions described, however, helps representing the density distribution of individual sequences such that the sequences themselves can be compared without having to return to the pair-wise distances between their units. 
The fundamental attraction of such a solution, which we only partially succeeded in [15] using Gaussian Parzen kernels, would be that it captures the fundamental characteristics of the sequence, such as its information content.","divisions":[{"label":"title","span":{"begin":0,"end":25}},{"label":"p","span":{"begin":26,"end":1366}},{"label":"figure","span":{"begin":1367,"end":1902}},{"label":"label","span":{"begin":1367,"end":1375}},{"label":"caption","span":{"begin":1377,"end":1902}},{"label":"p","span":{"begin":1377,"end":1902}}],"tracks":[{"project":"2_test","denotations":[{"id":"17049089-15501469-1690121","span":{"begin":235,"end":237},"obj":"15501469"},{"id":"17049089-11331237-1690122","span":{"begin":1363,"end":1364},"obj":"11331237"},{"id":"17049089-11895567-1690123","span":{"begin":2099,"end":2101},"obj":"11895567"},{"id":"17049089-12387731-1690124","span":{"begin":2295,"end":2296},"obj":"12387731"},{"id":"17049089-11331237-1690125","span":{"begin":2737,"end":2738},"obj":"11331237"},{"id":"17049089-15501469-1690126","span":{"begin":3070,"end":3072},"obj":"15501469"}],"attributes":[{"subj":"17049089-15501469-1690121","pred":"source","obj":"2_test"},{"subj":"17049089-11331237-1690122","pred":"source","obj":"2_test"},{"subj":"17049089-11895567-1690123","pred":"source","obj":"2_test"},{"subj":"17049089-12387731-1690124","pred":"source","obj":"2_test"},{"subj":"17049089-11331237-1690125","pred":"source","obj":"2_test"},{"subj":"17049089-15501469-1690126","pred":"source","obj":"2_test"}]}],"config":{"attribute types":[{"pred":"source","value type":"selection","values":[{"id":"2_test","color":"#93eaec","default":true}]}]}}