> top > docs > PMC:4527362 > spans > 0-87

PMC:4527362 / 0-87 JSONTXT

An investigation into inter- and intragenomic variations of graphic genomic signatures Abstract Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. Background Alongside DNA barcoding, [1] and Klee diagrams [2], Chaos Game Representation (CGR) patterns of genomic segments have been proposed as another method for the classification and identification of genomic sequences [3, 4]. The concept of genomic signature was first introduced in [5], as being any specific quantitative characteristic of a DNA genomic sequence that is pervasive along the genome of the same organism, while being dissimilar for DNA sequences originating from different organisms. Initial studies [3, 6] suggesting that short fragments of genomic sequences retain most of the characteristics of the genome of origin indicated that such genomic signatures exist. In particular, the Chaos Game Representation (CGR) of a DNA sequence, a graphic representation of its sequence composition, was proposed in [3] as having both the pervasiveness and differentiability properties necessary for it to qualify as a genomic signature. Indeed, CGRs of genomic DNA sequences have been shown to be genome- and species-specific, see, e.g., [3, 4, 6–12]. Note that CGR patterns of mtDNA sequences can be different from those of DNA sequences from the major genome of the same organism, and that large scale quantitative analyses, at all taxonomic levels, of the hypothesis that CGR can play the role of a genomic signature for genomic sequences have not, to our knowledge, been performed. The long term objective of this research is to find out whether CGR can play the role of genomic signature for genomic DNA sequences, and can be used to identify and classify genomic sequences at all taxonomic levels. To this end, the objective of this study is to quantitatively assess the usability of CGR for classification of genomic sequences at the kingdom level, as well as to assess various distances that can be used to compare CGRs of genomic sequences for this purpose. We first analyze 508 fragments, 150 kbp (kilo base pairs) long, spanning single complete chromosomes of six organisms, each representing a different kingdom: chromosome 21 of Homo sapiens, chromosome 4 of Saccharomyces cerevisiae, chromosome 1 of Arabidopsis thaliana, chromosome 14 of Plasmodium falciparum, the genome of Escherichia coli, and the genome of Pyrococcus furiosus, for a total length of 76,200 kbp analyzed. We analyze the intergenomic and intragenomic variation of CGR genomic signatures of these sequences by using six different distances: Structural Dissimilarity Index (DSSIM) [13], Euclidean distance, Pearson correlation distance [14], Manhattan distance [15], approximated information distance [16], and a distance defined here, based on an idea from computer vision, called descriptor distance. For each of the six distances, we visualize the results by computing Molecular Distance Maps, [12], which represent sequences as points in a two-dimensional or three-dimensional space, and thus display all their interrelationships simultaneously. The resulting Molecular Distance Maps show a good clustering, with genomic sequences originating from the same genome being largely grouped together, and separated from sequences belonging to genomes of different organisms. We observe that, in some of the cases where the clustering is suboptimal, the computation of three-dimensional Molecular Distance Maps resolves what appeared to be cluster overlaps in the two-dimensional Molecular Distance Maps. Using the “ground-truth” that sequences from the same genomes should have similar structural characteristics and thus be grouped together, while those from genomes of different organisms should be separated, we assess the six distances by combining three different quality measures: correlation to an idealized cluster distance, silhouette accuracy, and histogram overlap. We conclude that, for this dataset, DSSIM and the descriptor distance perform best according to these measures. To maximize the diversity within each species, we also analyze a set of 526 fragments, 150 kbp long, sampled from the entire genomes of the aforementioned six organisms, for a total length of 78,900 kbp analyzed. The resulting Molecular Distance Maps are very similar to the ones in the first experiment, and the distance ranking is also the same, confirming the preceding results. Lastly, we provide some preliminary evidence of this method’s applicability to classifying genomic DNA sequences at lower taxonomic levels by comparing 240 genomic sequences, 150 kbp long, sampled from the entire genome of Homo sapiens (class Mammalia, order Primates) with 210 genomic sequences, 150 kbp long, sampled from the entire genome of Mus musculus (mouse, class Mammalia, order Rodentia) for an additional length of 67,500 kbp analyzed. While a clear separation of sequences by genome is indeed achieved, we observe that the distance ranking is quite different compared to the previous two experiments, indicating that different distances may have to be used for comparing genomic sequences at different taxonomic levels. Note that early analyses of genomic sequences with regard to similarities in the relative abundances of oligonucleotides of lengths k=1,…,6 exists and include [17–25]. Also, several alignment-free methods that use fixed-length word frequencies have been used for phylogenomic analysis of DNA sequences, [26–28]. These methods include statistical studies of word frequency within a DNA sequence [5, 29–34], or employ k-words and the Markov model to obtain information about DNA sequences [35–39]. Iterated map methods for DNA sequence comparison include CGR-based analyses, see [3, 40–46], and such alignment-free methods have been successfully applied for sequence comparison [4, 11, 12, 47–53]. The initial reports on CGRs of genomic sequences [3, 6] contained mostly qualitative assessments of CGR patterns of whole genes. In [54], several comparisons of eukaryotic genomic sequences, including within-species comparisons, were reported, using di-, tri-, and tetranucleotide relative abundance distance (k=2,3,4). In [25] di- and tetranucleotide abundance profiles (k=2,k=4) were compared for genomic collections from genomes of 5 gram-negative proteobacteria (including 2 complete genomes), 3 gram-positive bacteria, 2 mycoplasmas (complete genomes), 2 cyanobacteria (1 complete genome), and 3 thermophilic archaea (1 complete genome), using the δ ∗ distance which computes the average absolute difference of the dinucleotide relative abundance values. In [4], several datasets of up to 36 genomic DNA sequences were analyzed, and in [9] some various-length sequences were analyzed based on computing Euclidean distances between frequencies of their k-mers, for k=1,…,8. Subsequently, [10] computed the Euclidean distance between frequencies of k-mers (k≤5) for the analysis of 125 GenBank DNA sequences from 20 bird species and the American alligator. In [47], 27 microbial genomes were analyzed to find implications of 4-mer frequencies (k=4) on their evolutionary relationships. In [16], 20 mammalian complete mtDNA sequences were analyzed using the “similarity metric”, for k=7. In [50] a multigene dataset of 33 genes for 9 bacteria and one archaea species, as well as the whole genomes of a set of 16 γ-proteobacteria were analyzed, using values of k between 1 and 10, and Euclidean and χ 2 distances. In [11] a collection of 26 complete mitochondrial genomes was analyzed, using the Euclidean distance and an “image distance”, with a value of k=10. In [55] a megabase-scale phylogenomic analysis of the Reptilia was reported, that compared frequency distributions of 8-mer oligonucleotides (k=8) using Euclidean distance. Another study, [56], analyzed 459 bacteriophage genomes and compared them with their host genomes to infer host-phage relationships, by computing Euclidean distances between frequencies of k-mers for k=4. In [57], 75 complete HIV genome sequences were compared using the Euclidean distance between frequencies of 6-mers (k=6), in order to group them in subtypes. In [58] several datasets were analyzed (109 complete genomes of prokaryotes and eukaryotes, 34 prokaryote and chloroplast genomes, mitochondrial genomes of 64 vertebrates, and 62 complete genomes of alpha proteobacteria) using values of k=5,6 for protein-coding genes and k=11,12 for whole genomes, with two distances: chord distance and piecewise distance. In [12] a dataset of 3,176 complete mtDNA sequences was analyzed using an image distance, DSSIM, and a value of k=9, and several Molecular Distance Maps were obtained which displayed sequences’ interrelationships at several taxonomic levels (phylum Vertebrata, kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates). The main contributions of this paper are: We tested and confirmed for an extensive dataset, of a total length of approximately 174Mbp, the hypothesis that CGR images of genomic DNA sequences can play the role of a (graphic) genomic signature, meaning that they have a desirable genome- and species-specificity. The dataset comprised 150 kbp fragments taken from genomes of six organisms, one from each of the six kingdoms of life. This was augmented by a set of 150 kbp fragments randomly sampled from all chromosomes of M. musculus, as a test-case of this method’s applicability at lower taxonomic levels.We assessed the performance of six different distances in this context, and this analysis included both same-genome and different-genome DNA fragment pairs. For several of these distances, the intragenomic values were overall smaller than intergenomic values, suggesting that this method could separate DNA genomic fragments belonging to different genomes, based on their CGRs.We showed that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In particular, we determined that the DSSIM distance and the descriptor distance, adapted from computer vision for this application, were best able to differentiate sequences originating from different genomes at the kingdom level. Both these distances essentially compare the k-mer composition of DNA sequences (herein k=9).Based on preliminary data, we suggested the use of three-dimensional Molecular Distance Maps for improved visualization of the simultaneous interrelationships within a given set of genomic sequences. Further analysis is needed to explore this method’s potential to differentiate genomic sequences originating from closely related species (e.g. within the same order). Additional refinements of the distances considered may have to be defined for optimal genomic DNA sequence identification and classification at very low taxonomic levels. Methods In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method, and conclude with a description of the six distances that we considered. Dataset We used the complete genomes from six organisms, each representing one of the six kingdoms of life. For the first experiment, we used one complete chromosome from each genome, see Table 1. For additional information about the dataset see [59], Appendix B. Table 1 Dataset for the first experiment: NCBI accession numbers of the complete chromosomes considered, in increasing order of their NCBI accession number In order to have relatively comparable numbers of DNA sequences for each organism, we chose the longest chromosomes for all organisms except H. sapiens, for which the shortest chromosome was chosen. The DNA sequences in the NCBI database are represented as strings of letters “A”, “C”, “G”, “T”, and “N” which represent the four nucleotides Adenine, Cytosine, Guanine, Thymine, and “unidentified Nucleotide”, respectively. For our analysis we ignored all letters “N”. In S. cerevisiae and E. coli there were no ignored letters, and in P. falciparum and P. furiosus the number of ignored letters is of the order of 0.001 % of the length of the sequence. In H. sapiens this number is 27 %, and in A. thaliana is 0.54 %. In H. sapiens, in particular, 96.4 % of these ignored letters exist in centromeric and telomeric regions of the chromosome. The resulting genomic DNA sequences were divided into successive, non-overlapping, contiguous fragments, each 150 kbp long. When the last sequence was shorter than 150 kbp, it was not included in the analysis. This resulted in 234 fragments for H. sapiens, 30 fragments for E. coli, 10 fragments for S. cerevisiae, 201 fragments for A. thaliana, 21 fragments for P. falciparum, and 12 fragments for P. furiosus, for a total of 508 DNA fragments, see Table 2. Table 2 The first experiment: Organisms considered, total length of the chromosome (respectively genome), number of ignored letters “N”, and number of DNA fragments (sequences) obtained by splitting a single complete chromosome per organism into consecutive, non-overlapping, equal length (150 kbp) contiguous fragments To maximize the diversity within each species, the dataset of the second experiment comprised fragments randomly sampled from each chromosome of the six chosen organisms, as follows. After deleting all “N” nucleotides, each chromosome was divided into successive, non-overlapping, contiguous fragments, each 150 kbp long. When the last fragment was shorter than 150 kbp, it was not included in the analysis. Next, for each chromosome we selected randomly 10 such fragments to represent the chromosome, see [59], Appendix B. In the cases where there were fewer than 10 fragments in a chromosome, all of them were considered. In the cases of E. coli and P. furiosus, we retained all complete fragments of the genome. This resulted in 240 fragments for H. sapiens, 30 fragments for E. coli, 73 fragments for S. cerevisiae, 50 fragments for A. thaliana, 121 fragments for P. falciparum, and 12 fragments for P. furiosus, for a total of 526 fragments. Overview The method we used to analyze and classify genomic sequences has three steps: (i) generate graphical representations (images) of each DNA sequence using Chaos Game Representation (CGR), (ii) compute all pairwise distances between these images, and (iii) visualize the interrelationships implied by these distances as two- or three-dimensional maps, using Multi-Dimensional Scaling (MDS). CGR is a method introduced by Jeffrey [3] in 1990 and studied in, e.g., [3, 6, 7, 11, 60–63] as a way to visualize the structure of a DNA sequence, This method associates an image to each DNA sequence as follows. Starting from a unit square with corners labelled A, C, G, and T, and the center of the square as the starting point, the image is obtained by successively plotting each nucleotide as the middle point between the current point and the corner labelled by the nucleotide to be plotted. If the generated square image has a size of 2k×2k pixels, then every pixel represents a distinct k-mer: A pixel is black if the k-mer it represents occurs in the DNA sequence, otherwise it is white. CGR images of genetic DNA sequences originating from various species show patterns such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns, Fig. 1. Fig. 1 29×29 CGR images of 150 kbp genomic DNA sequences from H. sapiens, E. coli, S. cerevisiae, A. thaliana, P. falciparum, and P. furiosus For step (i), a slight modification of the original CGR was used, introduced by Deschavanne [4]: a k-th order FCGR (frequency CGR) is a 2k×2k matrix that can be constructed by dividing the CGR plot into a 2k×2k grid, and defining the element a ij as the number of points that are situated in the corresponding grid square. A second order FCGR is shown below, where N w is the number of occurrences of the oligonucleotide w in the sequence s: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} FCGR_{2}(s) = \left(\begin{array}{cccc} N_{CC} & N_{GC} & N_{CG} & N_{GG} \\ N_{AC} & N_{TC} & N_{AG} & N_{TG} \\ N_{CA} & N_{GA} & N_{CT} & N_{GT} \\ N_{AA} & N_{TA} & N_{AT} & N_{TT} \\ \end{array} \right). \end{array} $$ \end{document}FCGR2(s)=NCCNGCNCGNGGNACNTCNAGNTGNCANGANCTNGTNAANTANATNTT. The (k+1)-th order F C G R k+1(s) can be obtained by replacing each element N X in F C G R k(s) with four elements \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \left(\begin{array}{cc} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \\ \end{array} \right) \end{array} $$ \end{document}NCXNGXNAXNTX where X is a sequence of length k over the alphabet {A, C, G, T}. For step (ii), after computing the FCGR matrices for each of the 150 kbp sequences in a given dataset, the goal was to measure “distances” between every two CGR images. There are many distances that can be defined and used for this purpose, [64]. One of the goals of this study was to identify what distance is better able to differentiate the structural differences of various genomic DNA sequences and classify them based on the species they belong to. In this paper we use six different distances: Structural Dissimilarity Index (DSSIM), descriptor distance (adapted from computer vision for this application), Euclidean distance, Manhattan distance, Pearson correlation distance, and approximated information distance. For step (iii), after computing all possible pairwise distances we obtained six different distance matrices. To visualize the inter-relationships between sequences implied by each of the distance matrices, and to thus visually assess each of the distances, we used Multi-Dimensional Scaling (MDS). MDS is an information visualization technique introduced by Kruskal in [65]. MDS takes as input a distance matrix that contains the pairwise distances among a set of items (here the items are the 150 kpb DNA sequences analyzed). The output of MDS is a spatial representation of the items in a common Euclidean space, wherein each item is represented as a point and the spatial distance between any two points corresponds to the distance between the items in the distance matrix. Objects with a small pairwise distance will result in points that are close to each other, while objects with a large pairwise distance will become points that are far apart. The combination of CGR/DSSIM/MDS was first proposed in [66], [12] as a tool to quantitatively measure and display the interrelationships among a set of complete mitochondrial sequences. The outputs of this method, called Molecular Distance Maps, are two-dimensional maps wherein each point represents a mitochondrial genome, and the spatial distances between any two points correspond to the differences between the structural composition of the corresponding DNA sequences. The ideal Molecular Distance Map is a placement of n items as points in an (n−1)-dimensional space. The two-dimensional Molecular Distance Map is simply an approximation, a flattening of this highly-dimensional space onto the plane, which may sometimes result in erroneous positioning of some points. Increasing the dimensionality of the Molecular Distance Map often results in a more accurate representation of the real interrelationships between sequences, as embodied in the original distance matrix. Distances In this section we describe and formally define each of the six distances used in our analysis: DSSIM, descriptor distance (adapted from computer vision for this application), Euclidean, Manhattan, Pearson, and approximated information distance. Structural Similarity Index, SSIM, was introduced in [13] for the purpose of assessing the degree of similarity between two images. Given two images X, Y as n×n matrices having as elements integers ranging in the interval [0,L], SSIM computes three factors (luminance, contrast and structure) and combines them to obtain a similarity value. However, instead of computing a global similarity between the two images, each image is divided into 11×11 sliding square windows X ij(Y ijrespectively) with i,j=1,⋯,n−10 which move pixel by pixel to eventually cover the entire image. The SSIM similarity of any given pair of images is then computed by comparing their corresponding square windows. In addition, an 11×11 circular symmetric Gaussian weighting function \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$W \in \mathbb {R}^{11 \times 11}$\end{document}W∈ℝ11×11 with a fixed standard deviation of 1.5, normalized to unit sum (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\sum _{p=1}^{11}\sum _{q=1}^{11} W_{\textit {pq}}=1$\end{document}∑p=111∑q=111Wpq=1), is used. Then, the mean μ x,i,j (μ y,i,j for Y), variance σ x,i,j (σ y,i,j for Y) and correlation σ xy,i,j are computed, as follows: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \mu_{x,i,j}=\sum\limits_{p=1}^{11} \sum\limits_{q=1}^{11}W_{pq}X^{ij}_{pq} \end{array} $$ \end{document}μx,i,j=∑p=111∑q=111WpqXpqij \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \sigma_{x,i,j}=\sqrt{\sum\limits_{p=1}^{11} \sum\limits_{q=1}^{11}W_{pq}(X^{ij}_{pq}-\mu_{x,i,j})^{2}} \end{array} $$ \end{document}σx,i,j=∑p=111∑q=111Wpq(Xpqij−μx,i,j)2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \sigma_{xy,i,j}=\sum\limits_{p=1}^{11} \sum\limits_{q=1}^{11} W_{pq}(X^{ij}_{pq}-\mu_{x,i,j})(Y^{ij}_{pq}-\mu_{y,i,j}) \end{array} $$ \end{document}σxy,i,j=∑p=111∑q=111Wpq(Xpqij−μx,i,j)(Ypqij−μy,i,j) where A pq denotes the (p,q) element of the matrix A. Based on these values, the luminance l(X ij,Y ij), contrast c(X ij,Y ij) and structure s(X ij,Y ij) are computed as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} l(X^{ij},Y^{ij})&=&\frac{2\mu_{x,i,j}\mu_{y,i,j}+C_{1}}{\mu_{x,i,j}^{2}+\mu_{y,i,j}^{2}+C_{1}}\\ c(X^{ij},Y^{ij})&=&\frac{2\sigma_{x,i,j}\sigma_{y,i,j}+C_{2}}{\sigma_{x,i,j}^{2}+\sigma_{y,i,j}^{2}+C_{2}}\\ s(X^{ij},Y^{ij})&=&\frac{\sigma_{xy,i,j}+C_{3}}{\sigma_{x,i,j}\sigma_{y,i,j}+C_{3}}\\ \end{array} $$ \end{document}l(Xij,Yij)=2μx,i,jμy,i,j+C1μx,i,j2+μy,i,j2+C1c(Xij,Yij)=2σx,i,jσy,i,j+C2σx,i,j2+σy,i,j2+C2s(Xij,Yij)=σxy,i,j+C3σx,i,jσy,i,j+C3 where C 1=(0.01)2, C 2=(0.03)2, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$C_{3}=\frac {C_{2}}{2}$\end{document}C3=C22. Then, these three factors are combined to get \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} SSIM(X^{ij},Y^{ij})=l(X^{ij},Y^{ij}) c(X^{ij},Y^{ij})s(X^{ij},Y^{ij}) \end{array} $$ \end{document}SSIM(Xij,Yij)=l(Xij,Yij)c(Xij,Yij)s(Xij,Yij) and finally, the SSIM index used to evaluate the overall image similarity is computed as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} SSIM(X,Y)=\frac{1}{(n-10)^{2}}\sum\limits_{i=1}^{n-10}\sum\limits_{j=1}^{n-10} SSIM(X^{ij},Y^{ij}). \end{array} $$ \end{document}SSIM(X,Y)=1(n−10)2∑i=1n−10∑j=1n−10SSIM(Xij,Yij). In theory, the values for SSIM range in the interval [−1,1] with the similarity being 1 between two identical images, 0, for example, between a black image and a white image, and −1 if the two images are negatively correlated; that is, SSIM (X,Y)=−1 if and only if X and Y have the same luminance μ and every pixel x i of image X has the inverted value of the corresponding pixel y i=2μ−x i in Y. To compute the distance rather than the similarity between two images, we calculate DSSIM (X,Y) = 1-SSIM (X,Y). Consequently, the range of DSSIM is the interval [0,2]: two identical images will result in a DSSIM distance of 0, while two images that are the negatives of each other would result in a DSSIM distance of 2. For defining the descriptor distance we adapted for this application the spatial pyramid matching approach of [67], which is used to calculate hierarchical image descriptors. The descriptor distance between two FCGRs \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$X,Y \in \mathbb {N}^{2^{k} \times 2^{k}}$\end{document}X,Y∈ℕ2k×2k aims to compare a combination of several different “descriptors”, that is, a combination of several different aspects, of the two given FCGRs. A descriptor is a vector characterized by parameters m and r, as well as r intervals, where m is the size of the non-overlapping windows in which the FCGR is divided (scale of the comparison), and the r intervals represent the “granularity” of the analysis, in that they define the intervals of numbers of k-mer occurrences that are considered significant. For a given m≤k and r, and intervals [a 0,a 1),[a 1,a 2),⋯,[a r−1,a r) such that \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\bigcup _{i=0}^{r-1} [a_{i},a_{i+1})=[0,\infty)$\end{document}⋃i=0r−1[ai,ai+1)=[0,∞) and [a i,a i+1)∩[a j,a j+1)=∅∀i,j with i≠j, a decriptor is constructed as follows. Starting from the top-left corner, we divide each of the two FCGR matrices X and Y into non-overlapping submatrices of size 2m×2m. This procedure results in 4k−m submatrices X ij and Y ij with i,j=1,⋯,2k−m, which will be pairwise compared. The choice of the r intervals, called “bins”, points to the fact that, rather than considering the finest granularity, we are interested in a coarser comparison. This means that, instead of a computationally expensive pairwise comparison of all possible numbers of occurrences of k-mers, we are interested only in certain “bins” of such numbers. For example, in our case, we use r=5 and consider only 5 different bins, that is only k-mers with number of occurrences: 0 (not occurring), 1 (one occurrence), 2 (two occurrences), between 2 and 5, between 5 and 20, and greater than 20 (most frequent). Formally, we use r=5 and [0,∞)=[0,1)∪[1,2)∪[2,5)∪[5,20)∪[20,∞) as the 5 bins. Afterwards, we compute for every X ij a vector vec\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$X_{\textit {ij}} = \frac {1}{(2^{m} \times 2^{m})} (b_{1}, b_{2}, \cdots, b_{r})$\end{document}Xij=1(2m×2m)(b1,b2,⋯,br) where b i=|{x∈X ij:a i−1≤x 0 \} \right|}{m}. \end{array} $$ \end{document}Aα={x∈V∣Sα(x)>0}m. Obviously, the silhouette cluster accuracy ranges in [0,1] with a high accuracy being desirable. For assessing the relative overlap of the histograms, consider any two clusters C i and C j with i≠j (for example, C 1 is the H. sapiens cluster and C 4 the A. thaliana cluster). We compare the two sets of intragenomic distances C i– C i and C j– C j with the set of intergenomic distances C i– C j. For a distance d α, we divide the range from min(d α) to max(d α) in this dataset into 100 bins of size \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r = \frac {\max (d_{\alpha }) - \min (d_{\alpha })}{100}$\end{document}r=max(dα)−min(dα)100 and count the distances which fall into this bin: c i,i[ℓ] denotes bin ℓ containing distances from C i– C i and c i,j[ℓ] denotes bin i containing distances from C i– C j. For ℓ=1,…,100 we let \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} c_{i',j'}[\ell] = &| \{ \{x,y\}\mid x\in C_{i'},y\in C_{j'} \text{ and } x \neq y \\ &\text{and } (\ell-1)\cdot r < d_{\alpha}(x,y) \le \ell\cdot r \} |. \end{array} $$ \end{document}ci′,j′[ℓ]=|{{x,y}∣x∈Ci′,y∈Cj′andx≠yand(ℓ−1)·r

Document structure show

projects that have annotations to this span

There is no project