Supplementary file of Mendizabal-Ruiz, et al. Genetic Signal Processing for DNA Sequence Clustering
Abstract
To test the methodology presented in our paper, we clustered sequences from several datasets previously used in the literature (Hoang et al., 2015). These comprise five genome datasets: mammalian mitochondria, influenza A virus, human rhinovirus, coronavirus, and bacteria. The clustering results are shown below.
For the 31 mammalian mitochondrial genomes (dataset A), a K7 clustering (seven clusters) separated the mammalian families as follows: C1, Ursidae; C2, Bovidae and Cetacea; C3, Cercopithecidae; C4, Hominidae; C6, Canidae; and C7, Felidae. C5 contains Rodentia and Lagomorpha together with the Erinaceidae outgroup. This results from the early divergence among the carnivores, which splits them into three tight families and leaves a single cluster for the three most distant groups.
The second dataset (B) evaluates the neuraminidase gene, encoded in the sixth segment of the influenza A genome, which is one of the two surface genes that identify the serotypes responsible for virulence and pathogenicity. The K5 clustering resulted in a single swap between an H5N1 strain and an H1N1 strain, placed in C3 and C4, respectively. This result is similar to that of the k-mer clustering method tested by Hoang et al. (2015).
The human rhinovirus dataset (C) did not cluster according to the four genome groups analyzed: HRV-A, HRV-B, HRV-C, and HEV-C as the outgroup. Instead, the K4 clustering divided HRV-A into three groups, with C1 sharing both furthest branches of HRV-A, HRV-C and HEV-C, and C3 containing all of HRV-B (Palmenberg et al., 2009). The great diversity of HRV-A produced compact clusters that lie far enough apart that the loosest groups were clustered with one another.
The coronavirus complete-genome dataset (D) yielded an accurate K5 clustering, with the sole exception of SARS strain ZJ01. Most viruses clustered according to the species they affect. Murine hepatitis virus and bovine coronavirus, two groups that are typically placed together, clustered separately in C2 and C3, respectively, and C1 contained most of the human coronaviruses.
Finally, we analyzed whole bacterial genomes from eight families (dataset E). The most interesting result is that the two E. coli strains were segregated into separate clusters, C5 and C7, apart from the Shigella and Yersinia genomes, which clustered together in C8. The two Bacilli genera also clustered separately, in C1 and C6. The three remaining clusters each held two bacterial families, grouped according to their phylogenetic relatedness.
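Statements such as "a single swap between two strains" compare cluster assignments with known taxonomic labels. One way to make such a comparison explicit is a contingency table of clusters against labels. The following is only a minimal sketch and is not the procedure used in the paper: the function names `contingency` and `misplaced` are hypothetical, and the toy labels and cluster assignments are invented for illustration and do not reproduce any of the datasets above.

```python
from collections import Counter, defaultdict


def contingency(clusters, labels):
    """Count how many items of each known label fall into each cluster."""
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    table = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        table[cluster][label] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


def misplaced(clusters, labels):
    """Count items whose label differs from the majority label of their cluster."""
    table = contingency(clusters, labels)
    return sum(sum(c.values()) - max(c.values()) for c in table.values())


# Hypothetical toy data (NOT the paper's results): six sequences, two subtypes,
# with one H1N1 sequence and one H5N1 sequence exchanged between two clusters.
toy_labels = ["H1N1", "H1N1", "H1N1", "H5N1", "H5N1", "H5N1"]
toy_clusters = ["C4", "C4", "C3", "C3", "C3", "C4"]

if __name__ == "__main__":
    for cluster, counts in sorted(contingency(toy_clusters, toy_labels).items()):
        print(cluster, counts)
    print("misplaced:", misplaced(toy_clusters, toy_labels))
```

In this toy case a single swap shows up as two misplaced sequences, one in each of the two clusters involved.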