PMC:1592098 / 4510-16115 JSONTXT

Annnotations TAB JSON ListView MergeView

    2_test

    {"project":"2_test","denotations":[{"id":"16999860-15585529-1688525","span":{"begin":9224,"end":9226},"obj":"15585529"}],"text":"Results and discussion\nIn [11], we presented an algorithm for computing the quartet distance between trees of arbitrary degree. It runs in time O(n2d2) and space O(n2), where n is the number of leaves in each tree and d is the maximal degree found in either of the trees. In this paper, we present an improved algorithm running in time O(n + |V||V'| min{id, id'}) and space O(n + |V||V'|), where |V| and |V'| are the number of internal (non-leaf) nodes in the two input trees, and id and id' are the maximal degree of an internal node, when disregarding edges to leaves, in the two trees.\n\nTime analysis for different types of trees\nThe terms |V|, id, |V'| and id' are all clearly O(n), but on the other hand neither |V| and id nor |V'| and id' are independent. Intuitively, if there are a lot of internal nodes in a tree, they will not have a very large internal degree. We address in this section, how this dependency will affect the running time for different types on trees.\nThe worst theoretical running time of the algorithm for calculating the quartet distance presented above is O(n3). Consider a tree with an internal node of degree n2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabikdaYaaaaaa@2F13@, connected to n2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabikdaYaaaaaa@2F13@ internal nodes of degree three each connected to two leaves, see Fig. 2. Such a tree has n leaves, O(n) internal nodes and a maximal internal degree that is O(n). If the algorithm is run on two such trees, the running time will be O(n3). In d-ary trees (trees where all internal nodes have degree d) |V| = O (nd MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabdsgaKbaaaaa@2F72@), the time complexity of calculating the quartet distance will be O (n2d MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUnaaCaaaleqabaGaeGOmaidaaaGcbaGaemizaqgaaaaa@309B@).\nFigure 2 A worst case input tree for the algorithm. A tree with an internal node of degree n2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabikdaYaaaaaa@2F13@, connected to n2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabikdaYaaaaaa@2F13@ internal nodes of degree three each connected to two leaves. This tree has both a maximal degree of O(n) and at the same time O(n) inner nodes. The two cases above are somewhat extreme. The first case has a very large gap between the maximal and minimal degree of internal nodes, while the second has little or no gap. The theoretical performance of the algorithm on the two types of trees reflects this difference. Let dmin = min{minv dv, minv' dv'}, be the minimal degree of any internal node in either tree, then each tree has O(ndmin MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUbqaaiabdsgaKnaaBaaaleaaieGacqWFTbqBcqWFPbqAcqWFUbGBaeqaaaaaaaa@33C0@) internal nodes and the time complexity is O (n2dmin2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUnaaCaaaleqabaGaeGOmaidaaaGcbaGaemizaq2aa0baaSqaaGqaciab=1gaTjab=LgaPjab=5gaUbqaaiabikdaYaaaaaaaaa@35DC@ min{id, id'}). If min{id, id'} is O(dmin2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGKbazdaqhaaWcbaacbiGae8xBa0Mae8xAaKMae8NBa4gabaGaeGOmaidaaaaa@333E@) the time usage of calculating the quartet distance will be O (n2). In the following section we will do practical verification of the theoretical results in this section.\n\nExperimental running times\nThe graphs in Fig. 3 show the running time for comparing worst case trees (see Fig. 2), (d-ary trees and random trees. There are six types of (d-ary trees; binary, 6-ary, 15-ary, and 30-ary and two types of random trees; r8s-based (see [12]) and trees with random topologies. The trees generated by r8s are binary, but by contracting edges, we can get trees of arbitrary degree (contracting an edge e connecting nodes u and v means removing u and e and attaching the rest of u's edges to v). Each edge is contracted with a probability that is inversely proportional with its length, i.e. a short edge has a higher probability of being contracted than a long edge. The trees with random topology are generated by adding leaves one by one, starting with a tree of size 2. A leaf can be added by attaching it to a random inner node or by spliting a random edge with a new node, to which the leaf is attached.\nFigure 3 Experimental running times. The running time of the algorithm for worst case trees, d-ary trees and random trees. The lines plots the polynomials c. ni, where c is a fitted constant and i ∈ [1, 4]. The two bottommost plots are in log-scale on both the x- and y-axis. The running time for worst case input trees (as described in the previous section) is O(n3), because such trees have O(n) internal nodes and min{id, id'} is O(n). This is supported by the first graph in Fig. 3, which shows that the plot of the polynomial n3 (representing the best sum-of-squares fit of the polynomial c·n3 to the data-points) is closest to the plot of the running times with regard to slope.\nThe running time on the algorithm on d-ary trees is O(n2d MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6gaUnaaCaaaleqabaGaeGOmaidaaaGcbaGaemizaqgaaaaa@309B@). The plots of the running times in the second graph are parallel, and one of them is plotted directly on top of a plot of the polynomial n2 (here c·n2 is fitted to the data-points for each d separately; the different colors match the d colors). This supports that they all have a running time of O(n2) for fixed d's. The graph also shows that higher degrees give lower running times, which is also expected. The reason why the algorithm is more than twice as fast on 6-ary trees than it is on binary trees, is that the number of internal nodes in 6-ary trees is less than in binary trees, and even though |V| is O(n) in both cases, that difference has an impact on the running time. The last graph shows the running time of the algorithm on trees created as either random trees (each topology is equally likely) or trees simulated using r8s (with edge contraction as described above). We have no theoretical running time for this data, but the graphs show that the running time is O(n2). Even though the plotted data is only a small random sample, this indicates that many pairs of trees actually have the property that min{id, id'} is O(dmin2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGKbazdaqhaaWcbaacbiGae8xBa0Mae8xAaKMae8NBa4gabaGaeGOmaidaaaaa@333E@). Therefore it is not unreasonable to expect that our algorithm runs in time O(n2) on trees used in practice. All experiments were performed on a standard PC (Pentium 4, 3 GHz, 1 Gb Ram) running Linux Fedora Core 3.\n\nComparison with existing algorithms\nIn Fig. 4 we compare the running time of the new algorithm with the O(n2d2) and O(n3) time algorithms from [11] on random and r8s simulated trees. In Fig. 5 we compare the running time of the new algorithm with the other two algorithms on Buneman and refined Buneman trees built for a range of Pfam [13] derived distance matrices using the tool in [14]. Buneman and refined Buneman trees are not binary unless this is well supported by the input distance matrix, and thus represent the kind of trees which can only be compared by methods which allow for trees of arbitrary degrees. In both experiments, the O(n3) time algorithm is slowest by a large margin for all plotted sizes of n. The new algorithm is consistently faster than the O(n2d2) time algorithm for the r8s (with edge contraction) simulated trees and for the Buneman and refined Buneman trees. For random trees the previous O(n2d2) time algorithm is slightly faster in practice. This difference is most likely caused by the additional overhead of precomputing the sums used by the new O(n2d) time algorithm compared to the previous O(n2d2) time algorithm in order to improve the asymptotic worst case running time (see method section). For trees of low degree, the overhead might dominate the factor d by which the worst case running time of the new algorithm is improved. The observed running times on random trees thus indicate that over selection of random trees consists of trees of low degree, whereas the r8s simulated, Buneman, and refined Buneman trees are trees with a few nodes of high degree which more than compensate for the additional overhead of dealing with nodes of low degree. In conclusion, we find that the experimental comparison of the new algorithm with the previously developed algorithms indicate that the new algorithm not only improves on the theoretical asymptotic running time, but also improves the running time in practice if the input trees contain a few nodes of high degree.\nFigure 4 Comparisons with earlier algorithms on random and r8s trees. The running time for the new algorithm compared to the existing O(n2d2) and O(n3) time algorithms for random and r8s trees. The lines are fitted polynomial c. n2, for the case of the new algorithm (denoted n2d in the legend) and the O(n2d2) algorithm, and the polynomial c. n3 for the O(n3) algorithm. The plot is in log-scale on both the x- and y-axis.\nFigure 5 Comparisons with earlier algorithms on Buneman and refined Buneman trees. The running time for the new O(n2d) time algorithm compared to the existing O(n2d2) and O(n3) time algorithms on the Buneman and the refined Buneman trees for range of Pfam based distance matrices. The plot is in log-scale on both the x- and y-axis."}