> top > docs > CORD-19:9e7bb369224888a04493301fe9b7e1d4e31531c1

CORD-19:9e7bb369224888a04493301fe9b7e1d4e31531c1 JSONTXT

Phylogenetic Trees Abstract Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic. Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic. Over the past few years, epidemiological models for infectious diseases have incorporated network structure: each individual has a set of contacts to whom they can pass the PLOS Computational Biology | infection. However, collecting data to develop a good representation of the network is very challenging. The increasing availability of sequence data provide additional information: previous work has shown that it is possible to recover social network features from viral phylogenies. Until now, however, it was assumed that within-host evolution is negligible in the reconstruction. Here, we propose an approach based on approximate Bayesian computation to infer network structure from a virus phylogeny, explicitly including a model for within-host evolution. In addition, we incorporate other important heterogeneity factors such as individual-based transmission rates and infectivity varying by diseasestage. We show that in some situations the within-host virus diversity adds valuable signal to identify network structure, but in other situations it muddles the underlying contact structure. Inference of Transmission Network Structure from HIV Phylogenetic Trees PLOS Computational Biology | a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Infectious diseases that are directly transmitted spread over contact networks, where each individual host can be represented by a node with a finite set of contacts (edges) via which they can transmit the pathogen. The structure of these networks is a major determinant of the pathogen transmission dynamics and possible control strategies [1] . For example, it has been suggested that human sexual contact networks are characterized by a power-law-like degree distribution [2] [3] [4] which, in a specific range of the scaling exponent, results in an infinite variance of the network's degree distribution. This implies the absence of an epidemic threshold, making prophylactic strategies for sexually transmitted diseases very challenging. The main issue with contact network-based epidemiology has been the difficulty of collecting individual-and population-level data needed to develop an accurate representation of the underlying host population's contact structure. This has led to an interest in methods to infer information about host contact networks from epidemiological data. Previously, Britton and O'Neill [5] estimated the parameters of an Erdős-Rényi network and a stochastic epidemic process on it using recovery times of infected hosts, and Groendyke et al. extended the approach to exponential-family random graph models [6] using covariate information [7] . The use of other common epidemiological measures such as the basic reproduction number (R 0 ), epidemic peak size, duration and final size, has been shown to be effective in classifying the degree of heterogeneity in a population's unobserved contact structure [8] . During the course of an epidemic, the pathogen spreads over a subset of edges in the social network forming a subgraph that is the realized transmission history. Keeping track of who transmits to whom and assuming that every individual may be infected only once and by only one other individual, such a transmission history can be represented as a rooted tree (transmission tree) [9] . However, full transmission histories are rarely observed and commonly available epidemiological data such as diagnosis-recovery times of infected people may provide information on who was infected, when, and for how long, but it provides limited information on who acquired infection from whom [10] . Since pathogens evolve over a transmission history, the analysis of pathogen genetic sequences taken from different hosts provides a way to infer the most likely donor and recipient [11] introducing constraints on the space of possible transmission trees, which are a trace of the underlying contact network. Phylodynamics [12] focuses on linking methods of phylogenetic analysis with epidemiological models under the assumption that if the evolution of a pathogen occurs sufficiently fast, transmission histories become "recorded" in the betweenhost pathogen phylogeny (phylogenetic tree). Phylodynamic analyses of HIV-1 have shown that asymmetry in viral phylogenies may be indicative of heterogeneity in transmissions [13] ; networks with more heterogeneous degree distributions yield transmission trees with smaller mean cluster sizes, shorter mean branch lengths, and somewhat higher tree imbalance than networks with relatively homogeneous degree distributions. However, it has been argued that these direct effects are relatively modest for dynamic networks [14] or if only a small fraction of infected individuals are sampled [15] . Also, factors other than contact rate, such as high infectiousness during acute infection, may have a more dramatic impact on asymmetry [15] . However, previous studies as well as more recent papers [16] [17] [18] , assume that the unobserved transmission tree is identical to the reconstructed time-scaled phylogeny (virus genealogy), i.e. the internal nodes of the genealogy correspond to transmission events between hosts over time and within-host diversity is fundamentally ignorable. This is unrealistic since all coalescent events in a pathogen phylogeny occur within hosts, pushing the genealogy node heights further back in time than the nodes of the transmission tree, known as the pre-transmission interval [19] . In addition, the order of coalescent events may not correspond to the order of transmission events but rather reflect within-host dynamics [20] [21] [22] . The objective of this study was to include within-host evolution, disease stage, and individual specific transmission rates to improve the realism of social network reconstruction. We simulated epidemic spread on three prototypic network types and investigated the behavior of several tree statistics, including both topological imbalance measures and tree-based distance measures. In addition, we investigated the effect of varying epidemic size, varying sampling proportion, as well as heterochronous sampling on the tree statistics. Finally, we analyzed data from two different epidemiological scenarios of spread among injecting drug users (IDU) in the Swedish HIV-1 epidemic using approximate Bayesian computation (ABC) for network model choice and parameter inference following the algorithm defined in [23] . We found that virus geneaologies can differ from the underlying transmission tree in both topology and branch length and, therefore, that meaningful inference of social networks needs to take within-host evolution into account. Networks. We considered three prototypic network models to represent population contact structure: the Erdős-Rényi (ER) random graph [24] , the Barabási-Albert (BA) graph [25] and the Watts-Strogatz (WS) graph [26] with low rewiring probability (Fig 1) . These three networks are characterized by different degree distributions and amount of clustering. The degree of a node in a network is the number of connections to other nodes it has and the degree distribution is the probability distribution of these degrees over the whole network. The ER model generates networks with Poisson degree distributions (in the limit), i.e. pðkÞ ¼ e À y y À k k! . The BA model is generated by using a linear preferential attachment algorithm that produces scale-free networks with a power-law degree distribution p(k) / k −α with α = 3. The WS model has a Dirac degree distribution centered at K (all nodes have the same degree) when the rewiring probability tends to 0. If the rewiring probability tends to 1, the degree distribution is Poisson. For intermediate values, the shape of the degree distribution has a pronounced peak at k = K and decays exponentially for large |k − K|. A WS network is characterized by a relatively homogeneous structure, as all nodes have more or less the same degree, and by a high degree of local clustering as opposed to ER and BA networks. The networks were simulated using the igraph package in R [27] . Epidemic model. We simulated outbreaks from a susceptible-infected-removed (SIR) type dynamic [28] of HIV-1 spread in the susceptible population on each type of contact network. We compared differences among the transmission trees obtained by simulating epidemic spread under four increasingly more realistic transmission (S1 Fig) The first model specification assumed that the rate of transmission per contact between a susceptible and an infected individual, λ, is constant over time. The removal rate of infected individuals γ is also constant over time and includes both diagnosis and death. We denoted with p the probability of being sampled at the moment of diagnosis (DNA sequences obtained from the virus of the diagnosed patient). We assumed that (i) diagnosis coincides with treatment start, (ii) the rate of transmission after treatment start is negligible, and (iii) nobody goes off treatment. We believe these assumptions to be reasonable for our analysis since Sweden has already achieved the 90-90-90 target set by UNAIDS in 2014 [29] according to which (i) 90% of all people leaving with HIV will know their HIV status, (ii) 90% of all people with diagnosed HIV infection will receive sustained antiretroviral therapy, and (iii) 90% of all people receiving antiretroviral therapy will have viral suppression [30] . In the second model specification we considered three stages of HIV infection (acute, chronic, and pre-AIDS). Here, the transmission rates are dependent on the disease stage of the infected individual and denoted λ 1 , λ 2 , and λ 3 (S1 Fig). We assumed the removal rate to be independent on the disease stage. The acute stage was assumed to last 30 days for each individual [31] , the chronic stage had variable length described by an exponential random variable T 2 with a mean of 8 years [32] , and the pre-AIDS stage lasted until death or diagnosis. The three transmission rates were calculated to preserve the individual total infectivity during their infectious period in order to make results comparable with the first model specification. This derivation is shown in S1 Text. In the third model specification we modeled individual variability of transmission rates. We did that by multiplying the constant transmission rate λ (as in the first model specification) with a log-normal variable Z i for each i individual (node) with location parameter −σ 2 /2 and scale σ in order to preserve the mean of λ (i.e. E(λ i ) = λ since E(Z i ) = 1). Finally, the fourth model specification combined stage-specific infectivity with individual heterogeneity. The sampling process was modeled explicitly in each model specification. We used Gillespie's next-reaction method [33, 34] to simulate disease spread according to the above outlined model specifications until there were no more infectives or until a predefined number of samples. Keeping track of who-infects-whom, each epidemic simulation yields a transmission history. The phylogeny of pathogens such as HIV-1 collected from infected persons in an epidemic reveals a considerable amount of information about the underlying transmission history since mutations are typically accumulated faster than transmission occurs. The common assumption that the internal nodes of a phylogeny correspond to transmission events between hosts over time is, however, unrealistic because transmitted lineages must already exist in the donor at the time of transmission. Thus, neglecting the time difference between the common ancestor and the transmission event (i.e. the pre-transmission interval, [19] ) will bias the estimated time of transmission backwards in time. Furthermore, new infections may come from HIV-1 variants derived from a latent reservoir (lineages can persist for long time in the host [35, 36] ), and the order of coalescent events may not correspond to the order of transmission events but reflect instead withinhost dynamics [21] . To address these issues, we used a two-phase coalescent model described by a linear growth from a single transmitted variant (transmission bottleneck) to a maximum population size followed by either stabilization or decline of the effective population size [21] . Let N(t) denote the viral population size at time t since infection (expressed in days), such that where α 1 is the population size (i.e. the number of virus variants in a given host) at the moment of infection, β 1 is the rate of population size increase until t x (time at maximum diversity), and β 2 the rate of decline after the maximum. We assume α 1 = 1, β 1 = 3, t x uniformly distributed between t a = 2 and t b = 8 (years) and Virus genealogies conditional on a transmission history are simulated by generating random coalescence times for each person in the tree. Random coalescence times are generated from the inverse cumulative density function (derivation in [21] ) where u is a uniform random variate on (0, 1), t is the current time along the forward time axis, b is the linear rate of change (β 1 or β 2 depending on the phase), a is the starting population size (1 in the first phase and β 1 t x in the second phase) and k is the number of extant sampled lineages in a given host. For each host we draw random values of t x and β 2 from the prior distributions. Starting at the last transmission or sampling event, we first move to the next event along the reverse time axis, which is either a transmission event, a rate change, or the time at which the current host was infected. If the event is a transmission event, then k is incremented and a random coalescence time is generated. If that time occurs before (along the reverse time axis) the next event, then two random extant lineages in the sample are selected to coalesce; or if not, then time is moved to the time of the next event. At t x , when the rate changes, the parameters of the inverse cumulative density function are changed to correspond to the first model-phase and the process continues until the transmission time of the current host is reached. In the rare instance where more than one sampled lineage exists at the time of infection, the existing lineages are randomly coalesced with zero length branches. Finally, each individual sub-tree is joined into a single viral genealogy according to the transmission history. The 4 model specifications introduced in the previous section were used for simulations until the "end" of each outbreak, i.e. when there are no infectives left. We compared outbreaks of similar final size and multiple realizations of virus genealogies for each transmission history. All simulations were implemented using the statistical software R [37] . Transmission trees and virus genealogies are complex objects. Therefore, in order to evaluate and compare them we used a number of summary statistics (Table 1) . These tree statistics include topology measures, branch length summaries, and lineages through time progression. The Sackin's index can be normalized according to a reference model (we used the Yule model) in order to obtain a statistic that does not depend on the tree size. Both Sackin's index and Colless index depend only on the topology of the tree, and they are invariant under isomorphisms and relabeling of leaves. They reach their maximum value at caterpillars (ladder-like trees), and their minimum on the maximally balanced trees. A binary tree is considered to be perfectly balanced if each internal node in the tree divides the leaves descending from it into two equally sized groups. The expected number of cherries in a tree with n taxa under a Yule model is n/3. In an asymmetric tree (more ladder-like tree), tips tend to coalesce with branches deeper in the tree, and there are fewer cherries than expected. The number of cherries and Sackin's index complement each other well, as the number of cherries captures asymmetry in the recent evolutionary past, while Sackin's index captures asymmetry over the entire evolutionary history of the sample. These two measures are only weakly correlated [15] . A high ratio of internal branch to external branch length occurs in 'star-like' trees. The tree height in a virus genealogy represents the time from the first infection to the last sampling event. Since epidemics progress at different speed on different networks, heterogeneities in tree heights are expected. Sackin's Index [38] The average number of splits or ancestors from a tip to the root of the tree. Colless Index [39] At each internal node, partition the tips that descend into groups of sizes r (to the right) and l (to the left), and compute the sum of absolute values |r − l| for all nodes. The number of clades with two taxa. The ratio between the mean external branch length (branch that ends with a sampling event) and the mean internal branch length (branches between coalescence events). The time from the first infection/coalescence (after the index case) to the last sampling event. Topological distance Twice the number of internal branches defining different bipartitions of the tips. Number of lineages through time [40] The number of lineages across the tree. If all infected individuals are sampled, it corresponds to the prevalence curve. Branch length growth rate over time The average branch length obtained at several time points, divided by the tree height evaluated at the same. doi:10.1371/journal.pcbi.1005316.t001 The topological distance was obtained as twice the number of internal branches defining different bipartitions of the tips. A topological distance that takes branch lengths into account was also considered (the sum of the branch lengths that need be erased to have two similar trees.) The number of lineages through time was normalized in time and by the maximum number of lineages [40] . We used the R package ape (Analyses of Phylogenetics and Evolution) [41] to create and plot the phylogenies and the package apTreeshape [42] for the evaluation of some tree statistics. To further investigate how well time-scaled phylogenies can estimate the epidemic process and identify the underlying contact network, we applied an inference framework for model selection and parameter estimation based on approximate Bayesian computation (ABC). ABC is a methodology to estimate model parameters replacing the likelihood function with a simulation-based procedure and a distance function to measure the similarity between simulated and observed data. Various ABC algorithms have been proposed, from the simple ABC-rejection [43] to ABC Markov chain Monte Carlo (MCMC) [44] and ABC based on sequential Monte Carlo (SMC) methods [23, 45] . Here, we use ABC-SMC as proposed by Toni et al. [23] because it addresses some of the potential drawbacks of previous ABC algorithms, such as slow convergence rate, by sampling from a sequence of intermediate distributions. The SMC sampler introduces a number of intermediate steps decreasing iteratively the tolerance threshold for samples acceptance. At the first iteration, N particles θ 0 (representing the parameters of interest) are generated from the prior distribution and data are simulated from the model based on θ 0 . The proposed parameters are accepted if the difference between the summary statistics of the simulated data D 0 and the observed data D is below the threshold 1 . At iteration t > 1, the particles are drawn from the previous population of the accepted samples at the iteration t − 1 (with threshold t−1 ) with slight perturbations. In our work, data (observed virus genealogy) and simulated trees are compared through the use of summary statistics which correspond to the above listed tree statistics ( Table 1) . The three network models M = {WS, ER, BA} were used to simulate outbreaks using the stage-varying infectivity profile with ratio 10:1 acute:chronic and patient infectivity variation (σ = 3). We assumed that network model and one network parameter were unknown. For ER, the network parameter of interest was the probability of drawing an edge between two arbitrary vertices; for BA it was the number of edges to add in each time step of the generating algorithm, and for WS it was the neighborhood within which the vertices of the lattice are connected. We also estimated the removal rate γ and the infection rate in the acute phase λ 1 (infection rates in the chronic and immuno-compromised stage can be obtained deterministically from the acute phase infection rate). Therefore, θ consists of 3 parameters for each type of network and they are model specific. All remaining parameters characterizing both the network structure and the epidemic process were considered known. The output of the algorithm were the approximations of the model M marginal posterior distributions P(M|D) which is the proportion of times that each model is selected in N samples, and the marginal posterior distributions of parameters P(θ|D, M) for the candidate models. We used a discrete uniform distribution from 1 to 3 as model prior π(M). We chose to decrease the tolerance values following an exponential decay such that t = 0 exp(−0.5t) where t is the current sequential step, as proposed in [46] . A pilot run of 100 simulations for each model in M was used to define the initial thresholds. Convergence was assumed when the acceptance rate of newly proposed particles had dropped below 1 in 100, and visual inspection of the posterior distribution showed no change in the last iterations. We found that convergence was achieved with T = 10 iterations and N = 1000 particles per iteration. The prior distributions on the parameters λ 1 and γ were Uniform (0.0001, 0.1) and (0.00025, 0.1), respectively. The computation time of the algorithm depended on the tree size and sampling fraction and it took between 1 and 2 hours in a parallel implementation on 8 processors. Further details of the algorithm and its computational cost can be found in S2 Text. We applied the ABC inference method to the analysis of two HIV-1 DNA sequence sets sampled from different IDU transmission epidemics in Sweden [47, 48] . To reconstruct the timescaled virus phylogenies from the DNA sequences we used a Bayesian Skyline coalescent model in BEAST 1.8 [49] . The general time reversible nucleotide substitution model was used with an uncorrelated log-normal relaxed clock and a discretized gamma distribution with four categories was used to model rate heterogeneity across the sequence. For the log-normal relaxed clock parameters, a uniform prior on the positive axis was assumed for the mean, and an exponential with mean 1/3 for the standard deviation. A Uniform prior on (0, 1) was used for the nucleotide frequencies. The MCMC was run for 10 million iterations, with a 10% burnin period and samples saved every 10000 iterations. We selected the maximum credibility tree and the negative branches were set equal to zero. To compare our simulation results, we used networks with the same mean degree (8) and a constant rewiring probability (ρ = 0.01) for the WS networks. The SIR-type models were characterized by a transmission rate of 0.01 per contact, a removal rate γ = 2.8 year −1 and sampling probability p = 1, unless specified otherwise. The within-host model generates virus genealogies that are consistent with a given transmission history, but not necessarily identical to it. An example of the impact of the within-host evolutionary process in a small size network/epidemic is shown in Fig 2. This figure shows a transmission history (A), its transmission tree representation (B), and four compatible virus genealogies (C-F). The genealogies display branch elongation/compression as compared to the transmission tree but also changes in topology. For instance, lineage 5, sampled in individual 5 infected by 2 soon after 2's own infection, appears consistently on the top part of the simulated virus genealogies and its branch can only be elongated by a small amount (because the pretransmission interval is small). On the other hand, lineage 10, infected later by 2, is located in different parts of the possible genealogies, thus indicating changes in the virus genealogy topology versus the underlying transmission tree. This happens because longer time implies an increase in the virus diversity in 2, i.e. more lineages are available in the donor. Therefore, as many virus trees are possible under any transmission history, it is important to evaluate the additional variation within-host diversity inflicts on the epidemiological inference. An epidemic can spread faster on ER and BA networks, thus the resulting transmission tree from a WS network includes longer times resulting in taller trees (Fig 3A) . This is mainly because WS has higher clustering than ER or BA. Both the unobservable transmission tree and the observable virus genealogy show the same tree height information. Other tree statistics, however, show different patterns of network discrimination based on transmission tree or virus genealogy. The proportion of cherries per taxa is slightly less informative on virus genealogies than on transmission trees (Fig 3A) . In particular, while there is a decrease for ER and WS (less balanced in virus genealogies than transmission trees), it increases for BA (more balanced in virus genealogies than transmission trees). A similar pattern is seen using Sackin's Index or Colless' Index (ER and WS less balanced in virus genealogy, BA more balanced (Fig 3D and 3E) ). Overall, differences between BA and WS become more evident in virus genealogies. Because Sackin's Index and Colless' Index are highly correlated we will only report Sackin's Index from now on. An epidemic spreading on an ER network is similar to a population random mixing model. Therefore, it is expected to generate a balanced transmission tree [13, 15] . The average number of people infected by an individual in the ER network show little variance and therefore the within host evolution model will add some heterogeneity producing small changes in the tree topology leading to an increase in the unbalancedness in the resulting virus genealogy. In BA models, instead, the presence of superspreaders generate transmission histories that are very unbalanced [13] . When a donor infects two or more recipients within a short interval, the order of transmissions along with infection times become impossible to accurately reconstruct; all splits are within the donor, describing within-host evolution in the donor (also shown in [21] ). Overall, the pretransmission interval associated with each and every transmission is a random draw from the possible coalescence times in the donor's viral population. Therefore, in BA networks the virus genealogy will show larger changes in the tree topology with respect to the underlying transmission trees and result in more balanced trees. The ratio of the mean internal to external branch lengths is informative about the type of network (smallest for BA, higher for ER, highest for WS). While the trends were similar in transmission trees and virus genealogies, the expected ranges overlapped for ER and WS in transmission trees, and virus genealogies showed generally smaller ratios (Fig 3C) . Branch lengths increase linearly as a function of tree height during epidemic spread on both ER and BA networks. Deviations from linearity are observed for epidemic spread on WS (S3 Fig). At the end of an epidemic, the mean branch length is constant among networks but longer in virus genealogies rather than in transmission trees (Fig 3F) . Overall, trees from ER and WS networks are more imbalanced based on virus genealogies. However, as an epidemic spreads much faster in a BA network, the resulting virus genealogy will instead become more balanced because the virus does not have time to evolve time-structure between transmission events. Infectivity is known to vary across HIV-1 pathogenesis [50, 51] . Thus, rather than assuming a constant transmission rate throughout an infected person's disease stages, we tested 7 different infectivity profiles varying the ratio between the acute and chronic transmission rates and measured how they affected network model discrimination. The transmission rate in the pre-AIDS stage was held constant. Tree height becomes much less informative of network type the bigger the difference is between acute and chronic stage infectivity (Fig 4A) . This is because higher acute stage infectivity causes more infections in the acute phase and consequently the epidemic spread is faster. Since infection happens so rapidly, the external branches become very long compared to the internal branches, as all internal nodes are pushed to the root the higher the ratio between acute and chronic stage infectivity. Therefore, the total tree height is dominated by the external branch lengths (which are on average 1/γ). Similarly, mean internal over external branch lengths, which was an important index when constant infectivity was assumed, is less informative if we assume high acute/chronic stage infectivity ratios. Differences observed between ER and BA assuming a constant infectivity profile diminish (Fig 4D) . Hence, branch length and tree height measures are less informative of network type when differences in acute-chronic infectivity are considered. Topological tree measures, i.e., cherries per taxa, and Sackin's Index, were less affected by differences in acute-chronic infectivities ( Fig 4B and 4C ). Both these measures, calculated on the possible virus genealogies, still informed about the underlying contact network structure that HIV spread upon. The next stage of introducing realistic host evolutionary dynamics is to model patient specific differences. We did that by introducing variability in the overall infectivity level while keeping the acute-chronic ratio at 10:1 (σ = 0, 3, 10). While it was clear that introducing a non-constant infectivity profile diminished genealogical differences between underlying contact network structures, it was less obvious what effect introducing between-patient variation had (Fig 5) . While tree height remained with no power to discriminate between networks, internal to external branch length ratios became more discriminative (BA had lowest ratio, ER intermediary, and WS high). Individual variability seems to affect tree symmetry near the root more than towards the tips, since the Sackin's Index shows much more variation than the number of cherries per taxa. However, they both improved their power to discriminate between contact network structures, and Sackin's Index could differentiate WS from BA and ER networks. Thus, these simulations showed that the complex interactions that affect the resulting tree statistics when multiple levels of variability interact (within-host coalescence process, timing of infections relative to disease-and epidemic-stage, disease-stage infectivity differences, and patient individual variation), may induce non-trivial dynamic patterns. If no influx of susceptibles occurs, the mean branch length increases as trees grow taller because it takes longer time to find uninfected hosts later in an epidemic (Fig 6A and 6B , see also S4A Fig) . At 100% sampling of infecteds at any time during an epidemic, the mean branch length increases as a function of total number of sampled infecteds (number of taxa, Fig 6B) . BA typically produces shorter tree branches than ER and WS as more individuals are sampled. Thus, if it were possible to sample everyone at time of infection, then the trend of adding longer tips towards the end of the epidemic becomes more pronounced (Fig 6A) . The internal to external branch length ratios typically decrease as the epidemic progresses (S4B Fig) . This is mainly explained by the depletion of susceptible neighbors for individuals infected late in an epidemic, thus generating very long external branches. In addition, branches added later in the epidemic, resulting from chronic donors, divide already existing branches into shorter segments. BA trees show lower ratios than ER and WS throughout the epidemic, but WS and ER are less distinguishable during an epidemic. On the topological level, the Sackin's Index typically decreases as an epidemic matures (Fig 6C) . At the end of an epidemic (Fig 3) , BA and ER show more unbalanced trees throughout an epidemic and the most imbalanced trees come from WS networks (Fig 6C and 6D) . Simulations on networks of size 5000 show similar results: for comparison, see Fig 6 with S5 Fig and S4 Fig with S6 Fig. Thus , while these statistics are indicative of the underlying contact network, they are confounded by epidemic stage and the size of the susceptible population. Consequently, to be able to infer the underlying contact While a genealogical tree grows as an epidemic matures, the sampling fraction has no real effect on mean branch length, albeit with smaller sample fractions the estimation becomes somewhat more uncertain due to stochastic effects (S7 Fig). Interestingly, lower sampling fraction increases mean branch lengths derived from any underlying contact network (S7 Fig). This happens because the remaining branches in the genealogy represent increased numbers of infected hosts. However, this effect does not cause additional confusion over that caused by epidemic stage, as the differences between BA, ER, and WS networks are distinct at all epidemic stages and number of infected. On the other hand, we do not usually know at what stage an epidemic is (i.e., number of actually infected) but only the number of sampled hosts. The mean branch length as a function of number of taxa (Fig 7) could mislead the inference of underlying contact network, especially for small sample fractions. In fact, any branch length or tree height index would be affected by mistaking number of sampled hosts with stage of the epidemic because the number of infected grows faster than the number of sampled early in an epidemic. The topological indices were also affected by sampling fraction. While general trends (Fig 3) remain constant through the cumulative number of samples over an epidemic, it is again important to know at what stage an epidemic is at time of sampling. Similar to branch length indices, topological indices can be misleading if sampling faction and stage of the epidemic are unknown. We illustrate the performance of the ABC inference on 100 simulated viral genealogies for each network type of size 1000. The parameters were chosen so that the mean degree of each network type was 8, the diagnosis rate was 2.8 years −1 (derived from the average time from seroconversion to diagnosis in Sweden estimated in [52] ), and infection rate in the acute phase was λ 1 = 0.005. The sampling probability p was set at 0.5 because the data from the general HIV Swedish epidemic have coverage of around 50% [52] [53] [54] . To investigate model selection performance of the ABC algorithm, we record the number of times that the true model has the highest posterior model probability P(M|D) among the three models for the 100 simulated datasets. The algorithm was able to discriminate among the network models quite well. For the first network model (ER), in 78 out of the 100 simulated datasets, the true model had the highest posterior probability among the 3 different network types. For the second model (BA), similar results were obtained; 76 out of the 100 simulated datasets identified BA. Outbreaks on the WS network were misclassified only 1 time out of 100. The corresponding network parameters were estimated reasonably well in most cases ( Table 2) . For illustration, we report the results of a randomly chosen experiment where the observed data were obtained from an epidemic spread on an ER network ( Table 2 ). The parameters to estimate are the mean degree, diagnosis and infection rate for an outbreak on an ER network. The estimation of the removal (i.e. diagnosis) rate was sometimes skewed towards the upper bound, which probably is due to branch elongation induced by the within-host evolution model. We only estimate three parameters per network/epidemic model. In principle, it would be possible to add the rewiring probability, ρ, of the WS network in the ABC inference and estimate it. However, the rewiring probability of the WS networks turns out to be quite difficult to identify in the model choice setting. This is because for large values of ρ a WS network becomes indistinguishable from a ER network S1 Table. At ρ < 0.1 there was typically still a good chance (P(M = m|D) = 0.70 − 0.95) to identify the correct network structure m. Inference of epidemic parameters as well as network type becomes more complex in real outbreaks. We consider two genealogies from separate IDU-associated HIV-1 CRF01 and subtype B epidemics in Sweden, respectively. The CRF01 tree was sampled from a rapid outbreak that was imported from Finland [48] around 2003, which was quiescent until the outbreak started in 2006. The subtype B tree was sampled from the more slowly, and typical, spreading IDU epidemic in Sweden [47] . While tree indices were different between the trees from the Swedish HIV-epidemic (Fig 8) , and superficially in line with what one might expect comparing an outbreak scenario to a more endemic situation, e.g., mean branch lengths were 279 and 913, and tree height 4176 and 10527, respectively, they cannot be directly compared because these trees represent different stages in the respective epidemic. Furthermore, real data is rarely sampled at 100% of all infected or even diagnosed, so comparisons to our simulated overall network differences are difficult to evaluate. Thus, to evaluate genealogies from real epidemics we must consider epidemic stage and sampling fraction (Figs 6 and 7) . In the ABC analysis of the two IDU HIV-1 transmission chains among IDU we have assumed the same epidemiological model as in the simulations (stage-varying infectivity profile with ratio 10:1 acute:chronic, patient infectivity variation (σ = 3)) and a sampling fraction of 50%. For the CRF01 IDU outbreak, the susceptible population was assumed to be 200 and in the Swedish subtype B ongoing epidemic it was set to 3000. Results are shown in Table 3 . Overall, convergence was more difficult to achieve in the analysis of the real data and the tolerance levels had to be set to higher values than in the simulation studies. However, the posterior model probabilities seem to indicate that the two outbreaks display different associations to the three network models considered even though there is no single model (among the three network models considered) that can be used to appropriately describe each outbreak. For the CRF01 outbreak, there is weak evidence (45%) in favor of the BA network type although the ER was also supported with a posterior probability of 39%, with small differences in the parameter estimates. The infection rate in the acute phase was λ 1 = 0.018(0.009, 0.031) vs λ 1 = 0.023(0.0015, 0.036), γ = 0.001(0.0006, 0.002) vs γ = 0.001(0.0005, 0.002) and the mean degree was 3.2(2.4, 3.6) vs 3.7(2.2, 4.1) in the BA and ER respectively. The HIV-1 subtype B outbreak was mostly (57%) associated with an ER network type. However, there was considerable uncertainty in the parameter estimates: λ 1 = 0.0025(0.001, 0.004), γ = 0.0003(0.0003, 0.001) and the mean degree 1.4(1.3, 2.1). In this study we addressed several outstanding factors that could affect HIV phylogenetic tree shape in addition to the underlying contact network upon which HIV spreads. While previous studies have evaluated how contact networks affect the resulting tree [13, 14, 16] , they ignored differences between transmission trees and virus phylogenies, varying infectivity over disease progression, among patient infectivity variation, sampling fraction, and epidemic stage. Here, we show that all these factors put further restrictions on what type of phylogeny one can expect, but also that these additional factors may confound the inference of contact network. Transmission histories are not perfectly reconstructed by virus phylogenies. In fact, it has been previously shown that virus phylogenies have a time bias that elongates external branches, shifts internal nodes backwards, and may cause lineage disordering relative to the transmission history [19] . In this study we account for these factors by sampling (many) possible virus genealogies from a transmission history using a recently developed within-host coalescent model [21] . Because many virus genealogies may be consistent with one single transmission history, one would expect this factor to add uncertainty to the network inference. However, there is also added signal about transmission times because the within-host diversity changes over the time of infection. Thus, because the degree distributions are different for each network type (Fig 1) , transmissions happen after different lengths of infection time, which affects the phylogenetic tree shape. Indeed, we show that the contact network inference from virus genealogies can be quite different than that from transmission trees, and that tree balance differences in fact may be more informative using virus phylogenies. Besides, transmission histories or trees can never be observed, or only partially and then with great uncertainty, which is the main reason for turning to phylogenetic reconstruction in the first place. It is well known that infectivity is not constant over disease progression, albeit the literature is uncertain about how big the difference is between acute and chronic stage infectivity [51, 55] . Indeed, we find that varying infectivity affects the expected phylogeny under different contact networks. In fact, this factor alone seems to diminish phylogenetic differences between contact networks. Somewhat surprisingly, patient variation in infectivity works in the opposite direction, i.e., it seemed to amplify differences in the contact network structures. The result is that virus genealogies do carry a signal of what type of contact network HIV spread upon, but the expectations are different than what one would expect from a naïve model where no virus diversity exists and all hosts are described by an identical constant infectivity over their pathogenesis. We do not investigate systematically if there is one factor that explains the expected genealogies simulated under the different network assumptions. Rather, the complex dynamic interaction of the heterogeneity factors included implies that a single cause for the differences observed may not exist. However, we compare only epidemic spread on three network types with the same mean degree and we used tree statistics to assess the differences. In certain settings, ER and BA behaved similarly (and differently from WS). In this case, it is likely that the difference is due to clustering, since there is no large difference between the degree distributions in the ER and WS, whilst WS is the only network (among the three considered) allowing clustering. In some other scenarios, WS and ER behaved more similarly (and differently from BA). This is most likely due to the different degree distributions. We show that any tree index that one would measure is affected by sampling fraction and the stage of the epidemic. We show that phylogenies cannot be meaningfully interpreted without this additional knowledge, as tree statistics otherwise may mislead the inference of contact network. While our results relate to epidemic situations relevant to HIV epidemics, they may also be relevant to other measurably evolving pathogens such as hepatitis C and influenza. The developed ABC inference framework for network identification and parameter estimation showed discriminatory power and ability to recover epidemiological parameters when applied to simulated data. The model used for validating the ABC algorithm included stage varying infectivity, individual, and within-host variability. For complex models such as epidemic spreads on networks the likelihood function is computationally costly to evaluate and ABC offered a way to perform likelihood-free statistical inference. Furthermore, the use of summary statistics allowed us to study the relationship between readily measurable tree statistics and complex transmission dynamics. The analysis of the two outbreaks from the Swedish HIV-1 epidemic showed that inference on real datasets is typically much harder. As is to be expected, real world networks do not match perfectly with the simplified models considered in this study, that were chosen for comparability with previous studies [13, 14, 16] . In fact, in the ABC algorithm, the proposed parameters values are accepted if the simulated data based on them are close enough to the observed data. If the observed data were generated from a rather different or more complex model, then the simulated data from the candidate model probably will be far away from the observed data. Hence, very few proposed parameter values will be accepted. More realistic models, such as dynamic networks, may be able to better capture the features of the outbreaks, especially those occurring over a long period of time. Another class of network models that could be suitable in modelling these outbreaks is the configuration model which is flexible and has been studied extensively in the literature, or exponential random graph models (although in general slower to fit to data) which include a broader spectrum of degree distributions and clustering levels rather than the three simple networks considered here. The ABC inference scheme can also be extended to take into account uncertainty in the phylogenetic reconstruction, as shown in [56] . Each summary statistics calculated on the simulated trees would then be compared to the distribution of the same statistics calculated on the posterior distribution of virus genealogies. Romero et al. [56] apply this procedure investigating transmission between a heterosexual couple. However, the outbreaks we are analyzing in this paper are bigger in size and the whole algorithm would be more computationally expensive. Lastly, this work could be further extended to integrate other sources of network data coming from social surveys and/or public health intervention studies, as recently outlined in a review paper by [57] to improve network analysis in HIV epidemiology. The length of the acute phase was assumed constant, t 1 = 30 days while β was assumed to be 1/8 year −1 . We do not assume a length for the AIDS phase a priori, but if an individual reaches the third stage, he will stay in the third phase until he is diagnosed or until death occurs. (TIF) S2 Fig. Within-host evolution model. The effective viral population size is modelled as a two piece linear function: first, it grows at rate β 1 until a random peak time t x , allowed to vary among individuals between t a and t b . After t x , the viral population size decreases or stabilizes at a rate β 2 . The dashed line represents one possible realization. This figure is part of [21] .

projects that include this document

Unselected / annnotation Selected / annnotation